Mohammad Abu Sayem | Software Architect in Dhaka
Applied AI / Computer Vision | Case Study

Evaluation of 13 state-of-the-art deep learning models across thousands of images with multiple fusion strategies

Dec 2025
12–15 minute read
8+ months
Client

Internal Research Initiative (Industry-Inspired)

Location

Global

Role

Principal Software Architect

Scale

Enterprise

Enhancing Image Similarity Detection Using Multi-Model Deep Learning Architectures

Key Metrics

Models Evaluated

13

Fusion Strategies

3

Best mAP Improvement

+0.03 to +0.04

Best Silhouette Score

0.249

Lowest Davies–Bouldin Index

1.93

Statistical Significance

p < 0.001

Executive Summary

This case study documents the design and evaluation of a multi-model image similarity framework that fuses embeddings from convolutional, transformer-based, and multimodal deep learning architectures. Across 13 pretrained models and three fusion strategies, feature-level fusion improved robustness, semantic alignment, and retrieval performance over single-model approaches, with a best mAP gain of +0.03 to +0.04 (p < 0.001).

The Problem

In practical image similarity and retrieval systems, reliance on a single deep learning architecture often leads to representational bias and inconsistent performance across diverse visual and semantic contexts. During industry proof-of-concept work, single-model approaches repeatedly failed to capture both low-level visual cues and high-level semantic relationships simultaneously. This highlighted the need for an architecture-agnostic similarity framework capable of leveraging complementary model strengths.

Objectives

Primary Objective

Design a robust, extensible image similarity framework that improves semantic consistency and retrieval accuracy through multi-model fusion.

Secondary Objectives

  • Evaluate architectural biases across CNNs, Vision Transformers, and multimodal models
  • Compare multiple fusion strategies using rigorous quantitative metrics
  • Assess embedding structure and semantic organization via visualization techniques
  • Derive architectural insights applicable to production-grade systems

Solution

A feature-level fusion framework was implemented to combine normalized embeddings from multiple pretrained deep learning models. Three fusion strategies (mean fusion, weighted fusion, and concatenation fusion) were evaluated to assess trade-offs between performance, dimensionality, and computational complexity. The system was designed to be modular, extensible, and reproducible.
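
The three strategies named above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the study's actual implementation: the function names are hypothetical, and mean and weighted fusion here assume the embeddings have already been brought to a common dimension (the source does not say how differing embedding sizes were reconciled). The weights in the example are arbitrary.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def mean_fusion(embeddings):
    """Average same-dimensional embeddings, then re-normalize."""
    return l2_normalize(np.mean(np.stack(embeddings, axis=0), axis=0))

def weighted_fusion(embeddings, weights):
    """Weighted average of same-dimensional embeddings, then re-normalize."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # weights are rescaled to sum to 1
    stacked = np.stack(embeddings, axis=0)   # (n_models, n_images, dim)
    return l2_normalize(np.tensordot(w, stacked, axes=1))

def concat_fusion(embeddings):
    """Concatenate along the feature axis; per-model dimensions may differ."""
    return l2_normalize(np.concatenate(embeddings, axis=1))

# Random stand-ins for per-model embeddings of 4 images (not real model outputs).
rng = np.random.default_rng(0)
a = l2_normalize(rng.normal(size=(4, 8)))
b = l2_normalize(rng.normal(size=(4, 8)))
c = l2_normalize(rng.normal(size=(4, 16)))

fused_mean = mean_fusion([a, b])
fused_weighted = weighted_fusion([a, b], weights=[0.7, 0.3])
fused_concat = concat_fusion([a, b, c])
print(fused_mean.shape, fused_weighted.shape, fused_concat.shape)
```

The dimensionality trade-off is visible in the shapes: mean and weighted fusion keep the input dimension, while concatenation grows with every model added.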

Execution Details

Models Evaluated

  • CLIP (ViT-B/32)
  • ViT-B/16
  • ViT-B/16 (ImageNet-21k)
  • ResNet-50
  • ResNet-152
  • Inception V3
  • Inception V4
  • Inception-ResNet-V2
  • VGG-19
  • EfficientNet-B0
  • DenseNet-121
  • NASNet-Large
  • PNASNet-5Large

Fusion Strategies

  • Mean Fusion
  • Weighted Fusion
  • Concatenation Fusion

Embedding Processing

  • Pretrained feature extraction
  • L2 normalization for scale alignment
  • Optional dimensionality reduction for concatenation fusion
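
A small sketch of why the normalization step matters before fusion, and of what an optional reduction step might look like. The feature arrays are synthetic, the scales are exaggerated for illustration, and PCA is an assumed choice: the source says only "dimensionality reduction" without naming a method.

```python
import numpy as np
from sklearn.decomposition import PCA

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

rng = np.random.default_rng(0)
# Synthetic stand-ins for two extractors whose raw features differ in scale.
feats_a = rng.normal(scale=50.0, size=(200, 64))
feats_b = rng.normal(scale=0.5, size=(200, 32))

# Without normalization, the large-scale model dominates the concatenated vector.
raw_concat = np.concatenate([feats_a, feats_b], axis=1)
share_a_raw = np.sum(feats_a ** 2) / np.sum(raw_concat ** 2)

# After per-model L2 normalization, each model contributes unit norm per image.
norm_a, norm_b = l2_normalize(feats_a), l2_normalize(feats_b)
norm_concat = np.concatenate([norm_a, norm_b], axis=1)
share_a_norm = np.sum(norm_a ** 2) / np.sum(norm_concat ** 2)

# Optional reduction of the concatenated vector (PCA is an assumption here).
pca = PCA(n_components=32, random_state=0)
reduced = l2_normalize(pca.fit_transform(norm_concat))

print(f"model A share of squared norm: raw={share_a_raw:.4f}, normalized={share_a_norm:.4f}")
print("reduced shape:", reduced.shape)
```

With raw features, model A accounts for nearly all of the squared norm, so cosine similarity on the concatenation would effectively ignore model B. After normalization the two models contribute equally.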

Evaluation Methods

  • Cosine similarity-based retrieval
  • Precision@K and mean Average Precision (mAP)
  • Silhouette Coefficient
  • Davies–Bouldin Index
  • t-SNE and UMAP visualization
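
The retrieval and clustering metrics listed above could be computed roughly as follows. This is a sketch, not the study's evaluation code: the helper names are hypothetical, relevance is assumed to mean "shares the query's class label", the query itself is excluded from its own results, and the four toy embeddings are made up. The visualization step (t-SNE, UMAP) is not covered here.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score

def cosine_rankings(embeddings):
    """For each query, rank all other images by descending cosine similarity."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)            # push the query itself to the end
    return np.argsort(-sims, axis=1)[:, :-1]   # drop the query, which sorts last

def precision_at_k(ranking, labels, k):
    """Mean fraction of top-k results that share the query's label."""
    hits = labels[ranking[:, :k]] == labels[:, None]
    return hits.mean()

def mean_average_precision(ranking, labels):
    """mAP over queries that have at least one relevant item."""
    rel = (labels[ranking] == labels[:, None]).astype(float)
    cum_hits = np.cumsum(rel, axis=1)
    ranks = np.arange(1, rel.shape[1] + 1)
    n_rel = rel.sum(axis=1)
    valid = n_rel > 0
    ap = (cum_hits / ranks * rel).sum(axis=1)[valid] / n_rel[valid]
    return ap.mean()

# Toy data: two well-separated classes in 2-D (illustrative only).
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])

ranking = cosine_rankings(emb)
p_at_1 = precision_at_k(ranking, labels, k=1)
m_ap = mean_average_precision(ranking, labels)

# Cluster-quality metrics computed against the ground-truth labels.
sil = silhouette_score(emb, labels, metric="cosine")   # higher is better, in [-1, 1]
dbi = davies_bouldin_score(emb, labels)                # lower is better, >= 0
print(p_at_1, m_ap, sil, dbi)
```

On this toy set every query's nearest neighbor belongs to its own class, so Precision@1 and mAP both come out at 1.0. Real embeddings will score lower.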

Outcomes

Multi-model fusion consistently outperformed single-model baselines

Concatenation fusion delivered the highest retrieval accuracy

Weighted fusion achieved strong performance with lower dimensionality

Fused embeddings exhibited smoother and better-separated manifolds

Findings validated through both quantitative metrics and visual analysis
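
The source reports p < 0.001 but does not name the statistical test used. One common way to check whether a fused model beats a baseline on the same queries is a paired sign-flip permutation test on per-query Average Precision, sketched below. The function name, the choice of test, and the synthetic scores are all assumptions made for illustration. None of the numbers are the study's data.

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-query scores."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    # Under the null hypothesis, the sign of each paired difference is arbitrary.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # Add-one correction keeps the estimated p-value strictly positive.
    return (np.sum(permuted >= observed) + 1) / (n_permutations + 1)

# Synthetic per-query AP values for 200 queries (illustrative only).
rng = np.random.default_rng(1)
single_ap = rng.uniform(0.4, 0.8, size=200)
fused_ap = single_ap + rng.normal(loc=0.035, scale=0.05, size=200)

p_value = paired_permutation_test(fused_ap, single_ap)
print(f"p = {p_value:.4g}")
```

Pairing by query matters because per-query difficulty varies widely. Comparing differences on the same queries removes that variance from the test.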

Impact

Technical Impact

The study confirmed that combining heterogeneous deep learning architectures leads to richer and more semantically coherent image representations. Feature-level fusion mitigated individual model biases and improved generalization.

Organizational Impact

The work was shared with organizational leadership, including the CEO, who appreciated the technical depth and real-world relevance. The initiative demonstrated how applied research can directly inform architectural decisions.

Professional Impact

Reinforced a research-informed approach to system design, bridging experimental evaluation with production-oriented thinking.

Lessons Learned

1. No single deep learning architecture is sufficient in isolation for robust image similarity.

2. Feature-level fusion preserves complementary information more effectively than late-stage aggregation.

3. Embedding visualizations are critical for diagnosing representational quality beyond numerical metrics.

4. Normalization and scale alignment are essential when combining heterogeneous embeddings.

5. Simple, empirically validated fusion strategies often outperform more complex, unproven alternatives.

6. Research rigor significantly improves confidence in architectural decisions.

What I'd Do Differently

Introduce learned fusion mechanisms, such as attention-based weighting, earlier in the evaluation pipeline.

Incorporate larger and more diverse datasets to further stress-test cross-domain generalization.

Explore adaptive dimensionality reduction techniques tailored to concatenated embeddings.

Integrate online evaluation to measure real-time system performance and latency trade-offs.

Allocate additional effort to interpretability tooling to better explain fusion behavior to non-technical stakeholders.

Next Steps

1. Refine the work for peer-reviewed journal publication

2. Extend fusion framework with learnable weighting mechanisms

3. Evaluate performance in production-scale retrieval scenarios

4. Explore cross-modal similarity extensions

Tech Stack

  • Python
  • PyTorch
  • Hugging Face Transformers
  • NumPy
  • scikit-learn
  • UMAP-learn
  • Matplotlib
  • CUDA

Key Features

  • Multi-model embedding extraction
  • Feature-level fusion (mean, weighted, concatenation)
  • Cosine similarity–based retrieval
  • Statistical validation of improvements
  • Embedding space visualization
  • Extensible architecture for production systems