Applied AI / Computer Vision

Case Study

Evaluation of 13 state-of-the-art deep learning models across thousands of images with multiple fusion strategies

Dec 2025

12–15 minute read

8+ months

Client

Internal Research Initiative (Industry-Inspired)

Location

Global

Role

Principal Software Architect

Scale

Enterprise

Sayem - Enhancing Image Similarity Detection Using Multi-Model Deep Learning Architectures

Key Metrics

Models Evaluated

13

Fusion Strategies

3

Best mAP Improvement

+0.03 to +0.04

Best Silhouette Score

0.249

Lowest Davies–Bouldin Index

1.93

Statistical Significance

p < 0.001

Executive Summary

This case study documents the design and evaluation of a multi-model image similarity framework that fuses embeddings from convolutional, transformer-based, and multimodal deep learning architectures. The work shows that feature-level fusion improves robustness, semantic alignment, and retrieval performance over single-model approaches, with a best mAP improvement of +0.03 to +0.04 (p < 0.001).

The Problem

In practical image similarity and retrieval systems, reliance on a single deep learning architecture often leads to representational bias and inconsistent performance across diverse visual and semantic contexts. During industry proof-of-concept work, single-model approaches repeatedly failed to capture both low-level visual cues and high-level semantic relationships simultaneously. This highlighted the need for an architecture-agnostic similarity framework capable of leveraging complementary model strengths.

Objectives

Primary Objective

Design a robust, extensible image similarity framework that improves semantic consistency and retrieval accuracy through multi-model fusion.

Secondary Objectives

Evaluate architectural biases across CNNs, Vision Transformers, and multimodal models
Compare multiple fusion strategies using rigorous quantitative metrics
Assess embedding structure and semantic organization via visualization techniques
Derive architectural insights applicable to production-grade systems

Solution

A feature-level fusion framework was implemented to combine normalized embeddings from multiple pretrained deep learning models. Three fusion strategies (mean fusion, weighted fusion, and concatenation fusion) were evaluated to assess trade-offs between performance, dimensionality, and computational complexity. The system was designed to be modular, extensible, and reproducible.

Execution Details

Models Evaluated

CLIP (ViT-B/32)
ViT-B/16
ViT-B/16 (ImageNet-21k)
ResNet-50
ResNet-152
Inception V3
Inception V4
Inception-ResNet-V2
VGG-19
EfficientNet-B0
DenseNet-121
NASNet-Large
PNASNet-5Large

Fusion Strategies

Mean Fusion
Weighted Fusion
Concatenation Fusion
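
The three strategies listed above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the study's implementation: the function names are hypothetical, and mean and weighted fusion here assume the per-model embeddings have already been brought to a common dimensionality (the case study does not say how that alignment was done). Concatenation has no such requirement.

```python
from typing import Sequence

import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm so models with different feature scales are comparable."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def mean_fusion(embeddings: Sequence[np.ndarray]) -> np.ndarray:
    """Average normalized embeddings. Assumption: every array has shape (images, dim)."""
    stacked = np.stack([l2_normalize(e) for e in embeddings])  # (models, images, dim)
    return l2_normalize(stacked.mean(axis=0))


def weighted_fusion(embeddings: Sequence[np.ndarray], weights: Sequence[float]) -> np.ndarray:
    """Weighted average of normalized embeddings; weights are rescaled to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack([l2_normalize(e) for e in embeddings])  # (models, images, dim)
    fused = np.tensordot(w, stacked, axes=1)  # contract over the model axis -> (images, dim)
    return l2_normalize(fused)


def concat_fusion(embeddings: Sequence[np.ndarray]) -> np.ndarray:
    """Concatenate normalized embeddings along the feature axis; dimensions may differ."""
    joined = np.concatenate([l2_normalize(e) for e in embeddings], axis=1)
    return l2_normalize(joined)
```

Mean and weighted fusion keep the output at the shared dimensionality, while concatenation grows it with every added model, which is the dimensionality trade-off the Solution section describes.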

Embedding Processing

Pretrained feature extraction
L2 normalization for scale alignment
Optional dimensionality reduction for concatenation fusion
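
As a rough illustration of the processing steps above, the sketch below L2-normalizes two sets of embeddings, concatenates them, and reduces the result with PCA. The choice of PCA, the 64-component target, and the random stand-in arrays are all assumptions for illustration; the case study does not name the reduction technique that was used.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def pca_reduce(x: np.ndarray, n_components: int) -> np.ndarray:
    """Project rows of x onto the top principal components (PCA via SVD)."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random stand-ins for features from two pretrained models (hypothetical sizes).
rng = np.random.default_rng(0)
model_a = l2_normalize(rng.normal(size=(100, 512)))
model_b = l2_normalize(rng.normal(size=(100, 2048)))

concatenated = np.concatenate([model_a, model_b], axis=1)  # shape (100, 2560)
reduced = l2_normalize(pca_reduce(concatenated, n_components=64))  # shape (100, 64)
```

Normalizing each model before concatenation keeps a model with large-magnitude features from dominating the combined vector, and re-normalizing after reduction keeps cosine similarity well behaved.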

Evaluation Methods

Cosine similarity-based retrieval
Precision@K and mean Average Precision (mAP)
Silhouette Coefficient
Davies–Bouldin Index
t-SNE and UMAP visualization
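
The retrieval metrics above can be computed along these lines. This is a minimal sketch under stated assumptions: every image serves as a query against all other images, two images count as relevant to each other when they share a class label, and the function names are hypothetical rather than taken from the study's code.

```python
import numpy as np


def cosine_rankings(embeddings: np.ndarray) -> np.ndarray:
    """For each query image, return indices of all other images by descending cosine similarity."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # push the query itself to the end of its own ranking
    order = np.argsort(-sims, axis=1)
    return order[:, :-1]  # drop the final column, which is the query itself


def precision_at_k(ranked_labels: np.ndarray, query_labels: np.ndarray, k: int) -> float:
    """Fraction of the top-k results that share the query's label, averaged over queries."""
    hits = ranked_labels[:, :k] == query_labels[:, None]
    return float(hits.mean())


def mean_average_precision(ranked_labels: np.ndarray, query_labels: np.ndarray) -> float:
    """Mean of per-query average precision; queries with no relevant items are skipped."""
    hits = (ranked_labels == query_labels[:, None]).astype(float)
    ranks = np.arange(1, hits.shape[1] + 1)
    precision_at_rank = np.cumsum(hits, axis=1) / ranks
    n_relevant = hits.sum(axis=1)
    ap = (precision_at_rank * hits).sum(axis=1) / np.maximum(n_relevant, 1)
    return float(ap[n_relevant > 0].mean())


# Toy data: two classes pointing along different axes.
labels = np.array([0, 0, 1, 1])
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
ranked_labels = labels[cosine_rankings(embeddings)]
p_at_1 = precision_at_k(ranked_labels, labels, k=1)
map_score = mean_average_precision(ranked_labels, labels)
```

Running the same functions on single-model and fused embeddings gives directly comparable Precision@K and mAP figures.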

Outcomes

Multi-model fusion consistently outperformed single-model baselines

Concatenation fusion delivered the highest retrieval accuracy

Weighted fusion achieved strong performance with lower dimensionality

Fused embeddings exhibited smoother and better-separated manifolds

Findings validated through both quantitative metrics and visual analysis
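
The quantitative side of that validation can be approximated with scikit-learn's clustering metrics. The sketch below assumes the scores are computed against ground-truth class labels on L2-normalized embeddings with Euclidean distance; the case study does not state these details, and the toy data and helper name are hypothetical.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score


def cluster_quality(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    """Score how well class labels separate in embedding space.

    Higher silhouette and lower Davies-Bouldin both indicate tighter,
    better-separated groups.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return {
        "silhouette": float(silhouette_score(unit, labels)),
        "davies_bouldin": float(davies_bouldin_score(unit, labels)),
    }


# Toy comparison: the same two classes with little noise versus heavy noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
centers = np.eye(16)[labels]  # class 0 sits on axis 0, class 1 on axis 1
tight = cluster_quality(centers + 0.05 * rng.normal(size=centers.shape), labels)
loose = cluster_quality(centers + 1.0 * rng.normal(size=centers.shape), labels)
```

In this toy setup the low-noise embeddings score a higher silhouette and a lower Davies–Bouldin index than the noisy ones, which is the direction of improvement reported for fused embeddings.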

Impact

Technical Impact

The study confirmed that combining heterogeneous deep learning architectures leads to richer and more semantically coherent image representations. Feature-level fusion mitigated individual model biases and improved generalization.

Organizational Impact

The work was shared with organizational leadership, including the CEO, who appreciated the technical depth and real-world relevance. The initiative demonstrated how applied research can directly inform architectural decisions.

Professional Impact

Reinforced a research-informed approach to system design, bridging experimental evaluation with production-oriented thinking.

Lessons Learned

No single deep learning architecture is sufficient in isolation for robust image similarity.

Feature-level fusion preserves complementary information more effectively than late-stage aggregation.

Embedding visualizations are critical for diagnosing representational quality beyond numerical metrics.

Normalization and scale alignment are essential when combining heterogeneous embeddings.

Simple, empirically validated fusion strategies often outperform more complex, unproven alternatives.

Research rigor significantly improves confidence in architectural decisions.

What I'd Do Differently

Introduce learned fusion mechanisms, such as attention-based weighting, earlier in the evaluation pipeline.

Incorporate larger and more diverse datasets to further stress-test cross-domain generalization.

Explore adaptive dimensionality reduction techniques tailored to concatenated embeddings.

Integrate online evaluation to measure real-time system performance and latency trade-offs.

Allocate additional effort to interpretability tooling to better explain fusion behavior to non-technical stakeholders.

Next Steps

1. Refine the work for peer-reviewed journal publication
2. Extend fusion framework with learnable weighting mechanisms
3. Evaluate performance in production-scale retrieval scenarios
4. Explore cross-modal similarity extensions

Tech Stack

Python
PyTorch
Hugging Face Transformers
NumPy
scikit-learn
UMAP-learn
Matplotlib
CUDA

Key Features

Multi-model embedding extraction
Feature-level fusion (mean, weighted, concatenation)
Cosine similarity–based retrieval
Statistical validation of improvements
Embedding space visualization
Extensible architecture for production systems
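
For the statistical validation feature listed above, one common option is a paired sign-flip permutation test on per-query average precision. The case study reports p < 0.001 but does not name the test that was used, so the test choice, the function name, and the synthetic scores below are all assumptions for illustration.

```python
import numpy as np


def paired_permutation_pvalue(baseline, fused, n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided paired sign-flip permutation test on per-query score differences."""
    diffs = np.asarray(fused, dtype=float) - np.asarray(baseline, dtype=float)
    observed = abs(diffs.mean())
    rng = np.random.default_rng(seed)
    # Under the null hypothesis, each paired difference is equally likely to have either sign.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # Add-one correction keeps the estimated p-value strictly positive.
    return float((np.sum(permuted >= observed) + 1) / (n_permutations + 1))


# Synthetic per-query AP scores: the fused system is consistently about 0.035 higher.
rng = np.random.default_rng(1)
baseline_ap = rng.uniform(0.3, 0.7, size=200)
fused_ap = baseline_ap + 0.035 + rng.normal(scale=0.01, size=200)
p_value = paired_permutation_pvalue(baseline_ap, fused_ap)
```

A paired test fits this setting because both systems are scored on the same queries, so per-query difficulty cancels out in the differences.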

