Category-level Text-to-image Retrieval Improved: Bridging The Domain Gap With Diffusion Models And Vision Encoders
2025 · Faizan Farooq Khan, Vladan Stojnić, Zakaria Laskar, et al.
Abstract
This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
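The two-step pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: embeddings are assumed to be precomputed (by a VLM for the text query and database images, and by a vision encoder for the generated and database images), mean pooling stands in for the learned aggregation network, and a fixed weight `alpha` stands in for the learned score fusion. All function and parameter names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def aggregate(embeddings):
    """Combine several generated-image embeddings into one visual query.

    Assumption: plain mean pooling replaces the paper's aggregation network.
    """
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(mean)


def fused_scores(text_emb, gen_img_embs, db_vlm_embs, db_vis_embs, alpha=0.5):
    """Score every database image against both query modalities.

    text_emb     -- VLM embedding of the text query
    gen_img_embs -- vision-encoder embeddings of the generated images
    db_vlm_embs  -- VLM image embeddings of the database images
    db_vis_embs  -- vision-encoder embeddings of the same database images
    alpha        -- weight of the text-to-image score (assumed fixed here)
    """
    visual_query = aggregate(gen_img_embs)
    scores = []
    for vlm_emb, vis_emb in zip(db_vlm_embs, db_vis_embs):
        text_to_image = cosine(text_emb, vlm_emb)
        image_to_image = cosine(visual_query, vis_emb)
        scores.append(alpha * text_to_image + (1 - alpha) * image_to_image)
    return scores


def rank(scores):
    """Return database indices sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Setting `alpha=1.0` reduces the sketch to a text-only baseline, which makes it easy to compare rankings with and without the generated visual query.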
Related papers
- Extending CLIP For Category-to-image Retrieval In E-commerce (2021)
- CFIR: Fast And Effective Long-text To Image Retrieval For Large Corpora (2024)
- Multi-level CLS Token Fusion For Contrastive Learning In Endoscopy Image Classification (2025)
- ELIP: Enhanced Visual-language Foundation Models For Image Retrieval (2025)
- CalibCLIP: Contextual Calibration Of Dominant Semantics For Text-driven Image Retrieval (2025)
- Enhancing Image Retrieval: A Comprehensive Study On Photo Search Using The CLIP Mode (2024)
- Learning The Visualness Of Text Using Large Vision-language Models (2023)
- ConText-CIR: Learning From Concepts In Text For Composed Image Retrieval (2025)