CFIR: Fast And Effective Long-text To Image Retrieval For Large Corpora
2024 · Zijun Long, Xuri Ge, Richard McCreadie, et al.
Abstract
Text-to-image retrieval aims to find relevant images for a text query, which is important in use cases such as digital libraries, e-commerce, and multimedia databases. Although Multimodal Large Language Models (MLLMs) demonstrate state-of-the-art performance, they struggle with large-scale, diverse, and ambiguous real-world retrieval needs, owing to their computation cost and the injective embeddings they produce. This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective large-scale long-text to image retrieval. The first stage, Entity-based Ranking (ER), adapts to long-text query ambiguity by employing a multiple-queries-to-multiple-targets paradigm, which filters candidates for the next stage. The second stage, Summary-based Re-ranking (SR), refines these rankings using summarized queries. We also propose a specialized Decoupling-BEiT-3 encoder, optimized for handling ambiguous us
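The two-stage flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional vectors, the function names `entity_ranking` and `summary_rerank`, and the choice of max-over-entities as the stage-one aggregation are all assumptions made for illustration. The only points taken from the abstract are that several entity queries produce a candidate set, that a summarized query re-ranks those candidates, and that both stages read the same image index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def entity_ranking(entity_vecs, index, k):
    """Stage 1 (coarse): score each image by its best match over all
    entity queries (assumed aggregation), then keep the top-k image ids."""
    scores = {
        img_id: max(cosine(q, vec) for q in entity_vecs)
        for img_id, vec in index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def summary_rerank(summary_vec, index, candidates):
    """Stage 2 (fine): re-rank only the stage-1 candidates against the
    single summary query, reusing the same index."""
    return sorted(
        candidates,
        key=lambda img_id: cosine(summary_vec, index[img_id]),
        reverse=True,
    )

# Hypothetical shared image index: image id -> embedding.
index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
    "img_d": [0.0, 0.0, 1.0],
}
# Hypothetical embeddings of two entities extracted from a long query,
# and of a summary of that same query.
entity_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
summary_vec = [0.6, 0.8, 0.0]

candidates = entity_ranking(entity_vecs, index, k=3)
ranking = summary_rerank(summary_vec, index, candidates)
print(ranking)
```

In this toy run the coarse stage drops `img_d`, which matches neither entity, and the fine stage promotes `img_c`, which matches the summary as a whole better than either single-entity image does.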
Related papers
- Flickr30k-cfq: A Compact And Fragmented Query Dataset For Text-image Retrieval (2024)
- Fico-itr: Bridging Fine-grained And Coarse-grained Image-text Retrieval For Comparative Performance Analysis (2024)
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data (2025)
- Zero-shot Composed Text-image Retrieval (2023)
- Infocir: Multimedia Analysis For Composed Image Retrieval (2026)
- Interactive Text-to-image Retrieval With Large Language Models: A Plug-and-play Approach (2024)
- Lexlip: Lexicon-bottlenecked Language-image Pre-training For Large-scale Image-text Retrieval (2023)
- Category-level Text-to-image Retrieval Improved: Bridging The Domain Gap With Diffusion Models And Vision Encoders (2025)