LexLIP: Lexicon-bottlenecked Language-image Pre-training For Large-scale Image-text Retrieval
2023 · Ziyang Luo, Pu Zhao, Can Xu, et al.
Abstract
Image-text retrieval (ITR) is the task of retrieving relevant images or texts given a query from the other modality. The conventional dense retrieval paradigm encodes images and texts into dense representations using dual-stream encoders; however, it suffers from low retrieval speed in large-scale retrieval scenarios. In this work, we propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts to take advantage of bag-of-words models and efficient inverted indexes, resulting in significantly reduced retrieval latency. A crucial gap arises between the continuous nature of image data and the requirement for a sparse vocabulary-space representation. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon representations. This framework features lexicon-bottlenecked modules between the dual-stream encoders
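The lexicon-weighting idea described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes each image has already been mapped to a sparse dictionary of vocabulary terms and weights (in LexLIP these would come from the pre-trained encoders; here they are hand-made toy values). The function names `build_inverted_index` and `search`, and all weights, are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each vocabulary term to the (doc_id, weight) pairs that contain it.

    docs: {doc_id: {term: weight}}, a sparse lexicon-weighted representation.
    """
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            if weight > 0:  # only non-zero entries are stored
                index[term].append((doc_id, weight))
    return index


def search(index, query, top_k=3):
    """Score documents by the sparse dot product with the query.

    Only posting lists of terms present in the query are visited, which is
    why an inverted index can avoid scanning the whole collection.
    """
    scores = defaultdict(float)
    for term, query_weight in query.items():
        for doc_id, doc_weight in index.get(term, ()):
            scores[doc_id] += query_weight * doc_weight
    # Highest score first; ties broken by doc_id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy, hand-made weights (assumed for illustration, not model outputs).
images = {
    "img_dog": {"dog": 1.5, "grass": 0.5},
    "img_cat": {"cat": 1.5, "sofa": 1.0},
    "img_park": {"grass": 1.0, "tree": 1.0, "dog": 0.5},
}
index = build_inverted_index(images)

# A text query expressed in the same vocabulary space.
query = {"dog": 1.0, "grass": 0.5}
results = search(index, query)
print(results)
```

In this toy run, `img_cat` shares no term with the query, so its entries are never touched during scoring. That skipping of non-matching items is the property that makes sparse retrieval attractive at large scale.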
Related papers
- HiVLP: Hierarchical Vision-language Pre-training For Fast Image-text Retrieval (2022)
- LexMAE: Lexicon-bottlenecked Pretraining For Large-scale Retrieval (2022)
- ELIP: Enhanced Visual-language Foundation Models For Image Retrieval (2025)
- LightningDOT: Pre-training Visual-semantic Embeddings For Real-time Image-text Retrieval (2021)
- Dynamic Contrastive Distillation For Image-text Retrieval (2022)
- LotLIP: Improving Language-image Pre-training For Long Text Understanding (2024)
- Sparse And Dense Retrievers Learn Better Together: Joint Sparse-dense Optimization For Text-image Retrieval (2025)
- ImageBERT: Cross-modal Pre-training With Large-scale Weak-supervised Image-text Data (2020)