HiVLP: Hierarchical Vision-language Pre-training For Fast Image-text Retrieval
2022 · Feilong Chen, Xiuyi Chen, Jiaxin Shi, et al.
Abstract
In the past few years, the emergence of vision-language pre-training (VLP) has brought cross-modal retrieval into a new era. However, because of its latency and computation demands, VLP is often difficult to apply in a real-time online retrieval system. To address this limitation, this paper proposes Hierarchical Vision-Language Pre-Training (HiVLP) for fast Image-Text Retrieval (ITR). Specifically, we design a novel hierarchical retrieval objective that uses representations of different dimensions for coarse-to-fine ITR, i.e., a low-dimensional representation for large-scale coarse retrieval and a high-dimensional representation for small-scale fine retrieval. We evaluate HiVLP on two popular image-text retrieval benchmarks, Flickr30k and COCO. Extensive experiments demonstrate that HiVLP not only has fast inference speed but can also be easily scaled to large-scale ITR scenarios. The detailed results
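The coarse-to-fine idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `coarse_to_fine_search`, the shortlist size, and the way the low-dimensional vectors are produced (truncating and re-normalizing the high-dimensional ones) are all assumptions made for illustration, and random vectors stand in for encoder outputs.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine_search(q_low, q_high, gallery_low, gallery_high,
                          shortlist_size, top_k):
    """Return the indices of the top_k gallery items for one query."""
    # Stage 1 (coarse): score every gallery item with the cheap low-dim vectors.
    by_coarse = sorted(range(len(gallery_low)),
                       key=lambda i: dot(q_low, gallery_low[i]),
                       reverse=True)
    shortlist = by_coarse[:shortlist_size]
    # Stage 2 (fine): re-rank only the shortlist with the high-dim vectors.
    by_fine = sorted(shortlist,
                     key=lambda i: dot(q_high, gallery_high[i]),
                     reverse=True)
    return by_fine[:top_k]


# Toy data (assumption): random 32-d vectors stand in for encoder outputs,
# and the low-dim view is simply the first 4 coordinates, re-normalized.
random.seed(0)
HIGH_DIM, LOW_DIM = 32, 4
raw = [[random.gauss(0.0, 1.0) for _ in range(HIGH_DIM)] for _ in range(200)]
gallery_high = [normalize(v) for v in raw]
gallery_low = [normalize(v[:LOW_DIM]) for v in raw]

# Query with item 7's own embeddings, so item 7 should rank first.
results = coarse_to_fine_search(gallery_low[7], gallery_high[7],
                                gallery_low, gallery_high,
                                shortlist_size=20, top_k=5)
print(results)
```

In this sketch the coarse stage touches all 200 items but only 4 dimensions each, while the fine stage uses all 32 dimensions on just 20 candidates. The shortlist size trades recall for speed: a candidate dropped in the coarse stage cannot be recovered by the fine stage.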
Related papers
- LexLIP: Lexicon-bottlenecked Language-image Pre-training For Large-scale Image-text Retrieval (2023) 10.85
- ELIP: Enhanced Visual-language Foundation Models For Image Retrieval (2025) 2.26
- LightningDOT: Pre-training Visual-semantic Embeddings For Real-time Image-text Retrieval (2021) 17.42
- Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval (2026) 1.24
- FiCo-ITR: Bridging Fine-grained And Coarse-grained Image-text Retrieval For Comparative Performance Analysis (2024) 3.58
- Unsupervised Vision-and-language Pre-training Via Retrieval-based Multi-granular Alignment (2022) 10.48
- Dynamic Contrastive Distillation For Image-text Retrieval (2022) 11.76
- PiTL: Cross-modal Retrieval With Weakly-supervised Vision-language Pre-training Via Prompting (2023) 7.16