LoopITR: Combining Dual And Cross Encoder Architectures For Image-text Retrieval
2022 · Jie Lei, Xinlei Chen, Ning Zhang, et al.
Abstract
Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently followed by a dot product, while the cross encoder jointly feeds image and text as the input and performs dense multi-modal fusion. These two architectures are typically modeled separately without interaction. In this work, we propose LoopITR, which combines them in the same network for joint learning. Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder. Both steps are efficiently performed together in the same model. Our work centers on empirical analyses of this combined architecture, putting the main focus on the design of the distillation objective. Our experimental results highlight the benefits of training the two encoders in the same network, and demonstrate that distillation can be quite effective
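The loop described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not specify the distillation objective, so the KL-divergence loss, the temperature, the top-k mining rule, and the `toy_cross_encoder` stand-in are all assumptions made for illustration. The sketch shows the dual encoder's dot-product scores selecting hard negatives, a (placeholder) cross encoder re-scoring the positive plus those negatives, and the cross encoder's score distribution serving as the target for the dual encoder.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def mine_hard_negatives(sim, k):
    """For each image (row), return the k highest-scoring non-matching texts.

    Assumes sim[i, i] is the score of the matched image-text pair.
    """
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # exclude the positive pair
    return np.argsort(-masked, axis=1)[:, :k]


def toy_cross_encoder(img, txt):
    """Hypothetical stand-in for a cross encoder (NOT a real fusion model).

    Scores each pair by negative mean squared difference, only so the
    example runs end to end.
    """
    return -np.mean((img - txt) ** 2, axis=-1)


def distillation_loss(dual_scores, cross_scores, temperature=1.0):
    """Assumed objective: mean KL(teacher || student) over the candidates,
    with the cross encoder as teacher and the dual encoder as student."""
    log_teacher = log_softmax(cross_scores / temperature)
    log_student = log_softmax(dual_scores / temperature)
    kl = (np.exp(log_teacher) * (log_teacher - log_student)).sum(axis=1)
    return kl.mean()


rng = np.random.default_rng(0)
n, d, k = 8, 16, 3  # batch size, embedding size, hard negatives per image

# Toy embeddings: each text is a noisy copy of its matching image.
img = rng.normal(size=(n, d))
txt = img + 0.5 * rng.normal(size=(n, d))
img_n = img / np.linalg.norm(img, axis=1, keepdims=True)
txt_n = txt / np.linalg.norm(txt, axis=1, keepdims=True)

# Step 1: dual encoder scores all pairs with a dot product.
sim = img_n @ txt_n.T

# Step 2: the dual encoder supplies hard negatives to the cross encoder.
neg_idx = mine_hard_negatives(sim, k)
cand = np.concatenate([np.arange(n)[:, None], neg_idx], axis=1)  # positive first

# Step 3: the cross encoder re-scores only the positive and hard negatives.
cross_scores = toy_cross_encoder(img[:, None, :], txt[cand])
dual_scores = np.take_along_axis(sim, cand, axis=1)

# Step 4: distill the cross encoder's distribution back to the dual encoder.
loss = distillation_loss(dual_scores, cross_scores)
print(f"distillation loss over {k + 1} candidates per image: {loss:.4f}")
```

In a real training setup both encoders would be neural networks sharing one model, and the teacher scores would normally be treated as constants when the distillation gradient is computed; the sketch omits gradients entirely.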
Related papers
- How To Make Cross Encoder A Good Teacher For Efficient Image-text Retrieval? (2024)
- CODER: Coupled Diversity-sensitive Momentum Contrastive Learning For Image-text Retrieval (2022)
- Dynamic Contrastive Distillation For Image-text Retrieval (2022)
- LexLIP: Lexicon-bottlenecked Language-image Pre-training For Large-scale Image-text Retrieval (2023)
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021)
- Retrieve Fast, Rerank Smart: Cooperative And Joint Approaches For Improved Cross-modal Retrieval (2021)
- Intra-modal Constraint Loss For Image-text Retrieval (2022)
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025)