Sparse And Dense Retrievers Learn Better Together: Joint Sparse-dense Optimization For Text-image Retrieval
2025 · Jonghyun Song, Youngjune Lee, Gyu-Hwung Cho, et al.
Abstract
Vision-Language Pretrained (VLP) models have achieved impressive performance on multimodal tasks, including text-image retrieval, based on dense representations. Meanwhile, Learned Sparse Retrieval (LSR) has gained traction in text-only settings due to its interpretability and efficiency with fast term-based lookup via inverted indexes. Inspired by these advantages, recent work has extended LSR to the multimodal domain. However, these methods often rely on computationally expensive contrastive pre-training, or distillation from a frozen dense model, which limits the potential for mutual enhancement. To address these limitations, we propose a simple yet effective framework that enables bi-directional learning between dense and sparse representations through Self-Knowledge Distillation. This bi-directional learning is achieved using an integrated similarity score (a weighted sum of dense and sparse similarities), which serves as a shared teacher signal for both representations. To ensure ef
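The integrated similarity score described in the abstract can be illustrated with a minimal sketch. The abstract only states that the teacher signal is a weighted sum of dense and sparse similarities; everything else here is an assumption made for illustration: the convex weighting via `weight`, the softmax `temperature`, the KL-divergence form of the distillation loss, and all function names. It is not the authors' implementation.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def integrated_scores(dense, sparse, weight=0.5):
    """Weighted sum of dense and sparse similarities.

    `weight` is an assumed hyperparameter; the convex form
    weight * dense + (1 - weight) * sparse is one possible choice.
    """
    return [weight * d + (1.0 - weight) * s for d, s in zip(dense, sparse)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_distillation_loss(dense, sparse, weight=0.5, temperature=1.0):
    """Assumed loss: both students are pulled toward the shared teacher."""
    teacher = softmax(integrated_scores(dense, sparse, weight), temperature)
    loss_dense = kl_divergence(teacher, softmax(dense, temperature))
    loss_sparse = kl_divergence(teacher, softmax(sparse, temperature))
    return loss_dense + loss_sparse

# Hypothetical similarities of one text query against three candidate images.
dense_sims = [2.0, 0.5, -1.0]
sparse_sims = [1.0, 1.5, -0.5]
loss = self_distillation_loss(dense_sims, sparse_sims)
```

In this sketch the loss is zero when the dense and sparse rankings already agree and grows as they diverge, so each representation receives a training signal from the other through the shared teacher. In an actual training loop the teacher distribution would typically be treated as a constant (no gradient), which is again an assumption rather than something the abstract states.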
Related papers
- Multimodal Learned Sparse Retrieval For Image Suggestion (2024) 0.00
- CSPLADE: Learned Sparse Retrieval With Causal Language Models (2025) 0.00
- AMMKD: Adaptive Multimodal Multi-teacher Distillation For Lightweight Vision-language Models (2025) 0.00
- Dynamic Contrastive Distillation For Image-text Retrieval (2022) 11.76
- LexLIP: Lexicon-bottlenecked Language-image Pre-training For Large-scale Image-text Retrieval (2023) 10.85
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026) 0.00
- Distilling Vision-language Pretraining For Efficient Cross-modal Retrieval (2024) 0.00
- Infusing Fine-grained Visual Knowledge To Vision-language Models (2025) 0.00