When Vision Meets Texts In Listwise Reranking
2026 · Hongyi Cai
Abstract
Recent advancements in information retrieval have highlighted the potential of integrating visual and textual information, yet effective reranking for image-text documents remains challenging due to the modality gap and scarcity of aligned datasets. Meanwhile, existing approaches often rely on large models (7B to 32B parameters) with reasoning-based distillation, incurring unnecessary computational overhead while primarily focusing on textual modalities. In this paper, we propose Rank-Nexus, a multimodal image-text document reranker that performs listwise qualitative reranking on retrieved lists incorporating both images and texts. To bridge the modality gap, we introduce a progressive cross-modal training strategy. We first train modalities separately: leveraging abundant text reranking data, we distill knowledge into the text branch. For images, where data is scarce, we construct distilled pairs from multimodal large language model (MLLM) captions on image retrieval benchmarks. Subse