Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
2025 · Mohamed Eltahir, Ali Habibullah, Lama Ayash, et al.
Abstract
In retrieval, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. Typical fusion techniques are training-free, but they rely solely on rank or score signals and disregard the candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that rethinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over v
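To make the contrast in the abstract concrete, the sketch below places a rank-only fusion baseline next to a prompt that serializes retriever metadata together with content evidence. It is an illustration under stated assumptions, not the authors' implementation: reciprocal rank fusion (RRF) stands in for the "typical" rank-based techniques, the helper names (`reciprocal_rank_fusion`, `serialize_candidates`), the prompt layout, and the sample data are all hypothetical, and plain-text captions replace the image grids that the S-Grid would supply.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Rank-only baseline: score(c) = sum over retrievers of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] += 1.0 / (k + rank)
    # Sort by descending score; break ties by candidate id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


def serialize_candidates(query, ranked_lists, captions):
    """Build a text prompt listing each candidate with its per-retriever ranks.

    Hypothetical layout: the captions are a text stand-in for visual content.
    """
    candidates = sorted({c for ranking in ranked_lists.values() for c in ranking})
    lines = [f"Query: {query}", "Candidates:"]
    for cand in candidates:
        ranks = ", ".join(
            f"{name}=#{ranking.index(cand) + 1}" if cand in ranking
            else f"{name}=unranked"
            for name, ranking in ranked_lists.items()
        )
        content = captions.get(cand, "n/a")
        lines.append(f"- {cand} | ranks: {ranks} | content: {content}")
    lines.append("Return the candidate ids ordered from most to least relevant.")
    return "\n".join(lines)


# Made-up example data (not from the paper).
ranked_lists = {
    "retriever_a": ["v1", "v2", "v3"],
    "retriever_b": ["v2", "v3", "v1"],
}
captions = {"v1": "a dog on a beach", "v2": "a dog catching a frisbee"}

fused = reciprocal_rank_fusion(ranked_lists)
prompt = serialize_candidates("dog catches frisbee", ranked_lists, captions)
```

The baseline sees only rank positions, whereas the serialized prompt exposes both the ranks and a content field, so a model reading it could in principle overrule retriever consensus when the content disagrees with it.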
Related papers
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion (2025) · 2.26
- X-aligner: Composed Visual Retrieval Without The Bells And Whistles (2026) · 0.00
- Unicvr: From Alignment To Reranking For Unified Zero-shot Composed Visual Retrieval (2026) · 0.00
- Generative Editing In The Joint Vision-language Space For Zero-shot Composed Image Retrieval (2025) · 0.00
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026) · 0.00
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) · 0.00
- Vlm4rec: Multimodal Semantic Representation For Recommendation With Large Vision-language Models (2026) · 1.82
- HVD: Human Vision-driven Video Representation Learning For Text-video Retrieval (2026) · 0.00