Supervised Fine-tuning Or Contrastive Learning? Towards Better Multimodal LLM Reranking
2025 · Ziqi Dai, Xin Zhang, Mingxin Li, et al.
Abstract
In information retrieval, reranking models are trained mainly with two types of objectives: metric learning (e.g., a contrastive loss that increases the predicted scores of relevant query-document pairs) and classification (binary prediction of relevant vs. irrelevant). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. For large language models (LLMs), however, classification via supervised fine-tuning (SFT), which predicts the "yes" (resp. "no") token for relevant (resp. irrelevant) pairs, appears more promising because it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis of CL and SFT for reranking, taking universal multimodal retrieval (UMR) as the experim
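The two objectives named in the abstract can be contrasted in a minimal sketch. The abstract does not give the exact loss formulas, so everything here is an assumption for illustration: the SFT loss is shown as cross-entropy restricted to the two candidate tokens (real SFT normalizes over the full vocabulary), the contrastive loss is shown in the common InfoNCE form, and the names `sft_loss`, `contrastive_loss`, and `temperature` are hypothetical.

```python
import math


def sft_loss(yes_logit, no_logit, relevant):
    """Classification-style loss for ONE query-document pair.

    Assumption: softmax is taken over only the "yes" and "no" logits,
    a simplification of next-token cross-entropy over the whole vocabulary.
    """
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(yes_logit - m) + math.exp(no_logit - m))
    target = yes_logit if relevant else no_logit
    return log_z - target  # equals -log p(target token)


def contrastive_loss(scores, positive_index, temperature=1.0):
    """InfoNCE-style loss for ONE query over a list of candidate scores.

    Assumption: exactly one candidate is relevant and the rest are negatives.
    """
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[positive_index]  # equals -log softmax(positive)


if __name__ == "__main__":
    # One relevant pair scored in isolation (SFT view).
    print(sft_loss(yes_logit=2.0, no_logit=-1.0, relevant=True))
    # The same positive scored against two negatives (CL view).
    print(contrastive_loss([2.0, 0.5, -1.0], positive_index=0))
```

The structural difference this sketch is meant to surface: the classification loss looks at each pair independently, while the contrastive loss normalizes the positive's score against the other candidates for the same query.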
Related papers
- RETLLM: Training And Data-free MLLMs For Multimodal Information Retrieval (2026) · 1.57
- MLLM Is A Strong Reranker: Advancing Multimodal Retrieval-augmented Generation Via Knowledge-enhanced Reranking And Noise-injected Training (2024) · 9.18
- Generalized Contrastive Learning For Multi-modal Retrieval And Ranking (2024) · 6.01
- Rebol: Retrieval Via Bayesian Optimization With Batched LLM Relevance Observations And Query Reformulation (2026) · 0.00
- LamRA: Large Multimodal Model As Your Advanced Retrieval Assistant (2024) · 7.50
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024) · 0.00
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026) · 0.00
- What Drives Cross-lingual Ranking? Retrieval Approaches With Multilingual Language Models (2025) · 0.00