ReMatch: Boosting Representation Through Matching For Multimodal Retrieval
2025 · Qianying Liu, Xiao Liang, Zhiqiang Zhang, et al.
Abstract
We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treat the MLLM as a simple encoder, ignoring its generative nature and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, which include both the raw data and the model's own projected embeddings for each query and document. This provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we augment each input with multiple learnable tokens, generating fine-grained, contextual, mutually orthogonal embeddings at low inference cost. Leveraging our established high-performance base
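The abstract names three training signals: a standard contrastive loss, instance-wise supervision from the matching stage, and mutually orthogonal embeddings from multiple learnable tokens. The sketch below shows one way such terms could be combined, and it is not the authors' implementation. It assumes the matching stage can be summarised as a per-pair probability of "relevant", that orthogonality is encouraged with a squared-cosine penalty, and that the terms are summed with the hypothetical weights `w_match` and `w_orth`; none of these details are stated in the abstract.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(queries, docs, temperature=0.05):
    """In-batch contrastive loss; docs[i] is the positive for queries[i]."""
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in docs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, dj) / temperature for dj in d]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def matching_loss(match_probs, labels, eps=1e-12):
    """Binary cross-entropy over per-pair relevance probabilities.

    Assumption: match_probs[k] stands in for the probability that the
    matching stage answers "relevant" for pair k.
    """
    total = 0.0
    for p, y in zip(match_probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


def orthogonality_penalty(token_embeddings):
    """Sum of squared cosine similarities between distinct token embeddings."""
    e = [normalize(v) for v in token_embeddings]
    return sum(
        dot(e[i], e[j]) ** 2
        for i in range(len(e))
        for j in range(i + 1, len(e))
    )


def combined_loss(queries, docs, match_probs, labels, token_embeddings,
                  w_match=1.0, w_orth=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (
        info_nce(queries, docs)
        + w_match * matching_loss(match_probs, labels)
        + w_orth * orthogonality_penalty(token_embeddings)
    )
```

In this toy form, a hard negative that the contrastive term scores close to the positive still receives its own per-pair gradient through `matching_loss`, which is the kind of complementary signal the abstract describes.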
Related papers
- Recurrence Meets Transformers For Universal Multimodal Retrieval (2025) 2.41
- CREM: Compression-driven Representation Enhancement For Multimodal Retrieval And Comprehension (2026) 0.00
- RETLLM: Training And Data-free MLLMs For Multimodal Information Retrieval (2026) 1.57
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026) 0.00
- Learning To Rematch Mismatched Pairs For Robust Cross-modal Retrieval (2024) 13.82
- Look, Imagine And Match: Improving Textual-visual Cross-modal Retrieval With Generative Models (2017) 18.52
- Freeret: MLLMs As Training-free Retrievers (2025) 0.00
- Generative Giants, Retrieval Weaklings: Why Do Multimodal Large Language Models Fail At Multimodal Retrieval? (2025) 0.00