MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
2024 · Junjie Zhou, Zheng Liu, Ze Liu, et al.
Abstract
Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision-language models (VLMs) and open-domain images, together with a massive synthetic dataset generated by this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70\(\times\) more data from existing datasets. Moreover, since MegaPairs relies solely on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. At this stage, we have produced more than 26 million training instances and trained several models of varying sizes on this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highes
Related papers
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)
- GME: Improving Universal Multimodal Retrieval By Multimodal LLMs (2024)
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025)
- MuMUR: Multilingual Multimodal Universal Retrieval (2022)
- MultiVENT 2.0: A Massive Multilingual Benchmark For Event-centric Video Retrieval (2024)
- Beyond Global Similarity: Towards Fine-grained, Multi-condition Multimodal Retrieval (2026)
- Towards Universal Video Retrieval: Generalizing Video Embedding Via Synthesized Multimodal Pyramid Curriculum (2025)
- Composed Multi-modal Retrieval: A Survey Of Approaches And Applications (2025)