Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval
2022 · Zhenghao Liu, Chenyan Xiong, Yuanhuiyi Lv, et al.
Abstract
This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in a shared embedding space so that candidates from different modalities can be searched together. To learn this unified embedding space, UniVL-DR proposes two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves state-of-the-art results on the multi-modal open-domain question answering benchmark WebQA and outperforms all retrieval models on its two subtasks, text-text retrieval and text-image retrieval. These results demonstrate that a single universal multi-modal search model can feasibly replace the divide-and-conquer pipeline and that it also benefits single- and cross-modality tasks. All source co
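To make the first technique more concrete, the following is a minimal sketch of contrastive training with modality-balanced hard negatives. It is not the authors' implementation: the function names (`balanced_hard_negatives`, `contrastive_loss`), the even split between image and text negatives, the dot-product scoring, and the toy 2-D embeddings are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import math
import random


def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def balanced_hard_negatives(image_negs, text_negs, k, rng):
    """Pick k hard negatives, half from each modality (hypothetical helper).

    Assumption: "modality-balanced" is read here as an equal number of
    image and text negatives per query.
    """
    if k % 2 != 0:
        raise ValueError("k must be even to balance the two modalities")
    half = k // 2
    if len(image_negs) < half or len(text_negs) < half:
        raise ValueError("not enough hard negatives in one modality")
    return rng.sample(image_negs, half) + rng.sample(text_negs, half)


def contrastive_loss(query, positive, negatives):
    """Negative log-likelihood of the positive under a softmax over scores."""
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]


# Toy usage with made-up 2-D embeddings.
rng = random.Random(0)
query = [1.0, 0.0]
positive = [1.0, 0.0]
image_negs = [[0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
text_negs = [[0.1, 0.9], [0.4, 0.6], [0.3, 0.7]]

negs = balanced_hard_negatives(image_negs, text_negs, k=4, rng=rng)
loss = contrastive_loss(query, positive, negs)
```

Drawing the same number of negatives from each modality keeps either one from dominating the softmax denominator, which is one plausible reading of why balancing would help a single embedding space serve both text and image candidates.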
Related papers
- Unifier: A Unified Retriever For Large-scale Retrieval (2022) · 7.50
- Reasoning-augmented Representations For Multimodal Retrieval (2026) · 0.00
- MARVEL: Unlocking The Multi-modal Capability Of Dense Retrieval Via Visual Module Plugin (2023) · 9.04
- Mumur: Multilingual Multimodal Universal Retrieval (2022) · 2.26
- Uniir: Training And Benchmarking Universal Multimodal Information Retrievers (2023) · 10.48
- Unicvr: From Alignment To Reranking For Unified Zero-shot Composed Visual Retrieval (2026) · 0.00
- Tevatron 2.0: Unified Document Retrieval Toolkit Across Scale, Language, And Modality (2025) · 3.58
- VISTA: Visualized Text Embedding For Universal Multi-modal Retrieval (2024) · 16.73