MuMUR: Multilingual Multimodal Universal Retrieval
2022 · Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, et al.
Abstract
Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which takes a huge manual effort to produce. In this paper, we propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval. We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs. We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space based on pretrained multilingual models. We evaluate our proposed approach on a diverse set of retrieval datasets: five video retrieval datasets (MSRVTT, MSVD, DiDeMo, Charades and MSRVTT multilingual) and two image retrieval datasets (Flickr30k and Multi30k). Experimental results demonstrate that our approach achieves state-of-the-art r
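To make the idea of retrieval in a common embedding space concrete, here is a minimal sketch of how text queries in any language could be ranked against visual items and scored with Recall@K. It is not the authors' implementation: the function names `l2_normalize` and `recall_at_k`, the cosine-similarity ranking, and the toy random embeddings are all assumptions made for illustration. In a real system the embeddings would come from trained text and visual encoders.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def recall_at_k(text_emb, visual_emb, k):
    """Fraction of text queries whose paired visual item ranks in the top k.

    Assumes row i of text_emb is the caption for row i of visual_emb
    (a hypothetical pairing convention for this sketch).
    """
    sims = l2_normalize(text_emb) @ l2_normalize(visual_emb).T
    # Similarity of each query to its own ground-truth visual item.
    gt = np.diag(sims)[:, None]
    # Zero-based rank: how many items score strictly higher than the ground truth.
    ranks = (sims > gt).sum(axis=1)
    return float((ranks < k).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(5, 8))
    # Toy stand-ins for caption embeddings in two languages: each is the
    # paired visual embedding plus small noise, mimicking a shared space.
    queries = {
        "en": visual + 0.01 * rng.normal(size=visual.shape),
        "de": visual + 0.01 * rng.normal(size=visual.shape),
    }
    for lang, q in queries.items():
        print(lang, "R@1 =", recall_at_k(q, visual, k=1))
```

Because queries in every language live in the same space as the visual items, the same ranking function serves English and non-English queries without any language-specific branch.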
Related papers
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval (2025)
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024)
- MURAL: Multimodal, Multitask Retrieval Across Languages (2021)
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022)
- MUVR: A Multi-modal Untrimmed Video Retrieval Benchmark With Multi-level Visual Correspondence (2025)
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion (2025)
- End-to-end Knowledge Retrieval With Multi-modal Queries (2023)
- GME: Improving Universal Multimodal Retrieval By Multimodal LLMs (2024)