Generalized Contrastive Learning For Universal Multimodal Retrieval
2025 · Jungsoo Lee, Janghoon Cho, Hyojin Park, et al.
Abstract
Despite their consistent performance improvements, cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys that fuse image and text modalities (e.g., Wikipedia pages with both images and text). To address this challenge, multimodal retrieval has recently been explored as a way to develop a single unified retrieval model capable of retrieving keys across diverse modality combinations. A common approach involves constructing new composed sets of image-text triplets (e.g., retrieving an image-text pair given a query image). However, such an approach requires careful curation to ensure dataset quality and fails to generalize to unseen modality combinations. To overcome these limitations, this paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the burden of curating new datasets. Specifically, GCL operates by enforcing contrastive learning across all mod
Related papers
- Generalized Contrastive Learning For Multi-modal Retrieval And Ranking (2024) · 6.01
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024) · 0.00
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) · 4.52
- Modality Curation: Building Universal Embeddings For Advanced Multimodal Information Retrieval (2025) · 0.00
- Normalized Contrastive Learning For Text-video Retrieval (2022) · 6.77
- GME: Improving Universal Multimodal Retrieval By Multimodal LLMs (2024) · 0.00
- Cross-modal Coordination Across A Diverse Set Of Input Modalities (2024) · 0.00
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) · 18.12