Multimodal Representation Alignment For Cross-modal Information Retrieval
2025 · Fan Xu, Luis A. Leiva
Abstract
Different machine learning models can represent the same underlying concept in different ways. This variability is particularly valuable for in-the-wild multimodal retrieval, where the objective is to identify the corresponding representation in one modality given another modality as input. This challenge can be effectively framed as a feature alignment problem. For example, given a sentence encoded by a language model, retrieve the most semantically aligned image based on features produced by an image encoder, or vice versa. In this work, we first investigate the geometric relationships between visual and textual embeddings derived from both vision-language models and combined unimodal models. We then align these representations using four standard similarity metrics as well as two learned ones, implemented via neural networks. Our findings indicate that the Wasserstein distance can serve as an informative measure of the modality gap, while cosine similarity consistently outperforms a
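The retrieval setting described in the abstract can be illustrated with a minimal sketch. It assumes the text and image embeddings already live in a shared space of equal dimensionality, which the abstract does not guarantee. The function names (`cosine_similarity`, `retrieve`, `wasserstein_1d`) and the toy vectors are hypothetical and not taken from the paper. The one-dimensional Wasserstein helper is only one simple way to compare two feature distributions and may differ from how the authors measure the modality gap.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return the index of the candidate with the highest cosine similarity."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For equal-size empirical distributions this is the mean absolute
    difference between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Toy, made-up embeddings (assumption: both encoders output 3-D vectors).
text_embedding = [0.9, 0.1, 0.0]
image_embeddings = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]

# Text-to-image retrieval: pick the image closest to the sentence.
best = retrieve(text_embedding, image_embeddings)

# Crude gap proxy on one feature dimension (first coordinate of each modality).
gap = wasserstein_1d([0.9, 0.8, 0.7], [0.0, 1.0, 0.0])
```

Image-to-text retrieval works the same way with the roles swapped: the image embedding becomes the query and the sentence embeddings become the candidates.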
Related papers
- Maximal Matching Matters: Preventing Representation Collapse For Robust Cross-modal Retrieval (2025)
- Towards Uniformity And Alignment For Multimodal Representation Learning (2026)
- Aligning Multilingual Word Embeddings For Cross-modal Retrieval Task (2019)
- Towards Cross-modal Text-molecule Retrieval With Better Modality Alignment (2024)
- Preserving Semantic Neighborhoods For Robust Cross-modal Retrieval (2020)
- Adversarial Cross-modal Retrieval Via Learning And Transferring Single-modal Similarities (2019)
- Is Cross-modal Information Retrieval Possible Without Training? (2023)
- Multimodal Representation Learning Conditioned On Semantic Relations (2025)