MASS: Overcoming Language Bias In Image-text Matching
2025 · Jiwan Chung, Seungwon Lim, Sangkyu Lee, et al.
Abstract
Pretrained visual-language models have advanced significantly on multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching is language bias: models rely predominantly on language priors and neglect the visual content. We therefore present the Multimodal ASsociation Score (MASS), a framework that reduces reliance on language priors to improve visual accuracy in image-text matching. It can be incorporated into existing visual-language models without additional training. Our experiments show that MASS reduces language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising way to improve image-text matching performance in visual-language models.
Related papers
- Multilingual Diversity Improves Vision-language Representations (2024)
- Bringing Multimodality To Amazon Visual Search System (2024)
- MLLMs-augmented Visual-language Representation Learning (2023)
- VLMAE: Vision-language Masked Autoencoder (2022)
- Multilingual-to-multimodal (M2M): Unlocking New Languages With Monolingual Text (2026)
- A Multimodal Recaptioning Framework To Account For Perceptual Diversity Across Languages In Vision-language Modeling (2025)
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025)