Combating Visual Neglect And Semantic Drift In Large Multimodal Models For Enhanced Cross-modal Retrieval

Abstract

arXiv:2604.25273v1

Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model's ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation, where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME leverages LMMs and visual experts to identify and emph
