Combating Visual Neglect And Semantic Drift In Large Multimodal Models For Enhanced Cross-modal Retrieval
2026 Β· Guosheng Zhang, Linkai Liu, Keyao Wang, et al.
Abstract
arXiv:2604.25273v1 Announce Type: new Abstract: Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model's ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation--where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME leverages LMMs and visual experts to identify and emph
Authors
(none)
Tags
Stats
Related papers
- CREM: Compression-driven Representation Enhancement For Multimodal Retrieval And Comprehension (2026)0.00
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)0.00
- Mm-embed: Universal Multimodal Retrieval With Multimodal Llms (2024)0.00
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With Mllms (2026)0.00
- Generative Giants, Retrieval Weaklings: Why Do Multimodal Large Language Models Fail At Multimodal Retrieval? (2025)0.00
- Preserving Semantic Neighborhoods For Robust Cross-modal Retrieval (2020)10.07
- Generative Cross-modal Retrieval: Memorizing Images In Multimodal Language Models For Retrieval And Beyond (2024)8.35
- Beyond Global Similarity: Towards Fine-grained, Multi-condition Multimodal Retrieval (2026)2.20