Blind To Position, Biased In Language: Probing Mid-layer Representational Bias In Vision-language Encoders For Zero-shot Language-grounded Spatial Understanding
2025 · Na Min An, Inha Kang, Minhyun Lee, et al.
Abstract
Vision-Language Encoders (VLEs) are widely adopted as the backbone of zero-shot referring image segmentation (RIS), enabling text-guided localization without task-specific training. However, prior work has left underexplored the biases within mid-layer representations that preserve positional and language-specific information. Through layer-wise investigation, we reveal that the conventionally used final-layer multimodal embeddings prioritize global semantic alignment, leading to two coupled consequences. First, vision embeddings exhibit weak sensitivity to positional cues. Second, multilingual text embeddings exhibit language-dependent geometric shifts within the shared space. Motivated by these findings, we identify an underexplored pathway within VLE mid-layers to construct a spatial map, which improves zero-shot RIS by 1-7 mIoU on nine RefCOCO benchmarks. Furthermore, leveraging mixed-language mid-layer embeddings yields enhanced spatial grounding accuracy (+7-8 mIoU and
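To make the idea of a text-conditioned spatial map concrete, here is a minimal sketch. It is not the authors' method. It assumes patch embeddings from some mid layer and a text embedding are already available in a shared space, and it scores each patch by cosine similarity to the text. The function names, the toy vectors, and the 0.5 threshold are hypothetical. mIoU is taken here as the mean of per-sample IoU, which is also an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def spatial_map(patch_grid, text_emb):
    """Score every patch in an H x W grid of embeddings against the text embedding."""
    return [[cosine(patch, text_emb) for patch in row] for row in patch_grid]

def binarize(sim_map, threshold):
    """Turn a similarity map into a 0/1 mask (the threshold is an assumed hyperparameter)."""
    return [[1 if s >= threshold else 0 for s in row] for row in sim_map]

def iou(pred, target):
    """Intersection over union of two 0/1 masks of the same shape."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return inter / union if union else 1.0

def mean_iou(pairs):
    """Mean of per-sample IoU over (pred, target) mask pairs."""
    scores = [iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)

# Toy 2 x 2 grid of 2-d "mid-layer" patch embeddings (hypothetical values).
patches = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
text = [1.0, 0.0]           # hypothetical text embedding
target = [[1, 0], [0, 0]]   # hypothetical ground-truth mask

sim = spatial_map(patches, text)
mask = binarize(sim, threshold=0.5)
print(mask, iou(mask, target))
```

With a real encoder, the patch grid would come from the hidden states of an intermediate layer instead of the final one. Which layer to use, and how to project it into the text space, are exactly the questions the abstract raises and this sketch leaves open.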
Related papers
- Lost In Embeddings: Information Loss In Vision-language Models (2025) 0.00
- VLM2GeoVec: Toward Universal Multimodal Embeddings For Remote Sensing (2025) 0.00
- BendVLM: Test-time Debiasing Of Vision-language Embeddings (2024) 4.52
- A Multimodal Recaptioning Framework To Account For Perceptual Diversity Across Languages In Vision-language Modeling (2025) 0.00
- Addressing Bias In VLMs For Glaucoma Detection Without Protected Attribute Supervision (2025) 0.00
- Language Features Matter: Effective Language Representations For Vision-language Tasks (2019) 8.60
- Unveiling Deep Semantic Uncertainty Perception For Language-anchored Multi-modal Vision-brain Alignment (2025) 0.00
- Unified Vision-language Modeling Via Concept Space Alignment (2026) 0.00