Lost In Embeddings: Information Loss In Vision-language Models
2025 · Wenyan Li, Raphael Tang, Chengzu Li, et al.
Abstract
Vision-language models (VLMs) often process visual inputs through a pretrained vision encoder, followed by a projection into the language model's embedding space via a connector component. While crucial for modality fusion, the potential information loss induced by this projection step and its direct impact on model capabilities remain understudied. We introduce two complementary approaches to examine and quantify this loss by analyzing the latent representation space. First, we evaluate semantic information preservation by analyzing changes in k-nearest neighbor relationships between image representations, before and after projection. Second, we directly measure information loss by reconstructing visual embeddings from the projected representation, localizing loss at an image patch level. Experiments reveal that connectors substantially distort the local geometry of visual representations, with k-nearest neighbors diverging by 40-60% post-projection, correlating with degradation in
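The first approach in the abstract, comparing k-nearest neighbor relationships before and after projection, can be illustrated with a small sketch. This is not the authors' implementation: the cosine-similarity metric, the function names `knn_indices` and `knn_overlap`, and the random linear map standing in for a connector are all assumptions made for illustration.

```python
import numpy as np

def knn_indices(emb, k):
    """Return the indices of the k nearest neighbors of each row of emb.

    Assumption: neighbors are ranked by cosine similarity.
    """
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # an item is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def knn_overlap(pre, post, k=10):
    """Mean fraction of each item's k nearest neighbors shared by both spaces."""
    pre_nn = knn_indices(pre, k)
    post_nn = knn_indices(post, k)
    shared = [len(set(a) & set(b)) / k for a, b in zip(pre_nn, post_nn)]
    return float(np.mean(shared))

# Synthetic stand-ins: 200 "image embeddings" and a hypothetical
# random linear connector that maps 64 dimensions down to 32.
rng = np.random.default_rng(0)
pre = rng.normal(size=(200, 64))
projection = rng.normal(size=(64, 32))
post = pre @ projection

overlap = knn_overlap(pre, post, k=10)
print(f"k-NN overlap: {overlap:.2f}, divergence: {1 - overlap:.2f}")
```

An overlap of 1.0 means the projection leaves every neighborhood intact; under this reading, the 40-60% divergence reported in the abstract would correspond to an overlap of roughly 0.4 to 0.6.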
Related papers
- Blind To Position, Biased In Language: Probing Mid-layer Representational Bias In Vision-language Encoders For Zero-shot Language-grounded Spatial Understanding (2025) 0.00
- Linear Spaces Of Meanings: Compositional Structures In Vision-language Models (2023) 9.41
- Infusing Fine-grained Visual Knowledge To Vision-language Models (2025) 0.00
- Probvlm: Probabilistic Adapter For Frozen Vision-language Models (2023) 13.41
- Analyzing Diffusion And Autoregressive Vision Language Models In Multimodal Embedding Space (2026) 0.00
- ARGENT: Adaptive Hierarchical Image-text Representations (2026) 0.00
- Bendvlm: Test-time Debiasing Of Vision-language Embeddings (2024) 4.52
- Unified Vision-language Modeling Via Concept Space Alignment (2026) 0.00