Unified Vision-language Modeling Via Concept Space Alignment
2026 · Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk
Abstract
We introduce V-SONAR, a vision-language embedding space that extends the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al., 2024), which operates in SONAR and is trained on English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language
Related papers
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026), 0.00
- UniVSE: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019), 0.00
- VLM2GeoVec: Toward Universal Multimodal Embeddings For Remote Sensing (2025), 0.00
- Unicoder-VL: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019), 20.24
- Learning Language-visual Embedding For Movie Understanding With Natural-language (2016), 0.00
- Lost In Embeddings: Information Loss In Vision-language Models (2025), 0.00
- Blind To Position, Biased In Language: Probing Mid-layer Representational Bias In Vision-language Encoders For Zero-shot Language-grounded Spatial Understanding (2025), 0.00
- Linear Spaces Of Meanings: Compositional Structures In Vision-language Models (2023), 9.41