Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data
2025 · Dahyun Chung, Donghyun Shin, Yujin Sung, et al.
Abstract
Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even
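The training setup described in the abstract (frozen encoders, a small trainable projection, a contrastive loss against English anchors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The single linear projection, the symmetric InfoNCE form, the temperature of 0.07, and all names and dimensions below are assumptions chosen for illustration, and random arrays stand in for the frozen encoder outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def project(x, W, b):
    """Hypothetical linear projection head: the only trainable part in this sketch."""
    return x @ W + b

def info_nce(z, anchors, temperature=0.07):
    """Symmetric InfoNCE loss: row i of z should match row i of anchors."""
    logits = l2_normalize(z) @ l2_normalize(anchors).T / temperature

    def diag_cross_entropy(l):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

rng = np.random.default_rng(0)
batch, d_in, d_out = 8, 16, 12                    # toy sizes, not the paper's
frozen_feats = rng.normal(size=(batch, d_in))     # stand-in for frozen encoder outputs
anchors = rng.normal(size=(batch, d_out))         # stand-in for English anchor embeddings

W = rng.normal(scale=0.1, size=(d_in, d_out))     # trainable weights (assumed shape)
b = np.zeros(d_out)                               # trainable bias

loss = info_nce(project(frozen_feats, W, b), anchors)
n_trainable = W.size + b.size                     # parameters a training loop would update
```

In a real training loop only `W` and `b` would receive gradient updates while the encoders that produce `frozen_feats` and `anchors` stay fixed, which is what keeps the trainable parameter count small.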
Related papers
- Lightclip: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) 0.00
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024) 0.00
- Clip-lite: Information Efficient Visual Representation Learning With Language Supervision (2021) 2.35
- Efficientclip: Efficient Cross-modal Pre-training By Ensemble Confident Learning And Language Modeling (2021) 0.00
- CLIP-PING: Boosting Lightweight Vision-language Models With Proximus Intrinsic Neighbors Guidance (2024) 0.00
- Contrastive Language-image Pre-training For The Italian Language (2021) 0.00
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025) 5.74
- Clip-vip: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022) 5.42