ViCLIP-OT: The First Foundation Vision-language Model For Vietnamese Image-text Retrieval With Optimal Transport
2026 · Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
Abstract
Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage p
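For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how Recall@K and its average over several cutoffs are commonly computed from a query-by-candidate similarity matrix. It is an illustration, not the authors' evaluation code: the function names `recall_at_k` and `average_recall`, the cutoffs (1, 5, 10), the toy matrix, and the assumption of exactly one correct candidate per query (at the same index) are all hypothetical choices made here. The benchmarks named in the abstract may pair several captions with each image, and the abstract does not state which cutoffs are averaged.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    Assumption for this sketch: query i matches candidate i and nothing else.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def average_recall(similarity, ks=(1, 5, 10)):
    """Mean of Recall@K over the chosen cutoffs, expressed as a percentage.

    The default cutoffs are an assumption, not taken from the paper.
    """
    return 100.0 * sum(recall_at_k(similarity, k) for k in ks) / len(ks)


# Toy text-to-image similarity matrix: rows are captions, columns are images.
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

print(recall_at_k(toy, 1))  # 2 of 3 captions retrieve their image first
print(recall_at_k(toy, 2))  # all 3 captions find their image in the top 2
```

Under this reading, a gain of 5.75 percentage points in average Recall@K means the averaged hit rate across the cutoffs rose by that amount, not that every individual cutoff improved by the same margin.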
Related papers
- ELIP: Enhanced Visual-language Foundation Models For Image Retrieval (2025)
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021)
- LowCLIP: Adapting The CLIP Model Architecture For Low-resource Languages In Multimodal Image Retrieval Task (2024)
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022)
- MobileViCLIP: An Efficient Video-text Model For Mobile Devices (2025)
- uCLIP: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024)