Modest-Align: Data-efficient Alignment For Vision-language Models
2025 · Jiaxiang Liu, Yuan Wang, Jiawei Du, et al.
Abstract
Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies -- Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing, which calibrates similarity distributions in the embedding space. These mechanisms collectively reduce
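The abstract names the two strategies but does not specify their form. As a rough illustration only, the sketch below assumes that Random Perturbation is additive Gaussian noise on normalized embeddings and that Embedding Smoothing behaves like label smoothing on the contrastive targets, so that non-matching pairs in the batch receive a small share of the probability mass instead of zero. The function names, the parameters `sigma`, `eps`, and `temperature`, and the image-to-text-only loss are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def perturb(v, sigma, rng):
    """Assumed form of Random Perturbation: add Gaussian noise, then re-normalize."""
    return normalize([x + rng.gauss(0.0, sigma) for x in v])


def smoothed_targets(n, i, eps):
    """Assumed form of Embedding Smoothing: soft targets over a batch of n pairs.

    The matching pair i gets 1 - eps; the other n - 1 pairs share eps equally.
    """
    off = eps / (n - 1)
    return [1.0 - eps if j == i else off for j in range(n)]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def smoothed_contrastive_loss(img, txt, sigma=0.05, eps=0.1,
                              temperature=0.07, seed=0):
    """Image-to-text contrastive loss with noisy embeddings and soft targets.

    img and txt are equal-length lists of embedding vectors (batch size >= 2);
    row i of img is paired with row i of txt.
    """
    rng = random.Random(seed)
    img = [perturb(normalize(v), sigma, rng) for v in img]
    txt = [perturb(normalize(v), sigma, rng) for v in txt]
    n = len(img)
    total = 0.0
    for i in range(n):
        # Cosine similarities of image i against every text, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img[i], txt[j])) / temperature
                  for j in range(n)]
        logp = log_softmax(logits)
        target = smoothed_targets(n, i, eps)
        # Cross-entropy between the soft targets and the predicted distribution.
        total -= sum(t * lp for t, lp in zip(target, logp))
    return total / n
```

With `sigma=0` and `eps=0` this reduces to the usual single-positive contrastive objective, which makes the effect of each assumed component easy to isolate: raising `eps` penalizes very peaked similarity distributions, and raising `sigma` makes the loss less sensitive to small shifts in any one embedding.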
Related papers
- Variance-aware Loss Scheduling For Multimodal Alignment In Low-data Settings (2025)
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025)
- Curriculum Learning For Data-efficient Vision-language Alignment (2022)
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)
- Isoclip: Decomposing CLIP Projectors For Efficient Intra-modal Alignment (2026)
- Himo-clip: Modeling Semantic Hierarchy And Monotonicity In Vision-language Alignment (2025)