Unified Loss Of Pair Similarity Optimization For Vision-language Retrieval
2022 · Zheng Li, Caili Guo, Xin Wang, et al.
Abstract
Two loss functions are popular in vision-language retrieval: triplet loss and contrastive learning loss. Both essentially minimize the difference between the similarities of negative pairs and positive pairs. More specifically, Triplet loss with Hard Negative mining (Triplet-HN), which is widely used in existing retrieval models to improve discriminative ability, easily falls into local minima during training. On the other hand, Vision-Language Contrastive learning loss (VLC), which is widely used in vision-language pre-training, has been shown to achieve significant performance gains on vision-language retrieval, but fine-tuning with VLC on small datasets does not perform satisfactorily. This paper proposes a unified loss of pair similarity optimization for vision-language retrieval, providing a powerful tool for understanding existing loss functions. Our unified loss includes the hard sample mining strategy of VLC and introduces the margin used
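To make the two baseline losses named in the abstract concrete, here is a minimal sketch of their commonly used textbook forms, computed over a square image-text similarity matrix whose diagonal holds the positive pairs. It is not the paper's unified loss. The function names, the margin of 0.2, and the temperature of 0.05 are illustrative assumptions, not values taken from the paper.

```python
import math


def _logsumexp(xs):
    # Numerically stable log(sum(exp(x))) for a list of floats.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def triplet_hn_loss(sim, margin=0.2):
    """Triplet loss with hardest-negative mining (assumed standard form).

    sim[i][j] is the similarity of image i and text j; sim[i][i] is the
    positive pair. Requires at least two pairs so a negative exists.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative text for image i, hardest negative image for text i.
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        # Hinge: penalize only when the negative is within `margin` of the positive.
        total += max(0.0, margin + hardest_text - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total / n


def vlc_loss(sim, temperature=0.05):
    """Symmetric InfoNCE-style contrastive loss (assumed standard form)."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        row = [sim[i][j] / temperature for j in range(n)]  # image i vs. all texts
        col = [sim[j][i] / temperature for j in range(n)]  # text i vs. all images
        # Cross-entropy with the positive pair as the target class.
        total += _logsumexp(row) - row[i]
        total += _logsumexp(col) - col[i]
    return total / (2 * n)


if __name__ == "__main__":
    sim = [[0.5, 0.6], [0.4, 0.5]]  # toy similarities, hypothetical values
    print(triplet_hn_loss(sim), vlc_loss(sim))
```

In this sketch both functions shrink as positive-pair similarities rise above negative-pair similarities. The triplet form looks only at the single hardest negative per query and stops penalizing once the margin is met, while the contrastive form weights every negative in the batch through the softmax.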
Related papers
- Equivariant Similarity For Vision-language Foundation Models (2023) 13.78
- Two-stage Triplet Loss Training With Curriculum Augmentation For Audio-visual Retrieval (2023) 0.00
- Comparing Contrastive And Triplet Loss: Variance Analysis And Optimization Behavior (2025) 0.00
- Contrastive Learning Of Visual-semantic Embeddings (2021) 0.00
- Dual-modal Attention-enhanced Text-video Retrieval With Triplet Partial Margin Contrastive Learning (2023) 8.82
- Dissecting The Impact Of Different Loss Functions With Gradient Surgery (2022) 0.00
- Sparse And Dense Retrievers Learn Better Together: Joint Sparse-dense Optimization For Text-image Retrieval (2025) 0.00
- LLaVE: Large Language And Vision Embedding Models With Hardness-weighted Contrastive Learning (2025) 3.58