VSE++: Improving Visual-semantic Embeddings With Hard Negatives
2017 · Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, et al.
Abstract
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
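The abstract does not spell out the "simple change" to the loss. As VSE++ is usually described, the hinge ranking loss is computed over only the hardest negative in the mini-batch instead of being summed over all negatives. The NumPy sketch below illustrates that contrast under stated assumptions: the function name `hinge_ranking_losses`, the margin of 0.2, and the use of dot-product similarity on pre-normalised embeddings are illustrative choices, not details taken from the abstract.

```python
import numpy as np

def hinge_ranking_losses(im, cap, margin=0.2):
    """Return (sum-of-hinges, max-of-hinges) for a batch of matching pairs.

    im, cap: arrays of shape (n, d); row i of `im` matches row i of `cap`.
    margin=0.2 is an assumed value for illustration.
    """
    scores = im @ cap.T              # scores[i, j]: image i vs. caption j
    pos = np.diag(scores)            # similarity of each matching pair

    # Hinge cost of caption j acting as a negative for image i.
    cost_cap = np.maximum(0.0, margin + scores - pos[:, None])
    # Hinge cost of image i acting as a negative for caption j.
    cost_im = np.maximum(0.0, margin + scores - pos[None, :])

    # Matching pairs are not negatives, so zero out the diagonal.
    np.fill_diagonal(cost_cap, 0.0)
    np.fill_diagonal(cost_im, 0.0)

    # Baseline: sum the hinge over every negative in the batch.
    sum_of_hinges = cost_cap.sum() + cost_im.sum()
    # Hard-negative variant: keep only the worst violator per query.
    max_of_hinges = cost_cap.max(axis=1).sum() + cost_im.max(axis=0).sum()
    return sum_of_hinges, max_of_hinges
```

Because each per-query maximum is one of the non-negative terms in the corresponding sum, the hard-negative loss never exceeds the summed loss on the same batch; it concentrates the gradient on the single most confusing negative for each query.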
Related papers
- Improving Visual-semantic Embeddings By Learning Semantically-enhanced Hard Negatives For Cross-modal Information Retrieval (2022)
- NV-Retriever: Improving Text Embedding Models With Effective Hard-negative Mining (2024)
- UniVSE: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019)
- Improve Multi-modal Embedding Learning Via Explicit Hard Negative Gradient Amplifying (2025)
- LLaVE: Large Language And Vision Embedding Models With Hardness-weighted Contrastive Learning (2025)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)
- Dynamic Visual Semantic Sub-embeddings And Fast Re-ranking (2023)
- Contrastive Learning Of Visual-semantic Embeddings (2021)