Improving Visual-semantic Embeddings By Learning Semantically-enhanced Hard Negatives For Cross-modal Information Retrieval
2022 · Yan Gong, Georgina Cosma
Abstract
Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions and embed them into the same latent space for cross-modal information retrieval. Most existing VSE networks are trained with a hard negatives loss function that learns an objective margin between the similarity of relevant and irrelevant image-description embedding pairs. However, the objective margin in the hard negatives loss function is set as a fixed hyperparameter, which ignores the semantic differences between irrelevant image-description pairs. To address the challenge of measuring the optimal similarities between image-description pairs before the trained VSE networks are available, this paper presents a novel approach with two main parts: (1) finding the underlying semantics of image descriptions; and (2) a novel semantically enhanced hard negatives loss function, where the learning objective is dynamically determined based on the optimal similarity scores between
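For orientation, the fixed-margin hard negatives loss that the abstract takes as its starting point can be sketched as follows. This is a minimal NumPy illustration of the commonly used hardest-negative hinge formulation, not the authors' code and not the semantically enhanced loss proposed in the paper. The function name `hard_negative_loss`, the default `margin=0.2`, and the use of cosine similarity are assumptions made for the example.

```python
import numpy as np


def hard_negative_loss(image_emb, text_emb, margin=0.2):
    """Hardest-negative hinge loss with a fixed margin (illustrative sketch).

    Row i of image_emb and row i of text_emb are assumed to be a matching
    image-description pair; every other row in the batch is a negative.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    sim = image_emb @ text_emb.T      # sim[i, j] = similarity(image i, text j)
    pos = np.diag(sim)                # similarities of the matching pairs

    # Exclude the matching pairs when searching for negatives.
    neg = np.where(np.eye(len(sim), dtype=bool), -np.inf, sim)
    hardest_text = neg.max(axis=1)    # hardest description for each image
    hardest_image = neg.max(axis=0)   # hardest image for each description

    # The same fixed margin is applied to every pair, whatever its semantics.
    loss_i2t = np.maximum(0.0, margin + hardest_text - pos)
    loss_t2i = np.maximum(0.0, margin + hardest_image - pos)
    return float((loss_i2t + loss_t2i).sum())
```

Here `margin` is a single scalar shared by all pairs, which is the fixed hyperparameter the abstract identifies as the limitation.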
Related papers
- VSE++: Improving Visual-semantic Embeddings With Hard Negatives (2017)
- Contrastive Learning Of Visual-semantic Embeddings (2021)
- Dynamic Visual Semantic Sub-embeddings And Fast Re-ranking (2023)
- UniVSE: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019)
- LLaVE: Large Language And Vision Embedding Models With Hardness-weighted Contrastive Learning (2025)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)
- NV-Retriever: Improving Text Embedding Models With Effective Hard-negative Mining (2024)
- LoOp: Looking For Optimal Hard Negative Embeddings For Deep Metric Learning (2021)