Improve Multi-modal Embedding Learning Via Explicit Hard Negative Gradient Amplifying
2025 · Youze Xue, Dian Li, Gang Liu
Abstract
With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive,
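For reference, the kind of gradient analysis the abstract describes can be illustrated with the standard InfoNCE loss for a single query. The sketch below is not the authors' amplification method. It assumes dot-product similarity on fixed embeddings, a temperature `tau`, and hypothetical function names, and it shows only the well-known closed-form gradients: each negative's gradient is scaled by its softmax probability, so harder negatives contribute more.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_probs(q, pos, negs, tau):
    """Softmax over [positive, negatives] similarities, scaled by 1/tau."""
    logits = [dot(q, pos) / tau] + [dot(q, n) / tau for n in negs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def info_nce(q, pos, negs, tau=0.5):
    """Standard InfoNCE loss for one query: -log softmax prob of the positive."""
    return -math.log(softmax_probs(q, pos, negs, tau)[0])


def info_nce_grads(q, pos, negs, tau=0.5):
    """Closed-form gradients of the loss w.r.t. query, positive, and negatives.

    Assumption: embeddings are treated as free variables, so no gradient
    flows through a normalization step.
    """
    probs = softmax_probs(q, pos, negs, tau)
    p_pos, p_negs = probs[0], probs[1:]
    dim = len(q)
    # d loss / d pos = (p_pos - 1) * q / tau
    grad_pos = [(p_pos - 1.0) * q[d] / tau for d in range(dim)]
    # d loss / d neg_i = p_i * q / tau  (larger for harder negatives)
    grad_negs = [[p * q[d] / tau for d in range(dim)] for p in p_negs]
    # d loss / d q = ((p_pos - 1) * pos + sum_i p_i * neg_i) / tau
    grad_q = [
        ((p_pos - 1.0) * pos[d] + sum(p * n[d] for p, n in zip(p_negs, negs))) / tau
        for d in range(dim)
    ]
    return grad_q, grad_pos, grad_negs, p_negs


if __name__ == "__main__":
    # Illustrative values only: one hard negative (close to q), one easy negative.
    q, pos = [0.6, 0.8], [0.8, 0.6]
    negs = [[0.7, 0.7], [-1.0, 0.0]]
    _, _, grad_negs, p_negs = info_nce_grads(q, pos, negs)
    for p, g in zip(p_negs, grad_negs):
        print("softmax weight:", p, "gradient norm:", math.sqrt(dot(g, g)))
```

In this formulation, the weight on each negative's gradient is exactly its softmax probability, which is why negatives that score close to the positive dominate the update.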
Related papers
- LLaVE: Large Language And Vision Embedding Models With Hardness-weighted Contrastive Learning (2025)
- NV-Retriever: Improving Text Embedding Models With Effective Hard-negative Mining (2024)
- MM-Embed: Universal Multimodal Retrieval With Multimodal LLMs (2024)
- UniME-V2: MLLM-as-a-Judge For Universal Multimodal Embedding Learning (2025)
- VSE++: Improving Visual-semantic Embeddings With Hard Negatives (2017)
- LoOp: Looking For Optimal Hard Negative Embeddings For Deep Metric Learning (2021)
- No Captions, No Problem: Captionless 3D-CLIP Alignment With Hard Negatives Via CLIP Knowledge And LLMs (2024)
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025)