Rebalancing Contrastive Alignment With Bottlenecked Semantic Increments In Text-video Retrieval
2025 · Jian Xiao, Zijie Song, Jialong Hu, et al.
Abstract
Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment \(\Delta_{ij}\) between text \(t_i\) and video \(v_j\), redistributing gradients to relieve optimization tension and absorb noise. We derive \(\Delta_{ij}\) via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize \(\Delta\)
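To make the Taylor-expansion idea concrete, here is a minimal NumPy sketch of a first-order trust-region step on an InfoNCE loss. It is not the GARE module described above, which is a learned network. Several things are assumptions made only for illustration: the increment is added to the text anchor (so the logit for video \(v_j\) uses \(t_i + \Delta_{ij}\)), similarity is an unnormalized dot product, the temperature `tau` is 1.0, each pair has its own norm bound `eps`, and the function names are hypothetical. Under those assumptions, minimizing the linearized loss subject to \(\|\Delta_{ij}\| \le \epsilon\) gives \(\Delta_{ij} = -\epsilon\, g_{ij} / \|g_{ij}\|\), where \(g_{ij}\) is the gradient of the loss with respect to \(\Delta_{ij}\) at zero.

```python
import numpy as np

def info_nce_with_increment(t, V, delta, pos=0, tau=1.0):
    """InfoNCE loss for one text anchor t (D,) against videos V (B, D).

    delta (B, D) holds one increment per (text, video) pair; the logit for
    video j uses the shifted anchor t + delta[j] (an illustrative assumption).
    """
    logits = np.einsum("bd,bd->b", t + delta, V) / tau
    logits = logits - logits.max()  # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[pos]

def trust_region_increment(t, V, pos=0, tau=1.0, eps=0.1):
    """Minimize the first-order Taylor expansion of the loss around
    delta = 0, subject to ||delta[j]|| <= eps for every pair j."""
    logits = V @ t / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    target = np.zeros_like(probs)
    target[pos] = 1.0
    # Gradient of the loss w.r.t. delta[j] at delta = 0: (p_j - y_j) * v_j / tau
    grad = (probs - target)[:, None] * V / tau
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    # Step of length eps along the steepest-descent direction of each pair
    return -eps * grad / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
t = rng.normal(size=8)        # one text embedding
V = rng.normal(size=(4, 8))   # four video embeddings; V[0] is the positive
delta = trust_region_increment(t, V)
base = info_nce_with_increment(t, V, np.zeros_like(V))
shifted = info_nce_with_increment(t, V, delta)
print(f"loss without increment: {base:.4f}, with increment: {shifted:.4f}")
```

In this toy setting the step moves the anchor toward the positive video and away from each negative, so the loss with the increment is lower than the loss without it. The anchor embedding itself is left unchanged, which is one way to read the abstract's point about relieving tension on the anchor.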
Related papers
- Support-set Bottlenecks For Video-text Representation Learning (2020) · 0.00
- Normalized Contrastive Learning For Text-video Retrieval (2022) · 6.77
- Relevance-based Margin For Contrastively-trained Video Retrieval Models (2022) · 7.74
- Dual-modal Attention-enhanced Text-video Retrieval With Triplet Partial Margin Contrastive Learning (2023) · 8.82
- TC-MGC: Text-conditioned Multi-grained Contrastive Learning For Text-video Retrieval (2025) · 6.93
- Improving Video Retrieval By Adaptive Margin (2023) · 9.92
- Learning Audio-guided Video Representation With Gated Attention For Video-text Retrieval (2025) · 5.24
- Towards Balanced Alignment: Modal-enhanced Semantic Modeling For Video Moment Retrieval (2023) · 14.33