Abstract

Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment \(\Delta_{ij}\) between text \(t_i\) and video \(v_j\), redistributing gradients to relieve optimization tension and absorb noise. We derive \(\Delta_{ij}\) via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize \(\Delta\)
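
As a rough illustration of the first-order, trust-region argument mentioned in the abstract (a generic sketch, not necessarily the paper's exact derivation): write \(\mathcal{L}(\Delta)\) for the InfoNCE loss viewed as a function of all increments, and let \(g_{ij} = \partial \mathcal{L} / \partial \Delta_{ij}\) be evaluated at \(\Delta = 0\). The symbols \(g_{ij}\) and \(\epsilon\), the per-pair form of the constraint, and the choice of the Euclidean norm are assumptions introduced here for illustration only.

\[
\mathcal{L}(\Delta) \approx \mathcal{L}(0) + \sum_{i,j} \langle g_{ij}, \Delta_{ij} \rangle,
\qquad
\Delta_{ij}^{\star}
= \arg\min_{\|\Delta_{ij}\|_2 \le \epsilon} \langle g_{ij}, \Delta_{ij} \rangle
= -\epsilon \, \frac{g_{ij}}{\|g_{ij}\|_2}
\quad (g_{ij} \neq 0).
\]

Under these assumptions, each increment points along the negative gradient for its own pair and lowers the linearized loss by \(\epsilon \|g_{ij}\|_2\), which is one way to read "locally consistent descent directions".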

Authors

(none)

Tags

  • Image Retrieval
