ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
2024 · Han Fang, Xianghao Zang, Chao Ban, et al.
Abstract
Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, the model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representation and maintain the feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeM
Related papers
- T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval (2025) · 0.95
- UATVR: Uncertainty-Adaptive Text-Video Retrieval (2023) · 15.46
- TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval (2024) · 2.86
- Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval (2022) · 4.52
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval (2022) · 15.54
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning (2023) · 8.82
- Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval (2025) · 5.24
- Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval (2025) · 5.84