UATVR: Uncertainty-adaptive Text-video Retrieval
2023 · Bo Fang, Wenhao Wu, Chang Liu, et al.
Abstract
With the explosive growth of web videos and the emergence of large-scale vision-language pre-training models such as CLIP, retrieving videos of interest with text instructions has attracted increasing attention. A common practice is to map text-video pairs into the same embedding space and craft cross-modal interactions between certain entities at specific granularities for semantic correspondence. Unfortunately, the intrinsic uncertainty about the optimal combination of entities at appropriate granularities for cross-modal queries is understudied, which is especially critical for modalities with hierarchical semantics, such as video and text. In this paper, we propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure. Concretely, we add learnable tokens to the encoders to adaptively aggregate multi-grained semantics for flexible high-level reasoning. In the refined embedding space, we represent text-video pair
Related papers
- ProTA: Probabilistic Token Aggregation For Text-video Retrieval (2024)
- Ambiguity-restrained Text-video Representation Learning For Partially Relevant Video Retrieval (2025)
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023)
- Delving Deeper: Hierarchical Visual Perception For Robust Video-text Retrieval (2026)
- Tree-augmented Cross-modal Encoding For Complex-query Video Retrieval (2020)
- Text-adaptive Multiple Visual Prototype Matching For Video-text Retrieval (2022)
- Learning Audio-guided Video Representation With Gated Attention For Video-text Retrieval (2025)
- T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025)