Text-adaptive Multiple Visual Prototype Matching For Video-text Retrieval
2022 · Chengzhi Lin, Ancong Wu, Junwei Liang, et al.
Abstract
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information and the query text only describes a part of the information. Thus, a video can correspond to multiple different text descriptions and queries. We call this phenomenon the "Video-Text Correspondence Ambiguity" problem. Current techniques mostly concentrate on mining local or multi-level alignment between contents of a video and text (e.g., object to entity and action to verb). It is difficult for these methods to alleviate the video-text correspondence ambiguity by describing a video using only one single feature, which is required to be matched with multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptive aggre
Related papers
- Prototypes Are Balanced Units For Efficient And Effective Partially Relevant Video Retrieval (2025) 0.00
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023) 15.46
- ProTA: Probabilistic Token Aggregation For Text-video Retrieval (2024) 4.52
- Text-video Retrieval Via Variational Multi-modal Hypergraph Networks (2024) 0.00
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) 9.76
- Towards Fast Adaptation Of Pretrained Contrastive Models For Multi-channel Video-language Retrieval (2022) 7.50
- T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025) 0.95
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) 18.12