X-Pool: Cross-modal Language-video Attention For Text-video Retrieval
2022 · Satya Krishna Gorti, Noel Vouitsis, Junwei Ma, et al.
Abstract
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts. Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos. Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison. Yet, most existing works aggregate entire videos without directly considering text. Common text-agnostic aggregation schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot-product attention for a
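The contrast the abstract draws between text-agnostic mean-pooling and text-conditioned attention can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the abstract is cut off before it details the mechanism, so the sketch assumes the text embedding acts as the query and the frame embeddings act as keys and values, with no learned projections. The function names (`mean_pool`, `text_conditioned_pool`, `cosine`) and the toy embeddings are hypothetical.

```python
import numpy as np


def mean_pool(frames):
    """Text-agnostic baseline: average all frame embeddings."""
    return frames.mean(axis=0)


def text_conditioned_pool(text, frames):
    """Scaled dot-product attention with the text as the query.

    Assumption: frames serve as both keys and values, and no learned
    projection matrices are applied (a simplification for illustration).
    """
    d = text.shape[-1]
    scores = frames @ text / np.sqrt(d)      # one score per frame
    scores = scores - scores.max()           # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()        # attention weights sum to 1
    return weights @ frames, weights         # weighted sum of frames


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy data (hypothetical): four mutually orthogonal frame embeddings,
# only the first of which matches the text embedding.
text = np.array([4.0, 0.0, 0.0, 0.0])
frames = 4.0 * np.eye(4)

pooled, weights = text_conditioned_pool(text, frames)
sim_attention = cosine(text, pooled)
sim_mean = cosine(text, mean_pool(frames))

print("attention weights:", np.round(weights, 4))
print("similarity with text-conditioned pooling:", round(sim_attention, 4))
print("similarity with mean-pooling:", round(sim_mean, 4))
```

In this toy setting the attention weights concentrate on the one frame that matches the text, so the pooled video embedding stays close to the text, whereas mean-pooling dilutes that frame with the three unrelated ones.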
Related papers
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) · 18.12
- ProTA: Probabilistic Token Aggregation For Text-video Retrieval (2024) · 4.52
- Dual-modal Attention-enhanced Text-video Retrieval With Triplet Partial Margin Contrastive Learning (2023) · 8.82
- Tencent Text-video Retrieval: Hierarchical Cross-modal Interactions With Multi-level Representations (2022) · 7.81
- Text-adaptive Multiple Visual Prototype Matching For Video-text Retrieval (2022) · 4.52
- TeachCLIP: Multi-grained Teaching For Efficient Text-to-video Retrieval (2023) · 0.00
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) · 11.93
- TEACHTEXT: Crossmodal Generalized Distillation For Text-video Retrieval (2021) · 15.43