Audio-enhanced Text-to-video Retrieval Using Text-conditioned Feature Alignment
2023 · Sarah Ibrahimi, Xiaohang Sun, Pichao Wang, et al.
Abstract
Text-to-video retrieval systems have recently made significant progress by utilizing pre-trained models trained on large-scale image-text pairs. However, most of the latest methods primarily focus on the video modality while disregarding the audio signal for this task. Nevertheless, a recent advancement by ECLIPSE has improved long-range text-to-video retrieval by developing an audiovisual video representation. Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment. To address this issue, we introduce TEFAL, a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query. Instead of using only an audiovisual attention block, which could suppress the audio information relevant to the text query, our approach employs two independent cross-modal attention blocks tha
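The idea of conditioning audio and video representations on the text query can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is cut off before the details, so the function names (`text_conditioned_pool`, `retrieval_score`), the use of plain dot-product attention without learned projections, and the additive fusion of the two pooled vectors are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_pool(text, feats):
    """Pool a feature sequence (n, d) using a single text vector (d,) as the query.

    Hypothetical helper: scaled dot-product attention with no learned weights.
    Returns the pooled vector (d,) and the attention weights (n,).
    """
    d = text.shape[-1]
    scores = feats @ text / np.sqrt(d)  # relevance of each frame/segment to the text
    weights = softmax(scores)           # weights sum to 1 over the sequence
    return weights @ feats, weights

def retrieval_score(text, video_feats, audio_feats):
    """Cosine similarity between the text and a fused audio-video embedding."""
    # Two independent attention passes: one over video, one over audio.
    v, _ = text_conditioned_pool(text, video_feats)
    a, _ = text_conditioned_pool(text, audio_feats)
    fused = v + a  # ASSUMPTION: simple additive fusion, chosen only for brevity
    return float(fused @ text / (np.linalg.norm(fused) * np.linalg.norm(text)))

# Toy usage with random features (5 video frames, 3 audio segments, dimension 8).
rng = np.random.default_rng(0)
text = rng.normal(size=8)
video = rng.normal(size=(5, 8))
audio = rng.normal(size=(3, 8))
score = retrieval_score(text, video, audio)
```

Because each modality is pooled by its own attention pass, audio segments relevant to the query are weighted independently of how well they align with the video frames, which is the motivation the abstract gives for avoiding a single audiovisual attention block.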
Related papers
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021)
- Learning Audio-guided Video Representation With Gated Attention For Video-text Retrieval (2025)
- T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025)
- Tagging Before Alignment: Integrating Multi-modal Tags For Video-text Retrieval (2023)
- TEACHTEXT: Crossmodal Generalized Distillation For Text-video Retrieval (2021)
- TeachCLIP: Multi-grained Teaching For Efficient Text-to-video Retrieval (2023)
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021)
- Video-ColBERT: Contextualized Late Interaction For Text-to-video Retrieval (2025)