Lightweight Attentional Feature Fusion: A New Baseline For Text-to-video Retrieval
2021 · Fan Hu, Aozhu Chen, Ziyue Wang, et al.
Abstract
In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research that considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
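The idea of fusing features through a convex combination can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe the architecture, so the tanh projection into a shared space, the single scoring vector, the random parameters, and all names (`attentional_fusion`, `score_vector`, `SHARED_DIM`, the feature sizes) are assumptions made for illustration. The sketch only shows that softmax weights are non-negative and sum to one, so the fused vector is a convex combination of the projected features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(matrix, vector):
    """Map one feature into the shared space: tanh(matrix @ vector)."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in matrix]


def attentional_fusion(features, projections, score_vector):
    """Fuse features of different sizes as a convex combination.

    Each feature is projected to a shared dimension, scored with one
    shared vector (an assumed scoring scheme), and the softmax of the
    scores gives one scalar weight per feature.
    """
    projected = [project(m, f) for m, f in zip(projections, features)]
    scores = [sum(s * p for s, p in zip(score_vector, vec)) for vec in projected]
    weights = softmax(scores)
    dim = len(score_vector)
    fused = [sum(w * vec[i] for w, vec in zip(weights, projected)) for i in range(dim)]
    return fused, weights


# Hypothetical setup: three features of different sizes, random parameters.
rng = random.Random(0)
SHARED_DIM = 4
feature_dims = [6, 3, 5]
features = [[rng.uniform(-1, 1) for _ in range(n)] for n in feature_dims]
projections = [
    [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(SHARED_DIM)]
    for n in feature_dims
]
score_vector = [rng.uniform(-1, 1) for _ in range(SHARED_DIM)]

fused, weights = attentional_fusion(features, projections, score_vector)
print("per-feature weights:", [round(w, 3) for w in weights])
```

Because each feature receives a single scalar weight, the weights can be read directly as a measure of how much each feature contributes, which is the kind of interpretability the abstract suggests could support feature selection. In a trained model the projections and the scoring vector would be learned rather than sampled at random.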
Related papers
- Renmin University Of China At TRECVID 2022: Improving Video Search By Feature Fusion And Negation Understanding (2022) 0.00
- Modality-agnostic Attention Fusion For Visual Search With Text Feedback (2020) 0.00
- Audio-enhanced Text-to-video Retrieval Using Text-conditioned Feature Alignment (2023) 11.08
- Continual Text-to-video Retrieval With Frame Fusion And Task-aware Routing (2025) 8.75
- Cross-modal Search Method Of Technology Video Based On Adversarial Learning And Feature Fusion (2022) 0.00
- Unifying Latent And Lexicon Representations For Effective Video-text Retrieval (2024) 0.00
- Everything At Once -- Multi-modal Fusion Transformer For Video Retrieval (2021) 15.78
- F4-ITS: Fine-grained Feature Fusion For Food Image-text Search (2025) 1.40