MUSE: Mamba Is Efficient Multi-scale Learner For Text-video Retrieval
2024 · Haoran Tang, Meng Cao, Jinfa Huang, et al.
Abstract
Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, because CLIP has an inherently plain, single-scale structure, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid to the last single-scale feature map. We then employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks validate the superiority of MUSE.
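The pyramid step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which operator builds the pyramid or how the scales are passed to Mamba, so the average pooling, the scale factors (1, 2, 4), and the concatenation of all scales into one token sequence are assumptions made for illustration, and the helper names `avg_pool`, `build_pyramid`, and `flatten_scales` are hypothetical. The Mamba layers themselves are omitted.

```python
from typing import List, Sequence

Grid = List[List[List[float]]]  # H x W x D feature map (patch tokens on a grid)


def avg_pool(grid: Grid, k: int) -> Grid:
    """Non-overlapping k x k average pooling (assumed pyramid operator)."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    if h % k or w % k:
        raise ValueError("grid size must be divisible by the pooling factor")
    pooled: Grid = []
    for i in range(0, h, k):
        row = []
        for j in range(0, w, k):
            cell = [0.0] * d
            for di in range(k):
                for dj in range(k):
                    for c in range(d):
                        cell[c] += grid[i + di][j + dj][c]
            row.append([v / (k * k) for v in cell])
        pooled.append(row)
    return pooled


def build_pyramid(grid: Grid, factors: Sequence[int] = (1, 2, 4)) -> List[Grid]:
    """Derive one feature map per scale from the single-scale input."""
    return [avg_pool(grid, k) for k in factors]


def flatten_scales(pyramid: List[Grid]) -> List[List[float]]:
    """Concatenate all scales into one token sequence (assumed input layout
    for a linear-time sequence model such as Mamba)."""
    tokens: List[List[float]] = []
    for level in pyramid:
        for row in level:
            tokens.extend(row)
    return tokens


# Toy 4 x 4 grid with 2-dimensional features.
grid = [[[float(i * 4 + j), 1.0] for j in range(4)] for i in range(4)]
tokens = flatten_scales(build_pyramid(grid))
print(len(tokens))  # 16 + 4 + 1 = 21 tokens across the three scales
```

Under these assumptions the joint sequence is only modestly longer than the original one (21 tokens versus 16 in the toy case), which is where a sequence model with linear cost in sequence length is attractive.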
Related papers
- Mamfusion: Multi-mamba With Temporal Fusion For Partially Relevant Video Retrieval (2025) · 1.69
- MUVR: A Multi-modal Untrimmed Video Retrieval Benchmark With Multi-level Visual Correspondence (2025) · 1.40
- Mv-adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) · 9.76
- CLIMP: Contrastive Language-image Mamba Pretraining (2026) · 0.00
- Bidirectional Likelihood Estimation With Multi-modal Large Language Models For Text-video Retrieval (2025) · 2.76
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021) · 0.00
- MURE: Hierarchical Multi-resolution Encoding Via Vision-language Models For Visual Document Retrieval (2026) · 0.00
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) · 0.00