MamFusion: Multi-Mamba With Temporal Fusion For Partially Relevant Video Retrieval
2025 · Xinru Ying, Jiaqi Mo, Jingyang Lin, et al.
Abstract
Partially Relevant Video Retrieval (PRVR) is a challenging task in multimedia retrieval: it aims to identify and retrieve untrimmed videos that are only partially relevant to a given query. In this work, we investigate long-sequence video content understanding to address information redundancy. Leveraging the strong long-term state space modeling capability and linear scalability of the Mamba module, we introduce a multi-Mamba framework with temporal fusion (MamFusion) tailored to the PRVR task. This framework captures the state-relatedness in long-term video content and integrates it into text-video relevance understanding, thereby enhancing retrieval. Specifically, we introduce Temporal T-to-V Fusion and Temporal V-to-T Fusion to explicitly model temporal relationships between text queries and video moments, improving contextual awareness and retrieval accuracy. Extensive experiments conducted on large-scale data
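The abstract does not give MamFusion's equations, so the following is only a minimal pure-Python sketch of two ideas it relies on, under stated assumptions. First, a simplified, non-selective linear state-space recurrence stands in for Mamba: it mixes temporal context into each clip in a single pass, so cost grows linearly with the number of clips (real Mamba makes its parameters input-dependent, which this sketch does not). Second, a video is scored by its best-matching clip (max cosine similarity over time), a common way to express partial relevance that is assumed here, not taken from the paper. All function names and parameters (`linear_ssm_scan`, `decay`, `gain`, `rank_videos`) are hypothetical, and the Temporal T-to-V / V-to-T Fusion modules are not modeled.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def linear_ssm_scan(xs: Sequence[Vector], decay: Vector, gain: Vector) -> List[Vector]:
    """Causal per-channel recurrence h_t = decay * h_{t-1} + gain * x_t.

    One pass over the clips, so cost is linear in sequence length.
    Hypothetical simplification: decay/gain are fixed, whereas Mamba
    makes its state-space parameters input-dependent ("selective").
    """
    h = [0.0] * len(decay)
    out: List[Vector] = []
    for x in xs:
        h = [d * hi + g * xi for d, hi, g, xi in zip(decay, h, gain, x)]
        out.append(list(h))
    return out


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def partial_relevance(query: Vector, clips: Sequence[Vector]) -> float:
    """Score an untrimmed video by its single best-matching clip (max over time)."""
    return max(cosine(query, c) for c in clips)


def rank_videos(
    query: Vector, videos: Dict[str, List[Vector]], decay: Vector, gain: Vector
) -> Tuple[List[str], Dict[str, float]]:
    """Contextualize each video's clips with the scan, then rank by partial relevance."""
    scores = {vid: partial_relevance(query, linear_ssm_scan(clips, decay, gain))
              for vid, clips in videos.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


videos = {
    "a": [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],  # only the last clip matches the query
    "b": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],  # no clip matches
}
order, scores = rank_videos([1.0, 0.0], videos, decay=[0.5, 0.5], gain=[1.0, 1.0])
print(order, scores)  # ['a', 'b'] {'a': 0.8, 'b': 0.0}
```

Video "a" ranks first because one of its moments matches, even though most of it is irrelevant; the recurrence carries earlier clips forward, which is why its best score is 0.8 rather than 1.0.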
Related papers
- PRVR: Partially Relevant Video Retrieval (2022) · 2.26
- GMMFormer: Gaussian-Mixture-Model Based Transformer For Efficient Partially Relevant Video Retrieval (2023) · 12.06
- Uneven Event Modeling For Partially Relevant Video Retrieval (2025) · 1.40
- Ambiguity-restrained Text-video Representation Learning For Partially Relevant Video Retrieval (2025) · 5.84
- Mitigating Semantic Collapse In Partially Relevant Video Retrieval (2025) · 0.00
- Improving Video Corpus Moment Retrieval With Partial Relevance Enhancement (2024) · 7.89
- MUSE: Mamba Is Efficient Multi-scale Learner For Text-video Retrieval (2024) · 6.34
- MUVR: A Multi-modal Untrimmed Video Retrieval Benchmark With Multi-level Visual Correspondence (2025) · 1.40