Audio Does Matter: Importance-aware Multi-granularity Fusion For Video Moment Retrieval
2025 · Junan Lin, Daizong Liu, Xianke Chen, et al.
Abstract
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to a given query. Most existing VMR methods focus solely on the visual and textual modalities and neglect the complementary but important audio modality. Although a few recent works attempt joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical because not all audio is helpful for video moment retrieval: the audio of some videos may be pure noise or background sound that carries no information for locating the moment. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance
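The idea of selectively aggregating audio, rather than treating all modalities equally, can be illustrated with a minimal sketch. This is not the IMG model: the function names, the sigmoid gate, and the additive fusion are all assumptions chosen for illustration, and the abstract does not specify how the importance score is computed or applied.

```python
import math

def sigmoid(x):
    """Map a raw score to an importance weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def fuse_clip(visual, audio, audio_logit):
    """Fuse one clip's visual and audio features (hypothetical helper).

    The audio features are scaled by an importance weight before being
    added to the visual features, so unhelpful audio contributes little.
    """
    weight = sigmoid(audio_logit)  # assumed gate, not taken from the paper
    return [v + weight * a for v, a in zip(visual, audio)]

def fuse_video(visual_seq, audio_seq, audio_logit):
    """Apply the same importance weight to every clip of a video."""
    return [fuse_clip(v, a, audio_logit) for v, a in zip(visual_seq, audio_seq)]
```

With a strongly negative `audio_logit` the fused features stay close to the visual features alone, which mirrors the case where a video's audio is noise or background sound.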
Related papers
- Hybrid-learning Video Moment Retrieval Across Multi-domain Labels (2024) 0.00
- Improving Video Corpus Moment Retrieval With Partial Relevance Enhancement (2024) 7.89
- Unified Interactive Multimodal Moment Retrieval Via Cascaded Embedding-reranking And Temporal-aware Score Fusion (2025) 0.00
- Query-centric Audio-visual Cognition Network For Moment Retrieval, Segmentation And Step-captioning (2024) 3.58
- Vlanet: Video-language Alignment Network For Weakly-supervised Video Moment Retrieval (2020) 13.28
- Viseret: A Simple Yet Effective Approach To Moment Retrieval Via Fine-grained Video Segmentation (2021) 0.00
- Deep Music Retrieval For Fine-grained Videos By Exploiting Cross-modal-encoded Voice-overs (2021) 6.34
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion (2025) 2.26