Language-based Audio Moment Retrieval
2024 · Hokuto Munakata, Taichi Nishimura, Shota Nakada, et al.
Abstract
In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips from an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work in AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. This model captures temporal dependencies within audio features, inspired by similar video moment retrieval tasks, thus surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level
Related papers
- CASTELLA: Long Audio Dataset With Captions And Temporal Boundaries (2025)
- SMART: Shot-aware Multimodal Video Moment Retrieval With Audio-enhanced MLLM (2025)
- Contrastive Latent Space Reconstruction Learning For Audio-text Retrieval (2023)
- Introducing Auxiliary Text Query-modifier To Content-based Audio Retrieval (2022)
- Advancing Natural-language Based Audio Retrieval With PaSST And Large Audio-caption Data Sets (2023)
- Automated Audio Captioning And Language-based Audio Retrieval (2022)
- Improving Natural-language-based Audio Retrieval With Transfer Learning And Audio & Text Augmentations (2022)
- Retrieval-augmented Text-to-audio Generation (2023)