Audio Retrieval With Natural Language Queries: A Benchmark Study
2021 · A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, et al.
Abstract
The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-aud
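The retrieval task described in the abstract can be illustrated with a minimal sketch. It assumes that text and audio have already been mapped into a shared embedding space by some model (not specified here) and that candidates are ranked by cosine similarity. The toy vectors, the function names, and the use of recall at rank k as the score are illustrative assumptions; they are not taken from the paper excerpt above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) is ranked in the top k."""
    hits = sum(
        1 for i, q in enumerate(queries) if i in rank_candidates(q, candidates)[:k]
    )
    return hits / len(queries)


# Hypothetical toy embeddings: text_emb[i] is the description paired with audio_emb[i].
text_emb = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.1]]
audio_emb = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 1.0]]

# Text-audio retrieval: each description queries the pool of audio clips.
text_to_audio_r1 = recall_at_k(text_emb, audio_emb, k=1)

# Audio-text retrieval: each audio clip queries the pool of descriptions.
audio_to_text_r1 = recall_at_k(audio_emb, text_emb, k=1)

print(f"text-to-audio R@1: {text_to_audio_r1:.2f}")
print(f"audio-to-text R@1: {audio_to_text_r1:.2f}")
```

The same ranking routine serves both directions; only the roles of query and candidate pool are swapped, which mirrors the "and vice versa" in the task definition.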
Related papers
- Advancing Natural-language Based Audio Retrieval With Passt And Large Audio-caption Data Sets (2023)
- Improving Natural-language-based Audio Retrieval With Transfer Learning And Audio & Text Augmentations (2022)
- Dissecting Temporal Understanding In Text-to-audio Retrieval (2024)
- Data Leakage In Cross-modal Retrieval Training: A Case Study (2023)
- Speaker Retrieval In The Wild: Challenges, Effectiveness And Robustness (2025)
- Voice-face Cross-modal Matching And Retrieval: A Benchmark (2019)
- Matching Text And Audio Embeddings: Exploring Transfer-learning Strategies For Language-based Audio Retrieval (2022)
- Learning Audio-video Modalities From Image Captions (2022)