Dissecting Temporal Understanding In Text-to-audio Retrieval
2024 · Andreea-Maria Oncescu, João F. Henriques, A. Sophia Koepke
Abstract
Recent advancements in machine learning have fueled research on multimodal tasks such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating the temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of events. Code and data are available at https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/.