Estimated Audio-caption Correspondences Improve Language-based Audio Retrieval
2024 · Paul Primus, Florian Schmid, Gerhard Widmer
Abstract
Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we, therefore, suggest substituting it with estimated correspondences. To this end, we propose a two-staged training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio-caption
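The conventional setup the abstract starts from (matched pairs as positives, every other caption in the batch treated as a mismatch) can be sketched as a symmetric contrastive loss. This is a minimal NumPy illustration, not the authors' implementation: the function names, the temperature value, and the use of in-batch captions as the "randomly drawn" mismatches are assumptions made for the example.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(audio_emb, text_emb, temperature=0.05):
    """Symmetric cross-entropy over a batch of audio-caption pairs.

    Row i of audio_emb and row i of text_emb form the annotated matching
    pair; all other combinations in the batch are assumed to be mismatches.
    The temperature value is a hypothetical choice for illustration.
    """
    # Project both modalities onto the unit sphere (cosine similarity).
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = a @ t.T / temperature  # shape: (batch, batch)

    # Binary correspondence targets: 1 on the diagonal, 0 elsewhere.
    # This is the assumption the abstract criticizes, since an
    # off-diagonal caption may still describe the recording.
    targets = np.eye(len(a))

    audio_to_text = -(targets * log_softmax(sim, axis=1)).sum(axis=1).mean()
    text_to_audio = -(targets * log_softmax(sim, axis=0)).sum(axis=0).mean()
    return 0.5 * (audio_to_text + text_to_audio)
```

In this sketch, the `targets` matrix is the place where correspondence information enters the loss; the identity matrix encodes "only the annotated pair matches".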
Related papers
- Text-based Audio Retrieval By Learning From Similarities Between Audio Captions (2024) · 2.26
- Unsupervised Audio-caption Aligning Learns Correspondences Between Individual Sound Events And Textual Phrases (2021) · 8.09
- Audio Captioning Using Pre-trained Large-scale Language Model Guided By Audio-based Similar Caption Retrieval (2020) · 0.00
- Advancing Natural-language Based Audio Retrieval With PaSST And Large Audio-caption Data Sets (2023) · 0.00
- Enhancing Retrieval-augmented Audio Captioning With Generation-assisted Multimodal Querying And Progressive Learning (2024) · 3.58
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021) · 0.00
- Learning Audio-video Modalities From Image Captions (2022) · 12.54
- Introducing Auxiliary Text Query-modifier To Content-based Audio Retrieval (2022) · 0.00