Text-based Audio Retrieval By Learning From Similarities Between Audio Captions
2024 · Huang Xie, Khazar Khorrami, Okko Räsänen, et al.
Abstract
This paper proposes using similarities between audio captions to estimate audio-caption relevance for training text-based audio retrieval systems. Current audio-caption datasets (e.g., Clotho) contain audio samples paired with annotated captions but lack relevance information for audio-caption combinations beyond the annotated pairs. Moreover, mainstream approaches (e.g., CLAP) usually treat the annotated pairs as positives and all other audio-caption combinations as negatives, which assumes a binary relevance between audio samples and captions. To infer the relevance between audio samples and arbitrary captions, we propose a method that computes non-binary audio-caption relevance scores based on the textual similarities of audio captions. We measure textual similarities of audio captions by calculating the cosine similarity of their Sentence-BERT embeddings and then transform these similarities into audio-caption relevance scores using a logistic function, thereby link
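The scoring step described in the abstract (cosine similarity between caption embeddings, passed through a logistic function) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors and the example captions in the comments are made-up stand-ins for real Sentence-BERT embeddings, the `steepness` and `midpoint` parameters of the logistic function are hypothetical values (the abstract does not state them), and the sketch assumes each audio sample is represented by a single annotated caption.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def logistic(x: float, steepness: float = 10.0, midpoint: float = 0.5) -> float:
    """Map a similarity to (0, 1); both parameters are hypothetical."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def relevance_scores(
    caption_embeddings: Sequence[Sequence[float]],
    steepness: float = 10.0,
    midpoint: float = 0.5,
) -> List[List[float]]:
    """Entry [i][j] estimates how relevant caption j is to the audio sample
    annotated with caption i, based only on caption-caption similarity."""
    return [
        [
            logistic(cosine_similarity(e_i, e_j), steepness, midpoint)
            for e_j in caption_embeddings
        ]
        for e_i in caption_embeddings
    ]


# Toy vectors standing in for Sentence-BERT caption embeddings (made up).
toy_embeddings = [
    [0.9, 0.1, 0.0],  # e.g. "a dog barks in the distance"
    [0.8, 0.3, 0.1],  # e.g. "a dog is barking outside"
    [0.0, 0.2, 0.9],  # e.g. "rain falls on a metal roof"
]

scores = relevance_scores(toy_embeddings)
for row in scores:
    print(" ".join(f"{s:.3f}" for s in row))
```

In this toy setup the two dog captions receive a high mutual relevance score while the rain caption scores close to zero against both, which is the non-binary behaviour the abstract contrasts with treating every non-annotated pair as a negative. In practice the embeddings would come from a Sentence-BERT model rather than hand-written vectors.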
Related papers
- Estimated Audio-caption Correspondences Improve Language-based Audio Retrieval (2024) · 0.00
- Integrating Continuous And Binary Relevances In Audio-text Relevance Learning (2024) · 0.00
- Advancing Natural-language Based Audio Retrieval With PaSST And Large Audio-caption Data Sets (2023) · 0.00
- Unsupervised Audio-caption Aligning Learns Correspondences Between Individual Sound Events And Textual Phrases (2021) · 8.09
- Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning (2022) · 0.00
- Learning Audio-video Modalities From Image Captions (2022) · 12.54
- RECAP: Retrieval-augmented Audio Captioning (2023) · 9.41
- Audio Captioning Using Pre-trained Large-scale Language Model Guided By Audio-based Similar Caption Retrieval (2020) · 0.00