Enhancing Retrieval-augmented Audio Captioning With Generation-assisted Multimodal Querying And Progressive Learning
2024 · Choi Changin, Lim Sungjun, Rhee Wonjong
Abstract
Retrieval-augmented generation can improve audio captioning by incorporating relevant audio-text pairs from a knowledge base. Existing methods typically rely solely on the input audio as a unimodal retrieval query. In contrast, we propose Generation-Assisted Multimodal Querying, which generates a text description of the input audio to enable multimodal querying. This approach aligns the query modality with the audio-text structure of the knowledge base, leading to more effective retrieval. Furthermore, we introduce a novel progressive learning strategy that gradually increases the number of interleaved audio-text pairs to enhance the training process. Our experiments on AudioCaps, Clotho, and Auto-ACD demonstrate that our approach achieves state-of-the-art results across these benchmarks.
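The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the audio and generated-text queries are combined or how the progressive schedule is defined. The weighted sum of cosine similarities, the `alpha` weight, the toy embeddings, and the helper names `retrieve` and `pairs_for_epoch` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_audio, query_text, knowledge_base, k=2, alpha=0.5):
    """Rank audio-text pairs using both query modalities.

    query_text stands in for the embedding of a caption generated
    from the input audio. The weighted sum below is an assumed
    fusion rule, not one taken from the paper.
    """
    scored = []
    for entry in knowledge_base:
        audio_sim = cosine(query_audio, entry["audio_emb"])
        text_sim = cosine(query_text, entry["text_emb"])
        scored.append((alpha * audio_sim + (1 - alpha) * text_sim, entry["caption"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [caption for _, caption in scored[:k]]


def pairs_for_epoch(epoch, max_pairs=4, epochs_per_step=2):
    """Hypothetical schedule: add one retrieved pair every few epochs."""
    return min(max_pairs, 1 + epoch // epochs_per_step)


# Toy knowledge base with hand-made 2-D embeddings (illustrative only).
knowledge_base = [
    {"audio_emb": [1, 0], "text_emb": [1, 0], "caption": "a dog barks"},
    {"audio_emb": [0, 1], "text_emb": [0, 1], "caption": "rain falls"},
    {"audio_emb": [1, 0], "text_emb": [0, 1], "caption": "a dog barks in rain"},
]

top = retrieve([1, 0], [1, 0], knowledge_base, k=pairs_for_epoch(2))
print(top)
```

Setting `alpha=1.0` in this sketch reduces it to the audio-only, unimodal query that the abstract attributes to existing methods.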
Related papers
- RECAP: Retrieval-augmented Audio Captioning (2023)
- Advancing Natural-language Based Audio Retrieval With PaSST And Large Audio-caption Data Sets (2023)
- Improving Audio-text Retrieval Via Hierarchical Cross-modal Interaction And Auxiliary Captions (2023)
- Improving Natural-language-based Audio Retrieval With Transfer Learning And Audio & Text Augmentations (2022)
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024)
- Retrieval-augmented Text-to-audio Generation (2023)
- Introducing Auxiliary Text Query-modifier To Content-based Audio Retrieval (2022)
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023)