AudioSetCaps: An Enriched Audio-Caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models
2024 · Jisheng Bai, Haohe Liu, Mou Wang, et al.
Abstract
With the emergence of audio-language models, constructing large-scale paired audio-language datasets has become essential yet challenging for model development, primarily due to the time-intensive and labour-heavy demands involved. While large language models (LLMs) have improved the efficiency of synthetic audio caption generation, current approaches struggle to effectively extract and incorporate detailed audio information. In this paper, we propose an automated pipeline that integrates audio-language models for fine-grained content extraction, LLMs for synthetic caption generation, and a contrastive language-audio pretraining (CLAP) model-based refinement process to improve the quality of captions. Specifically, we employ prompt chaining techniques in the content extraction stage to obtain accurate and fine-grained audio information, while we use the refinement process to mitigate potential hallucinations in the generated captions. Leveraging the AudioSet dataset and the proposed ap
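The CLAP-based refinement step mentioned in the abstract can be pictured as a similarity filter: each candidate caption is scored against the audio clip in a shared embedding space, and low-scoring captions are treated as likely hallucinations. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The function names, the toy embedding vectors, and the 0.5 threshold are all hypothetical; a real system would obtain the embeddings from a CLAP model and tune the threshold empirically.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def refine_captions(audio_embedding, caption_embeddings, threshold=0.5):
    """Keep captions whose embedding is close enough to the audio embedding.

    audio_embedding: vector for the audio clip (assumed to come from a CLAP audio encoder).
    caption_embeddings: dict mapping caption text to its vector (assumed CLAP text encoder).
    threshold: hypothetical cut-off; captions scoring below it are dropped.
    Returns (caption, score) pairs sorted from most to least similar.
    """
    kept = []
    for caption, embedding in caption_embeddings.items():
        score = cosine_similarity(audio_embedding, embedding)
        if score >= threshold:
            kept.append((caption, score))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Toy usage with hand-made 2-D vectors standing in for real embeddings.
audio = [1.0, 0.0]
candidates = {
    "a dog barks": [1.0, 0.0],      # aligned with the audio vector
    "a piano plays": [0.0, 1.0],    # orthogonal, so it is filtered out
}
result = refine_captions(audio, candidates)
```

In this toy setup only the caption whose vector aligns with the audio vector survives the filter. A pipeline could instead regenerate rejected captions rather than discard them; which strategy the paper uses is not stated in the excerpt above.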
Related papers
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024)
- AudioSetMix: Enhancing Audio-language Datasets With LLM-assisted Augmentations (2024)
- Auto-ACD: A Large-scale Dataset For Audio-language Representation Learning (2023)
- WavCaps: A ChatGPT-assisted Weakly-labelled Audio Captioning Dataset For Audio-language Multimodal Research (2023)
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024)
- DRCap: Decoding CLAP Latents With Retrieval-augmented Generation For Zero-shot Audio Captioning (2024)
- Audio Captioning Using Pre-trained Large-scale Language Model Guided By Audio-based Similar Caption Retrieval (2020)
- SLAM-AAC: Enhancing Audio Captioning With Paraphrasing Augmentation And CLAP-Refine Through LLMs (2024)