WavCaps: A ChatGPT-assisted Weakly-labelled Audio Captioning Dataset For Audio-language Multimodal Research
2023 · Xinhao Mei, Chutong Meng, Haohe Liu, et al.
Abstract
The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of WavCaps
Related papers
- AudioSetCaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024) 13.44
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024) 7.16
- Auto-ACD: A Large-scale Dataset For Audio-language Representation Learning (2023) 10.74
- EmotionCaps: Enhancing Audio Captioning Through Emotion-augmented Data Generation (2024) 0.00
- Automated Audio Captioning: An Overview Of Recent Progress And New Challenges (2022) 12.10
- Crowdsourcing A Dataset Of Audio Captions (2019) 8.60
- Learning Audio-video Modalities From Image Captions (2022) 12.54
- RECAP: Retrieval-augmented Audio Captioning (2023) 9.41