Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning
2023 · Luoyi Sun, Xuenan Xu, Mengyue Wu, et al.
Abstract
Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. For audio representation learning, however, existing datasets are limited in three respects: insufficient volume, simplistic content, and arduous collection procedures. To establish an audio dataset with high-quality captions, we propose an automatic approach that leverages multimodal inputs, such as video frames and audio streams. Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs. We use a series of pre-trained models or APIs to determine audio-visual synchronisation and to generate image captions, object detections, or audio tags for specific videos. Subsequently, we employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues. To demonstrate the effectiveness of the proposed dataset, we train
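The caption-generation step described in the abstract (clues extracted by pre-trained models, then paraphrased by an LLM) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: the `Clues` container, its field names, the prompt wording, and in particular the choice to drop visual clues when audio and video are out of sync are assumptions made only for illustration. The sketch builds the text prompt; it does not call any model.

```python
from dataclasses import dataclass, field


@dataclass
class Clues:
    """Hypothetical container for the per-clip outputs of the pre-trained models."""
    image_caption: str = ""
    objects: list[str] = field(default_factory=list)
    audio_tags: list[str] = field(default_factory=list)
    av_sync: bool = True  # assumed flag: do sound and picture appear synchronised?


def build_prompt(clues: Clues) -> str:
    """Assemble a text prompt asking an LLM for a one-sentence audio caption."""
    lines = ["Write one sentence describing what can be heard in this clip."]
    if clues.audio_tags:
        lines.append("Audio tags: " + ", ".join(clues.audio_tags))
    # Assumption for this sketch: visual clues are only used when the
    # synchronisation check passed, since off-screen sound makes them unreliable.
    if clues.av_sync:
        if clues.image_caption:
            lines.append("Image caption: " + clues.image_caption)
        if clues.objects:
            lines.append("Detected objects: " + ", ".join(clues.objects))
    return "\n".join(lines)


if __name__ == "__main__":
    example = Clues(
        image_caption="a dog running on a beach",
        objects=["dog", "sea"],
        audio_tags=["bark", "waves"],
    )
    print(build_prompt(example))
```

The resulting string would then be sent to whichever LLM is used for paraphrasing; keeping prompt assembly separate from the model call makes it easy to inspect which clues reached the caption.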
Related papers
- AudioSetCaps: An Enriched Audio-Caption Dataset Using Automated Generation Pipeline with Large Audio and Language Models (2024)
- AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations (2024)
- Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions (2024)
- Automated Audio Captioning: An Overview of Recent Progress and New Challenges (2022)
- Learning Audio-Video Modalities from Image Captions (2022)
- Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding (2024)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research (2023)
- Audio Caption: Listen and Tell (2019)