Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions
2024 · Yi Yuan, Dongya Jia, Xiaobin Zhuang, et al.
Abstract
Generative models have achieved strong results in audio generation tasks. However, existing models struggle with complex and detailed prompts, which can degrade performance. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models. We first develop an automated pipeline that generates detailed captions by using a Large Language Model (LLM) to transform predicted visual captions, audio captions, and tagging labels into comprehensive descriptions. The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details, including the order of audio events, the places where they occur, and environment information. We then demonstrate that training text-to-audio generation models with Sound-VECaps significantly improves performance on complex prompts. Furthermore, we conduct ablation studies of the models
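The captioning step described in the abstract (three per-clip annotations merged into one description by an LLM) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class `ClipAnnotations`, the functions `build_caption_prompt` and `enrich_caption`, and the prompt wording are hypothetical and are not taken from the paper. The LLM is passed in as a plain callable, so no particular model or API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClipAnnotations:
    """Per-clip inputs named in the abstract (field names are hypothetical)."""
    visual_caption: str
    audio_caption: str
    tags: List[str]


def build_caption_prompt(clip: ClipAnnotations) -> str:
    """Assemble one prompt from the three annotation sources.

    The instruction wording is an assumption, not the paper's actual prompt.
    """
    tag_text = ", ".join(clip.tags) if clip.tags else "(none)"
    return (
        "Write one detailed caption for an audio clip. "
        "Describe the order of sound events, where they occur, "
        "and the environment.\n"
        f"Visual caption: {clip.visual_caption}\n"
        f"Audio caption: {clip.audio_caption}\n"
        f"Tags: {tag_text}"
    )


def enrich_caption(clip: ClipAnnotations, llm: Callable[[str], str]) -> str:
    """Send the prompt to any text-in, text-out callable and trim the reply."""
    return llm(build_caption_prompt(clip)).strip()
```

Injecting `llm` as a callable keeps the sketch independent of any model provider and lets a stub stand in for the model when checking the prompt format.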
Related papers
- AudioSetCaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024)
- EmotionCaps: Enhancing Audio Captioning Through Emotion-augmented Data Generation (2024)
- Taming Data And Transformers For Audio Generation (2024)
- WavCaps: A ChatGPT-assisted Weakly-labelled Audio Captioning Dataset For Audio-language Multimodal Research (2023)
- MusCaps: Generating Captions For Music Audio (2021)
- Performance Improvement Of Language-queried Audio Source Separation Based On Caption Augmentation From Large Language Models For DCASE Challenge 2024 Task 9 (2024)
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023)
- Auto-ACD: A Large-scale Dataset For Audio-language Representation Learning (2023)