CosyAudio: Improving Audio Generation With Confidence Scores And Synthetic Captions
2025 · Xinfa Zhu, Wenjie Tian, Xinsheng Wang, et al.
Abstract
Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well-labeled datasets and the prevalence of noisy or inaccurate captions in large-scale, weakly labeled corpora. To address these challenges, we propose CosyAudio, a novel framework that utilizes confidence scores and synthetic captions to enhance the quality of audio generation. CosyAudio consists of two core components: AudioCapTeller and an audio generator. AudioCapTeller generates synthetic captions for audio and provides confidence scores to evaluate their accuracy. The audio generator uses these synthetic captions and confidence scores to enable quality-aware audio generation. Additionally, we introduce a self-evolving training strategy that iteratively optimizes CosyAudio across both well-labeled and weakly-labeled datasets. Initially trai
Related papers
- EzAudio: Enhancing Text-to-Audio Generation With Efficient Diffusion Transformer (2024) 7.50
- AudioGen: Textually Guided Audio Generation (2022) 0.00
- Taming Data And Transformers For Audio Generation (2024) 0.00
- ETTA: Elucidating The Design Space Of Text-to-Audio Models (2024) 0.00
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024) 7.16
- AudioComposer: Towards Fine-grained Audio Generation With Natural Language Descriptions (2024) 5.24
- Synthio: Augmenting Small-scale Audio Classification Datasets With Synthetic Data (2024) 0.00
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024) 0.00