AudioComposer: Towards Fine-grained Audio Generation With Natural Language Descriptions
2024 · Yuanyuan Wang, Hangting Chen, Dongchao Yang, et al.
Abstract
Current text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style. Some studies try to improve the granularity by incorporating additional frame-level conditions or control networks. However, this usually leads to complex system design and to difficulties arising from the requirement for reference frame-level conditions. To address these challenges, we propose AudioComposer, a novel TTA generation framework that relies solely on natural language descriptions (NLDs) to provide both content specification and style control information. To further enhance audio generative modeling, we employ flow-based diffusion transformers with the cross-attention mechanism to incorporate text descriptions effectively into audio generation processes, which can not only simultaneously consider the content and style information in the text inputs, but also accelerate generation compared to oth
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024)
- ControlAudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling (2025)
- Auffusion: Leveraging The Power Of Diffusion And Large Language Models For Text-to-audio Generation (2024)
- Text-to-audio Generation Using Instruction-tuned LLM And Latent Diffusion Model (2023)
- DreamAudio: Customized Text-to-audio Generation With Diffusion Models (2026)
- ETTA: Elucidating The Design Space Of Text-to-audio Models (2024)
- AudioGen: Textually Guided Audio Generation (2022)
- Audiorag+: Feedback-driven Retrieval-augmented Audio Generation With Large Audio Language Models (2025)