Taming Data And Transformers For Audio Generation
2024 · Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, et al.
Abstract
The scalability of ambient sound generators is hindered by data scarcity, insufficient caption quality, and limited scalability in model architecture. This work addresses these challenges by advancing both data and model scaling. First, we propose an efficient and scalable dataset collection pipeline tailored for ambient audio generation, resulting in AutoReCap-XL, the largest ambient audio-text dataset with over 47 million clips. To provide high-quality textual annotations, we propose AutoCap, a high-quality automatic audio captioning model. By adopting a Q-Former module and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of \(83.2\), a \(3.2\%\) improvement over previous captioning models. Finally, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. We demonstrate its benefits from data scaling with synthetic captions as well as model size scaling. When compared to baseline
Related papers
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024)
- AudioSetCaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024)
- Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation (2023)
- Auto-ACD: A Large-scale Dataset For Audio-language Representation Learning (2023)
- Advancing Natural-language Based Audio Retrieval With PaSST And Large Audio-caption Data Sets (2023)
- EmotionCaps: Enhancing Audio Captioning Through Emotion-augmented Data Generation (2024)
- AudioGen: Textually Guided Audio Generation (2022)
- Automated Audio Captioning: An Overview Of Recent Progress And New Challenges (2022)