AudioGen: Textually Guided Audio Generation
2022 · Felix Kreuk, Gabriel Synnaeve, Adam Polyak, et al.
Abstract
We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges, we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple
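The mixing augmentation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`rms`, `mix_pair`), the choice to balance the two clips by a signal-to-noise ratio in dB, the peak normalization, and the joining of captions with "and" are all assumptions made here for illustration, since the text above only says that different audio samples are mixed.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a waveform given as a list of floats."""
    if not signal:
        raise ValueError("signal must be non-empty")
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_pair(audio_a, audio_b, caption_a, caption_b, snr_db):
    """Mix two waveforms at a target SNR (hypothetical helper).

    audio_a is the reference clip; audio_b is rescaled so that
    20 * log10(rms(a) / rms(scaled b)) equals snr_db.
    Returns the mixed waveform and a combined caption.
    """
    n = min(len(audio_a), len(audio_b))  # crop both clips to the shorter one
    a, b = audio_a[:n], audio_b[:n]
    rms_a, rms_b = rms(a), rms(b)
    if rms_a == 0.0 or rms_b == 0.0:
        gain = 1.0  # one clip is silent, so there is nothing to balance
    else:
        gain = rms_a / (rms_b * 10 ** (snr_db / 20))
    mixed = [x + gain * y for x, y in zip(a, b)]
    peak = max(abs(x) for x in mixed)
    if peak > 1.0:  # rescale to stay within [-1, 1]
        mixed = [x / peak for x in mixed]
    # Assumption: the training caption for the mix is the two captions joined.
    return mixed, f"{caption_a} and {caption_b}"


if __name__ == "__main__":
    sr = 16000  # assumed sample rate for the toy signals
    dog = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
    rain = [0.1 * math.sin(2 * math.pi * 880 * t / sr) for t in range(2 * sr)]
    snr = random.uniform(-5.0, 5.0)  # assumed range, chosen for the demo
    mixed, caption = mix_pair(dog, rain, "a dog barking", "rain falling", snr)
    print(len(mixed), caption)
```

Training on such mixtures with combined captions is one way a model could be pushed to represent overlapping sources separately; the exact mixing recipe would need to be checked against the full paper.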
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024)
- AudioMoG: Guiding Audio Generation With Mixture-of-guidance (2025)
- AudioComposer: Towards Fine-grained Audio Generation With Natural Language Descriptions (2024)
- Taming Data And Transformers For Audio Generation (2024)
- Diverse And Aligned Audio-to-video Generation Via Text-to-video Model Adaptation (2023)
- AudioToken: Adaptation Of Text-conditioned Diffusion Models For Audio-to-image Generation (2023)
- AudioLDM 2: Learning Holistic Audio Generation With Self-supervised Pretraining (2023)
- AnCoGen: Analysis, Control And Generation Of Speech With A Masked Autoencoder (2025)