AudioLDM 2: Learning Holistic Audio Generation With Self-supervised Pretraining
2023 · Haohe Liu, Yi Yuan, Xubo Liu, et al.
Abstract
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioM
Related papers
- AudioLM: A Language Modeling Approach To Audio Generation (2022) 18.91
- BYOL For Audio: Self-supervised Learning For General-purpose Audio Representation (2021) 15.22
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing And Composition (2024) 0.00
- AudioGen: Textually Guided Audio Generation (2022) 0.00
- Audio-omni: Extending Multi-modal Understanding To Versatile Audio Generation And Editing (2026) 0.00
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023) 5.56
- LauraGPT: Listen, Attend, Understand, And Regenerate Audio With GPT (2023) 0.00
- MiMo-Audio: Audio Language Models Are Few-shot Learners (2025) 6.03