UniAudio: An Audio Foundation Model Toward Universal Audio Generation
2023 · Dongchao Yang, Jinchuan Tian, Xu Tan, et al.
Abstract
Large language models (LLMs) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) from given input conditions. UniAudio 1) first tokenizes all types of target audio along with other condition modalities, 2) concatenates each source-target pair into a single sequence, and 3) performs next-token prediction using an LLM. In addition, a multi-scale Transformer model is proposed to handle the overly long sequences caused by the residual vector quantization-based neural codec used in tokenization. Training of UniAudio is scaled up to 165K hours of audio and 1B parameters, based on all generative tasks, aiming to obtain sufficient prior knowledge not only of the intrinsic properties of audio but also of the inter-relationship between audio and other modalities. Therefore, the trained UniAudio model has the
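The three steps listed in the abstract (tokenize, concatenate, predict the next token) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the special-token ids `BOS`, `SEP`, `EOS`, the shared id `OFFSET`, and the helper names are hypothetical, and the toy integers stand in for real codec and condition tokens. It also shows why residual vector quantization lengthens the sequence: flattening T frames with n_q codebooks yields T * n_q tokens.

```python
from typing import List, Tuple

# Hypothetical special-token ids (assumptions, not from the paper).
BOS, SEP, EOS = 0, 1, 2
OFFSET = 3  # shift content tokens past the special ids (assumed shared vocabulary)


def flatten_rvq(frames: List[List[int]]) -> List[int]:
    """Flatten T frames of n_q codebook indices into one stream of T * n_q tokens."""
    return [code + OFFSET for frame in frames for code in frame]


def build_sequence(condition: List[int], target_frames: List[List[int]]) -> List[int]:
    """Concatenate condition (source) and target audio tokens into a single sequence."""
    source = [tok + OFFSET for tok in condition]
    return [BOS] + source + [SEP] + flatten_rvq(target_frames) + [EOS]


def next_token_pairs(seq: List[int]) -> List[Tuple[int, int]]:
    """Pair each token with the token that follows it (the next-token objective)."""
    return list(zip(seq[:-1], seq[1:]))


# Toy example: 2 condition tokens, 2 audio frames with 3 codebooks each.
condition = [5, 6]
target_frames = [[10, 11, 12], [13, 14, 15]]

seq = build_sequence(condition, target_frames)
pairs = next_token_pairs(seq)
print(len(seq), len(pairs))
```

With 3 codebooks, the 2 audio frames already occupy 6 positions; at realistic frame rates this growth is the sequence-length problem the multi-scale Transformer is proposed to address.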
Related papers
- UniAudio 1.5: Large Language Model-Driven Audio Codec Is a Few-Shot Audio Task Learner (2024) · 0.00
- AudioLM: A Language Modeling Approach to Audio Generation (2022) · 18.91
- UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models (2023) · 2.26
- M\(^{2}\)UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models (2023) · 0.00
- Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing (2026) · 0.00
- AudioX: A Unified Framework for Anything-to-Audio Generation (2025) · 0.00
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024) · 0.00
- UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation (2025) · 2.26