Vevo2: A Unified And Controllable Framework For Speech And Singing Voice Generation
2025 Β· Xueyao Zhang, Junan Zhang, Yuancheng Wang, et al.
Abstract
Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a unified music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a unified content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. Particularly, during the speech-singing joint training of the AR model, we propose both explicit and implicit proso
Authors
(none)
Tags
Stats
Related papers
- Vevo: Controllable Zero-shot Voice Imitation With Self-supervised Disentanglement (2025)0.00
- Everyone-can-sing: Zero-shot Singing Voice Synthesis And Conversion With Speech Reference (2025)0.00
- Comelsinger: Discrete Token-based Zero-shot Singing Synthesis With Structured Melody Control And Guidance (2025)0.00
- Visinger: Variational Inference With Adversarial Learning For End-to-end Singing Voice Synthesis (2021)12.99
- Visinger2+: End-to-end Singing Voice Synthesis Augmented By Self-supervised Learning Representation (2024)4.52
- Learn To Sing By Listening: Building Controllable Virtual Singer By Unsupervised Learning From Voice Recordings (2023)0.00
- Cssinger: End-to-end Chunkwise Streaming Singing Voice Synthesis System Based On Conditional Variational Autoencoder (2024)0.00
- Unisyn: An End-to-end Unified Model For Text-to-speech And Singing Voice Synthesis (2022)0.00