ConSinger: Efficient High-fidelity Singing Voice Generation With Minimal Steps
2024 · Yulin Song, Guorui Sang, Jing Yu, et al.
Abstract
A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voice from a given music score (lyrics, duration, and pitch). Diffusion models have recently performed well in this field, but they trade inference speed for sample quality, which limits their application scenarios. To synthesize high-quality singing voice more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis with minimal steps. The model is trained by applying a consistency constraint, and generation quality is greatly improved at the expense of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of both generation speed and quality. Audio samples are available at https://keylxiao.github.io/consinger.
Related papers
- DiffSinger: Singing Voice Synthesis Via Shallow Diffusion Mechanism (2021) · 23.76
- SiFiSinger: A High-fidelity End-to-end Singing Voice Synthesizer Based On Source-filter Model (2024) · 4.52
- Everyone-Can-Sing: Zero-shot Singing Voice Synthesis And Conversion With Speech Reference (2025) · 0.00
- CSSinger: End-to-end Chunkwise Streaming Singing Voice Synthesis System Based On Conditional Variational Autoencoder (2024) · 0.00
- VISinger: Variational Inference With Adversarial Learning For End-to-end Singing Voice Synthesis (2021) · 12.99
- Real-time And Accurate: Zero-shot High-fidelity Singing Voice Conversion With Multi-condition Flow Synthesis (2024) · 0.00
- MakeSinger: A Semi-supervised Training Method For Data-efficient Singing Voice Synthesis Via Classifier-free Diffusion Guidance (2024) · 4.52
- XiaoiceSing: A High-quality And Integrated Singing Voice Synthesis System (2020) · 12.54