MakeSinger: A Semi-supervised Training Method For Data-efficient Singing Voice Synthesis Via Classifier-free Diffusion Guidance
2024 · Semin Kim, Myeonghun Jeong, Hyeonseung Lee, et al.
Abstract
In this paper, we propose MakeSinger, a semi-supervised training method for singing voice synthesis (SVS) via classifier-free diffusion guidance. The challenge in SVS lies in the costly process of gathering aligned sets of text, pitch, and audio data. MakeSinger enables the training of a diffusion-based SVS model from any speech and singing voice data regardless of its labeling, thereby enhancing the quality of generated voices with a large amount of unlabeled data. At inference, our novel dual guiding mechanism applies text and pitch guidance at each reverse diffusion step by estimating the score of the masked input. Experimental results show that the model trained in a semi-supervised manner outperforms baselines trained only on labeled data in terms of pronunciation, pitch accuracy, and overall quality. Furthermore, we demonstrate that by adding Text-to-Speech (TTS) data in training, the model can synthesize the singing voices of TTS speakers even without their singing voices.
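The dual guidance described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact combination rule, so the formula below is an assumption: it is the common multi-condition form of classifier-free guidance, in which each condition contributes its own weighted difference from the fully masked (unconditional) score. The names `dual_guided_score`, `toy_score`, `w_text`, and `w_pitch` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical score-model signature (an assumption for this sketch):
# (noisy input, text or None, pitch or None) -> score estimate.
# Passing None stands in for masking that condition.
ScoreFn = Callable[[Sequence[float], Optional[object], Optional[object]], List[float]]


def dual_guided_score(
    score_fn: ScoreFn,
    x_t: Sequence[float],
    text: object,
    pitch: object,
    w_text: float = 1.0,
    w_pitch: float = 1.0,
) -> List[float]:
    """Combine masked-input score estimates into one guided score.

    Assumed rule (generic multi-condition classifier-free guidance):
        s = s_uncond + w_text * (s_text - s_uncond) + w_pitch * (s_pitch - s_uncond)
    """
    s_uncond = score_fn(x_t, None, None)   # both conditions masked
    s_text = score_fn(x_t, text, None)     # pitch masked
    s_pitch = score_fn(x_t, None, pitch)   # text masked
    return [
        u + w_text * (t - u) + w_pitch * (p - u)
        for u, t, p in zip(s_uncond, s_text, s_pitch)
    ]


def toy_score(x_t: Sequence[float], text: Optional[object], pitch: Optional[object]) -> List[float]:
    """Toy stand-in for a trained network: each present condition shifts the score."""
    shift = (1.0 if text is not None else 0.0) + (2.0 if pitch is not None else 0.0)
    return [-x + shift for x in x_t]


guided = dual_guided_score(toy_score, [0.0, 1.0], "la", 440.0, w_text=2.0, w_pitch=0.5)
unguided = dual_guided_score(toy_score, [0.0, 1.0], "la", 440.0, w_text=0.0, w_pitch=0.0)
```

With both weights set to zero the result reduces to the unconditional score, which is the behaviour a model trained on unlabeled data alone would provide. Separate weights let text and pitch guidance be tuned independently.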
Related papers
- VISinger2+: End-to-end Singing Voice Synthesis Augmented By Self-supervised Learning Representation (2024) · 4.52
- DiffSinger: Singing Voice Synthesis Via Shallow Diffusion Mechanism (2021) · 23.76
- LDM-SVC: Latent Diffusion Model Based Zero-shot Any-to-any Singing Voice Conversion With Singer Guidance (2024) · 5.84
- SingAug: Data Augmentation For Singing Voice Synthesis With Cycle-consistent Training Strategy (2022) · 7.16
- Everyone-Can-Sing: Zero-shot Singing Voice Synthesis And Conversion With Speech Reference (2025) · 0.00
- ConSinger: Efficient High-fidelity Singing Voice Generation With Minimal Steps (2024) · 2.26
- Enhancing The Vocal Range Of Single-speaker Singing Voice Synthesis With Melody-unsupervised Pre-training (2023) · 3.58
- Semi-supervised Learning For Singing Synthesis Timbre (2020) · 3.58