Low-resource Cross-domain Singing Voice Synthesis Via Reduced Self-supervised Speech Representations
2024 · Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, et al.
Abstract
In this paper, we propose a singing voice synthesis model, Karaoker-SSL, that is trained only on text and speech data as a typical multi-speaker acoustic model. It is a low-resource pipeline that does not utilize any singing data end-to-end, since its vocoder is also trained on speech data. Karaoker-SSL is conditioned by self-supervised speech representations in an unsupervised manner. We preprocess these representations by selecting only a subset of their task-correlated dimensions. The conditioning module is indirectly guided to capture style information during training by multi-tasking. This is achieved with a Conformer-based module, which predicts the pitch from the acoustic model's output. Thus, Karaoker-SSL allows singing voice synthesis without reliance on hand-crafted and domain-specific features. There are also no requirements for text alignments or lyrics timestamps. To refine the voice quality, we employ a U-Net discriminator that is conditioned on the target speaker and fol
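The abstract mentions keeping only a subset of task-correlated dimensions of the self-supervised representations, but does not say how that subset is chosen. As a minimal sketch of one plausible approach, the following ranks feature dimensions by absolute Pearson correlation with a frame-level target such as pitch and keeps the top k. The selection criterion, the function names `select_correlated_dims` and `reduce_features`, and the synthetic data are all assumptions for illustration, not the authors' method.

```python
import numpy as np


def select_correlated_dims(features, target, k):
    """Return sorted indices of the k dimensions most correlated with target.

    features: (n_frames, n_dims) array, e.g. frame-level SSL representations.
    target:   (n_frames,) array, e.g. a pitch contour (assumed target).
    NOTE: Pearson-based ranking is an illustrative assumption.
    """
    features = np.asarray(features, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    f_centered = features - features.mean(axis=0)
    t_centered = target - target.mean()
    # Covariance-like numerator and norm product for Pearson correlation.
    cov = f_centered.T @ t_centered
    denom = np.linalg.norm(f_centered, axis=0) * np.linalg.norm(t_centered)
    # Constant dimensions have zero norm; treat their correlation as zero.
    corr = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)
    top = np.argsort(-np.abs(corr))[:k]
    return np.sort(top)


def reduce_features(features, dims):
    """Keep only the selected dimensions of each frame."""
    return np.asarray(features)[:, dims]


# Synthetic stand-in data: 200 frames of 16-dimensional features.
rng = np.random.default_rng(0)
pitch = rng.normal(size=200)
feats = rng.normal(size=(200, 16))
feats[:, 3] = pitch + 0.1 * rng.normal(size=200)    # positively correlated
feats[:, 10] = -pitch + 0.1 * rng.normal(size=200)  # negatively correlated

dims = select_correlated_dims(feats, pitch, k=2)
reduced = reduce_features(feats, dims)
```

Ranking by absolute value keeps strongly anti-correlated dimensions as well, since they carry the same information with the opposite sign.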
Related papers
- Karaoker: Alignment-free Singing Voice Synthesis With Speech Training Data (2022) 2.26
- KaraSinger: Score-free Singing Voice Synthesis With VQ-VAE Using Mel-spectrograms (2021) 2.26
- Toward Leveraging Pre-trained Self-supervised Frontends For Automatic Singing Voice Understanding Tasks: Three Case Studies (2023) 0.00
- Everyone-can-sing: Zero-shot Singing Voice Synthesis And Conversion With Speech Reference (2025) 0.00
- VISinger2+: End-to-end Singing Voice Synthesis Augmented By Self-supervised Learning Representation (2024) 4.52
- Low-resource Self-supervised Learning With SSL-enhanced TTS (2023) 0.00
- Adversarially Trained Multi-singer Sequence-to-sequence Singing Synthesizer (2020) 7.81
- Semi-supervised Learning For Singing Synthesis Timbre (2020) 3.58