A Preliminary Investigation On Flexible Singing Voice Synthesis Through Decomposed Framework With Inferrable Features
2024 · Lester Phillip Violeta, Taketo Akama
Abstract
We investigate the feasibility of a singing voice synthesis (SVS) system that uses a decomposed framework to improve flexibility in generating singing voices. With data-driven approaches, SVS performs a music score-to-waveform mapping; however, this direct mapping limits control: for example, the system can synthesize only in the languages or with the singers present in the labeled singing datasets. As collecting large singing datasets labeled with music scores is expensive, we investigate an alternative approach that decomposes the SVS system and infers different singing voice features. We decompose the SVS system into three modules (linguistic, pitch contour, and synthesis), in which singing voice features such as linguistic content, F0, voiced/unvoiced, singer embeddings, and loudness are directly inferred from audio. Through this decomposed framework, we show that we can alleviate the labeled dataset requirements, adapt to different languages or singers, and inpaint the lyrica
Related papers
- SiFiSinger: A High-fidelity End-to-end Singing Voice Synthesizer Based On Source-filter Model (2024)
- VISinger2+: End-to-end Singing Voice Synthesis Augmented By Self-supervised Learning Representation (2024)
- Everyone-Can-Sing: Zero-shot Singing Voice Synthesis And Conversion With Speech Reference (2025)
- VISinger: Variational Inference With Adversarial Learning For End-to-end Singing Voice Synthesis (2021)
- DiffSinger: Singing Voice Synthesis Via Shallow Diffusion Mechanism (2021)
- Enhancing The Vocal Range Of Single-speaker Singing Voice Synthesis With Melody-unsupervised Pre-training (2023)
- Semi-supervised Learning For Singing Synthesis Timbre (2020)
- Leveraging Diverse Semantic-based Audio Pretrained Models For Singing Voice Conversion (2023)