TCSinger: Zero-shot Singing Voice Synthesis With Style Transfer And Multi-level Style Control
2024 · Yu Zhang, Ziyue Jiang, Ruiqi Li, et al.
Abstract
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, TCSinger proposes three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration,
Related papers
- StyleSinger: Style Transfer For Out-of-domain Singing Voice Synthesis (2023) · 9.92
- Everyone-can-sing: Zero-shot Singing Voice Synthesis And Conversion With Speech Reference (2025) · 0.00
- CoMelSinger: Discrete Token-based Zero-shot Singing Synthesis With Structured Melody Control And Guidance (2025) · 0.00
- ControlSpeech: Towards Simultaneous And Independent Zero-shot Speaker Cloning And Zero-shot Language Style Control (2024) · 9.40
- Improving Data Augmentation-based Cross-speaker Style Transfer For TTS With Singing Voice, Style Filtering, And F0 Matching (2024) · 0.00
- TechSinger: Technique Controllable Multilingual Singing Voice Synthesis Via Flow Matching (2025) · 7.81
- ZSVC: Zero-shot Style Voice Conversion With Disentangled Latent Diffusion Models And Adversarial Training (2025) · 0.00
- Zero-shot Sing Voice Conversion: Built Upon Clustering-based Phoneme Representations (2024) · 0.00