CoMelSinger: Discrete Token-based Zero-shot Singing Synthesis With Structured Melody Control And Guidance
2025 · Junchuan Zhao, Wei Zeng, Tianle Lyu, et al.
Abstract
Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extending these techniques to SVS remains non-trivial due to the requirement for precise melody control. In particular, prompt-based generation often introduces prosody leakage, where pitch information is inadvertently entangled within the timbre prompt, compromising controllability. We present CoMelSinger, a zero-shot SVS framework that enables structured and disentangled melody control within a discrete codec modeling paradigm. Built on the non-autoregressive MaskGCT architecture, CoMelSinger replaces conventional text inputs with lyric and pitch tokens, preserving in-context generalization while enhancing melody conditioning. To suppress prosody leakage, we propose a coarse-to-fine contrastive learning strategy th
Related papers
- TCSinger: Zero-shot Singing Voice Synthesis With Style Transfer And Multi-level Style Control (2024) · 7.16
- Everyone-Can-Sing: Zero-shot Singing Voice Synthesis And Conversion With Speech Reference (2025) · 0.00
- SiFiSinger: A High-fidelity End-to-end Singing Voice Synthesizer Based On Source-filter Model (2024) · 4.52
- KaraSinger: Score-free Singing Voice Synthesis With VQ-VAE Using Mel-spectrograms (2021) · 2.26
- Prompt-Singer: Controllable Singing-voice-synthesis With Natural Language Prompt (2024) · 6.77
- Vevo2: A Unified And Controllable Framework For Speech And Singing Voice Generation (2025) · 0.00
- A Melody-unsupervision Model For Singing Voice Synthesis (2021) · 5.84
- DiffSinger: Singing Voice Synthesis Via Shallow Diffusion Mechanism (2021) · 23.76