HiddenSinger: High-quality Singing Voice Synthesis Via Neural Audio Codec And Latent Diffusion Models
2023 · Ji-Sang Hwang, Sang-Hoon Lee, Seong-Whan Lee
Abstract
Recently, denoising diffusion models have demonstrated remarkable performance among generative models in various domains. However, in the speech domain, the application of diffusion models for synthesizing time-varying audio faces limitations in terms of complexity and controllability, as speech synthesis requires very high-dimensional samples with long-term acoustic features. To alleviate the challenges posed by model complexity in singing voice synthesis, we propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models. To ensure high-fidelity audio, we introduce an audio autoencoder that can encode audio into an audio codec as a compressed representation and reconstruct the high-fidelity audio from the low-dimensional compressed latent vector. Subsequently, we use the latent diffusion models to sample a latent representation from a musical score. In addition, our proposed model is extended to an unsupervised singing voice
Related papers
- DiffSinger: Singing Voice Synthesis Via Shallow Diffusion Mechanism (2021)
- Mandarin Singing Voice Synthesis With Denoising Diffusion Probabilistic Wasserstein GAN (2022)
- SingGAN: Generative Adversarial Network For High-fidelity Singing Voice Generation (2021)
- NaturalSpeech 2: Latent Diffusion Models Are Natural And Zero-shot Speech And Singing Synthesizers (2023)
- DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer And A Comprehensive Evaluation (2022)
- CSSinger: End-to-end Chunkwise Streaming Singing Voice Synthesis System Based On Conditional Variational Autoencoder (2024)
- Zero-shot Duet Singing Voices Separation With Diffusion Models (2023)
- InstructSing: High-fidelity Singing Voice Generation Via Instructing Yourself (2024)