VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis
2024 · Weiwei Lin, Chenhang He, Man-Wai Mak, et al.
Abstract
Achieving nuanced and accurate emulation of the human voice has been a longstanding goal in artificial intelligence. Although significant progress has been made in recent years, mainstream speech synthesis models still rely on supervised speaker modeling and explicit reference utterances. However, many aspects of the human voice, such as emotion, intonation, and speaking style, are hard to label accurately. In this paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework that can discover a latent speaker manifold and meaningful voice editing directions without supervision. VoxGenesis is conceptually simple. Instead of mapping speech features to waveforms deterministically, VoxGenesis transforms a Gaussian distribution into speech distributions conditioned and aligned by semantic tokens. This forces the model to learn a speaker distribution disentangled from the semantic content. During inference, sampling from the Gaussian distribu
Related papers
- Controllable Generation Of Artificial Speaker Embeddings Through Discovery Of Principal Directions (2023)
- Unispeaker: A Unified Approach For Multimodality-driven Speaker Generation (2025)
- Voxinstruct: Expressive Human Instruction-to-speech Generation With Unified Multilingual Codec Language Modelling (2024)
- Unisyn: An End-to-end Unified Model For Text-to-speech And Singing Voice Synthesis (2022)
- Learning Latent Representations For Speech Generation And Transformation (2017)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)
- Deep Encoder-decoder Models For Unsupervised Learning Of Controllable Speech Synthesis (2018)
- Semi-supervised Generative Modeling For Controllable Speech Synthesis (2019)