Mega-tts: Zero-shot Text-to-speech At Scale With Intrinsic Inductive Bias
2023 Β· Ziyue Jiang, Yi Ren, Zhenhui Ye, et al.
Abstract
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in achieving timbre and speech style generalization, particularly in zero-shot TTS. However, previous works usually encode speech into latent using audio codec and use autoregressive language models or diffusion models to generate it, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and each of them should be modeled using a module with appropriate inductive biases. From this perspective, we carefully design a novel and large zero-shot TTS system called Mega-TTS, which is trained with large-scale wild data and models different attributes in different ways: 1) Instead of using latent encoded by audio codec as the intermediate feature, we still choose spectrogram as it separates the phase and other attributes very well. Phase can be appropriately
Authors
(none)
Tags
Stats
Related papers
- Megatts 3: Sparse Alignment Enhanced Latent Diffusion Transformer For Zero-shot Speech Synthesis (2025)0.00
- Naturalspeech 2: Latent Diffusion Models Are Natural And Zero-shot Speech And Singing Synthesizers (2023)0.00
- HAM-TTS: Hierarchical Acoustic Modeling For Token-based Zero-shot Text-to-speech With Model And Data Scaling (2024)0.00
- ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations (2023)10.35
- BASE TTS: Lessons From Building A Billion-parameter Text-to-speech Model On 100K Hours Of Data (2024)0.00
- Spark-tts: An Efficient Llm-based Text-to-speech Model With Single-stream Decoupled Speech Tokens (2025)8.08
- Styletts-zs: Efficient High-quality Zero-shot Text-to-speech Synthesis With Distilled Time-varying Style Diffusion (2024)3.58
- Mobilespeech: A Fast And High-fidelity Framework For Mobile Zero-shot Text-to-speech (2024)0.00