High Fidelity Text-to-speech Via Discrete Tokens Using Token Transducer And Group Masked Language Model
2024 Β· Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, et al.
Abstract
We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic contents and alignment, and the Speaking module, which captures the timbre of the target voice to generate acoustic tokens from semantic tokens, enriching speech reconstruction. The Interpreting stage employs a transducer for its robustness in aligning text to speech. In contrast, the Speaking stage utilizes a Conformer-based architecture integrated with a Grouped Masked Language Model (G-MLM) to boost computational efficiency. Our experiments verify that this innovative structure surpasses the conventional models in the zero-shot scenario in terms of speech quality and speaker similarity.
Authors
(none)
Tags
Stats
Related papers
- Single-stage TTS With Masked Audio Token Modeling And Semantic Knowledge Distillation (2024)0.00
- Utilizing Neural Transducers For Two-stage Text-to-speech Via Semantic Token Prediction (2024)0.00
- Transduce And Speak: Neural Transducer For Text-to-speech With Semantic Token Prediction (2023)0.00
- Evaluating Text-to-speech Synthesis From A Large Discrete Token-based Speech Language Model (2024)0.00
- Mobilespeech: A Fast And High-fidelity Framework For Mobile Zero-shot Text-to-speech (2024)0.00
- HAM-TTS: Hierarchical Acoustic Modeling For Token-based Zero-shot Text-to-speech With Model And Data Scaling (2024)0.00
- Semi-supervised Learning For Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation (2020)5.24
- Maskgct: Zero-shot Text-to-speech With Masked Generative Codec Transformer (2024)7.98