Utilizing Neural Transducers For Two-stage Text-to-speech Via Semantic Token Prediction
2024 · Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, et al.
Abstract
We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages, utilizing discrete semantic tokens obtained from wav2vec2.0 embeddings. For robust and efficient alignment modeling, we employ a neural transducer, named the token transducer, for semantic token prediction, benefiting from its hard monotonic alignment constraints. Subsequently, a non-autoregressive (NAR) speech generator efficiently synthesizes waveforms from these semantic tokens. Additionally, a reference speech controls temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing each stage to focus on semantic and acoustic modeling. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
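To make the idea of "discrete semantic tokens" concrete, here is a minimal sketch of turning frame-level embeddings into token indices by nearest-centroid lookup. It is an illustration only: the abstract does not say how the tokens are derived, so the codebook-based quantization, the function name `quantize_to_semantic_tokens`, and the toy 2-dimensional vectors are all assumptions. Real inputs would be wav2vec2.0 frame embeddings with a much larger codebook.

```python
import numpy as np


def quantize_to_semantic_tokens(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame embedding (T, D) to the index of its nearest centroid (K, D).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Squared Euclidean distance between every frame and every centroid: shape (T, K).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # The token for each frame is the index of the closest centroid.
    return dists.argmin(axis=1)


# Toy stand-ins (assumed values): 3 centroids and 4 frames in a 2-D embedding space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]])

tokens = quantize_to_semantic_tokens(frames, codebook)
print(tokens.tolist())  # [0, 1, 2, 0]
```

In a two-stage setup like the one described, a sequence of such indices would be the prediction target of the first stage and the input of the second stage.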
Related papers
- Transduce And Speak: Neural Transducer For Text-to-speech With Semantic Token Prediction (2023) · 0.00
- High Fidelity Text-to-speech Via Discrete Tokens Using Token Transducer And Group Masked Language Model (2024) · 4.52
- Single-stage TTS With Masked Audio Token Modeling And Semantic Knowledge Distillation (2024) · 0.00
- Neural Speech Synthesis With Transformer Network (2018) · 19.95
- Sample Efficient Adaptive Text-to-speech (2018) · 0.00
- Transfer Learning From Speaker Verification To Multispeaker Text-to-speech Synthesis (2018) · 0.00
- Deep Voice 2: Multi-speaker Neural Text-to-speech (2017) · 0.00
- Textless Streaming Speech-to-speech Translation Using Semantic Speech Tokens (2024) · 3.58