Transduce And Speak: Neural Transducer For Text-to-speech With Semantic Token Prediction
2023 · Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, et al.
Abstract
We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens obtained from wav2vec 2.0 embeddings, which makes it straightforward to adopt a neural transducer for TTS and benefit from its monotonic alignment constraint. The proposed model first generates aligned semantic tokens with the neural transducer, then synthesizes speech from those tokens with a non-autoregressive (NAR) speech generator. This decoupled framework reduces the training complexity of TTS and lets the two stages focus on (1) linguistic and alignment modeling and (2) fine-grained acoustic modeling, respectively. Experiments on zero-shot adaptive TTS show that the proposed model outperforms the baselines in speech quality and speaker similarity on both objective and subjective measures. We also examine the inference speed and prosody controllability of the proposed model, demonstrating the potential of neural transducers for TTS.
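The abstract says only that the semantic tokens are "discretized" from wav2vec 2.0 embeddings. As a rough illustration, the sketch below assumes nearest-centroid assignment against a fixed codebook, which is one common way to do this and is not confirmed by the abstract. The function name `quantize_embeddings`, the toy identity codebook, and the synthetic frames are all hypothetical stand-ins; no real wav2vec 2.0 features are involved.

```python
import numpy as np


def quantize_embeddings(embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame embedding (T, D) to the index of its nearest codebook row (K, D).

    Assumption: tokens are nearest-centroid indices; the abstract does not
    specify the discretization method.
    """
    # Pairwise squared Euclidean distances between frames and centroids: shape (T, K).
    diffs = embeddings[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # One discrete token per frame.
    return np.argmin(dists, axis=1)


# Hypothetical 4-entry codebook in a 4-dimensional embedding space.
codebook = np.eye(4)
# Synthetic "frames": codebook rows 3, 3, 1, 0 with a small constant offset.
frames = codebook[[3, 3, 1, 0]] + 0.05
tokens = quantize_embeddings(frames, codebook)
print(tokens.tolist())
```

In the framework described above, a token sequence of this kind would be the prediction target of the neural transducer and the input to the NAR speech generator.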
Related papers
- Utilizing Neural Transducers For Two-stage Text-to-speech Via Semantic Token Prediction (2024)
- High Fidelity Text-to-speech Via Discrete Tokens Using Token Transducer And Group Masked Language Model (2024)
- Textless Streaming Speech-to-speech Translation Using Semantic Speech Tokens (2024)
- Neural Speech Synthesis With Transformer Network (2018)
- CLaM-TTS: Improving Neural Codec Language Model For Zero-shot Text-to-speech (2024)
- Transfer Learning From Speaker Verification To Multispeaker Text-to-speech Synthesis (2018)
- Text-to-speech Synthesis Based On Latent Variable Conversion Using Diffusion Probabilistic Model And Variational Autoencoder (2022)
- Sample Efficient Adaptive Text-to-speech (2018)