Speak, Read And Prompt: High-fidelity Text-to-speech With Minimal Supervision
2023 · Eugene Kharitonov, Damien Vincent, Zalán Borsos, et al.
Abstract
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of paral
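The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. It shows only the data flow: text to semantic tokens ("reading"), then semantic tokens to acoustic tokens ("speaking"), with an optional prompt conditioning the second stage. The names `toy_read`, `toy_speak`, and `synthesize` are hypothetical stand-ins for illustration, not the authors' models or API, and the arithmetic inside them is arbitrary.

```python
from typing import Callable, List, Optional, Sequence

Tokens = List[int]


def toy_read(text: str) -> Tokens:
    """Hypothetical "reading" stage: map text to fake semantic tokens."""
    return [ord(ch) % 50 for ch in text]


def toy_speak(semantic: Sequence[int], prompt: Optional[Sequence[int]]) -> Tokens:
    """Hypothetical "speaking" stage: map semantic tokens to fake acoustic tokens.

    The prompt (tokens from a short audio sample) only shifts the output here,
    standing in for the way a real model would change the voice, not the content.
    """
    offset = 0 if prompt is None else len(prompt)
    return [1000 + offset + t for t in semantic]


def synthesize(
    text: str,
    read: Callable[[str], Tokens],
    speak: Callable[[Sequence[int], Optional[Sequence[int]]], Tokens],
    prompt: Optional[Sequence[int]] = None,
) -> Tokens:
    """Compose the two sequence-to-sequence stages."""
    semantic = read(text)            # "reading": needs parallel text-audio data
    return speak(semantic, prompt)   # "speaking": trainable on audio-only data


if __name__ == "__main__":
    print(synthesize("hi", toy_read, toy_speak))
    print(synthesize("hi", toy_read, toy_speak, prompt=[7, 7, 7]))
```

Because the stages meet only at the semantic-token interface, either one can be retrained or swapped without touching the other, which is what lets the "speaking" module be trained on audio-only data.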
Related papers
- ParrotTTS: Text-to-speech Synthesis By Exploiting Self-supervised Representations (2023) · 0.00
- Semi-supervised Learning For Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation (2020) · 5.24
- Stable-TTS: Stable Speaker-adaptive Text-to-speech Synthesis Via Prosody Prompting (2024) · 4.52
- Generating Speakers By Prompting Listener Impressions For Pre-trained Multi-speaker Text-to-speech Systems (2024) · 3.58
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021) · 8.82
- mParrotTTS: Multilingual Multi-speaker Text To Speech Synthesis In Low Resource Setting (2023) · 0.00
- BASE TTS: Lessons From Building A Billion-parameter Text-to-speech Model On 100K Hours Of Data (2024) · 0.00
- ContextSpeech: Expressive And Efficient Text-to-speech For Paragraph Reading (2023) · 5.84