eCat: An End-to-end Model For Multi-speaker TTS & Many-to-many Fine-grained Prosody Transfer
2023 · Ammar Abbas, Sri Karlapati, Bastian Schnell, et al.
Abstract
We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion from speech. In Stage II, we learn to predict the prosody representations using the contextual information available in text. We compare eCat to CopyCat2, a model capable of both fine-grained prosody transfer (FPT) and multi-speaker TTS. We show that eCat statistically significantly reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7% across 2 languages, 3 locales, and 7 speakers, along with better target-speaker similarity in FPT. We also compare eCat to VITS, and show a statistically significant preference.
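The "reduces the gap in naturalness" figure in the abstract is a relative measure: how much of the distance between the baseline and human recordings the new model closes. The abstract does not spell out the formula, so the following is only a minimal sketch of one common way to compute such a number. The function names, the scores, and the choice to average per-speaker reductions are all assumptions made for illustration; none of the values come from the paper.

```python
from statistics import mean


def gap_reduction(baseline: float, candidate: float, reference: float) -> float:
    """Percentage of the baseline-to-reference gap closed by the candidate.

    Assumed definition: 100 * (candidate - baseline) / (reference - baseline).
    """
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference must score above baseline")
    return 100.0 * (candidate - baseline) / gap


def average_gap_reduction(scores: list[tuple[float, float, float]]) -> float:
    """Mean of per-speaker gap reductions (the averaging scheme is an assumption)."""
    return mean(gap_reduction(b, c, r) for b, c, r in scores)


# Hypothetical listening-test scores as (baseline, candidate, human recordings),
# one tuple per speaker. Illustrative values only, not results from the paper.
hypothetical_scores = [
    (60.0, 66.0, 72.0),
    (55.0, 60.0, 75.0),
]

print(average_gap_reduction(hypothetical_scores))
```

With this definition, a value of 100 would mean the candidate matches human recordings and 0 would mean no improvement over the baseline.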
Related papers
- CopyCat2: A Single Model For Multi-speaker TTS And Many-to-many Fine-grained Prosody Transfer (2022)
- CopyCat: Many-to-many Fine-grained Prosody Transfer For Neural Text-to-speech (2020)
- META-CAT: Speaker-informed Speech Embeddings Via Meta Information Concatenation For Multi-talker ASR (2024)
- UniCATS: A Unified Context-aware Text-to-speech Framework With Contextual VQ-diffusion And Vocoding (2023)
- CAT: A CTC-CRF Based ASR Toolkit Bridging The Hybrid And The End-to-end Approaches Towards Data Efficiency And Low Latency (2020)
- M2-CTTS: End-to-end Multi-scale Multi-modal Conversational Text-to-speech Synthesis (2023)
- Cotatron: Transcription-guided Speech Encoder For Any-to-many Voice Conversion Without Parallel Data (2020)
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021)