CopyCat2: A Single Model For Multi-speaker TTS And Many-to-many Fine-grained Prosody Transfer
2022 · Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, et al.
Abstract
In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at a fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech, which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by \(22.7
Related papers
- CopyCat: Many-to-many Fine-grained Prosody Transfer For Neural Text-to-speech (2020)
- Ecat: An End-to-end Model For Multi-speaker TTS & Many-to-many Fine-grained Prosody Transfer (2023)
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021)
- Cotatron: Transcription-guided Speech Encoder For Any-to-many Voice Conversion Without Parallel Data (2020)
- M2-CTTS: End-to-end Multi-scale Multi-modal Conversational Text-to-speech Synthesis (2023)
- Learning To Speak Fluently In A Foreign Language: Multilingual Speech Synthesis And Cross-language Voice Cloning (2019)
- Towards End-to-end Prosody Transfer For Expressive Speech Synthesis With Tacotron (2018)
- CAMP: A Two-stage Approach To Modelling Prosody In Context (2020)