CopyCat: Many-to-many Fine-grained Prosody Transfer For Neural Text-to-speech
2020 · Sri Karlapati, Alexis Moinet, Arnaud Joly, et al.
Abstract
Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through vario
Related papers
- Copycat2: A Single Model For Multi-speaker TTS And Many-to-many Fine-grained Prosody Transfer (2022)
- Ecat: An End-to-end Model For Multi-speaker TTS & Many-to-many Fine-grained Prosody Transfer (2023)
- Fine-grained Robust Prosody Transfer For Single-speaker Neural Text-to-speech (2019)
- Prosody Transfer In Neural Text To Speech Using Global Pitch And Loudness Features (2019)
- Cotatron: Transcription-guided Speech Encoder For Any-to-many Voice Conversion Without Parallel Data (2020)
- Towards End-to-end Prosody Transfer For Expressive Speech Synthesis With Tacotron (2018)
- Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS (2021)
- Cross-speaker Style Transfer With Prosody Bottleneck In Neural Speech Synthesis (2021)