Accent Conversion In Text-to-speech Using Multi-level VAE And Adversarial Training
2024 · Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, et al.
Abstract
With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that must be taken into account when building inclusive speech synthesizers. Inclusive speech technology aims to erase biases against specific groups, such as people with a particular accent. We note that state-of-the-art Text-to-Speech (TTS) systems may not currently be suitable for all people, regardless of their background, because they are designed to generate high-quality voices without focusing on accent. In this paper, we propose a TTS model that uses a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion in TTS, with a vision for more inclusive systems in the future. We evaluate performance with both objective metrics and subjective listening tests. The results show improved accent conversion ability compared to the baseline.
Related papers
- Accented Text-to-speech Synthesis With A Conditional Variational Autoencoder (2022) 0.00
- Accent Conversion Using Discrete Units With Parallel Data Synthesized From Controllable Accented TTS (2024) 0.00
- DART: Disentanglement Of Accent And Speaker Representation In Multispeaker Text-to-speech (2024) 0.00
- Improving Accent Conversion With Reference Encoder And End-to-end Text-to-speech (2020) 0.00
- Transfer The Linguistic Representations From TTS To Accent Conversion With Non-parallel Data (2024) 6.77
- Accent-VITS: Accent Transfer For End-to-end TTS (2023) 5.84
- TTS-guided Training For Accent Conversion Without Parallel Data (2022) 8.60
- Accent And Speaker Disentanglement In Many-to-many Voice Conversion (2020) 10.35