DART: Disentanglement Of Accent And Speaker Representation In Multispeaker Text-to-speech
2024 · Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, et al.
Abstract
Recent advancements in Text-to-Speech (TTS) systems have enabled the generation of natural and expressive speech from textual input. Accented TTS aims to enhance user experience by making the synthesized speech more relatable to minority group listeners, and useful across various applications and contexts. Speech synthesis can further be made more flexible by allowing users to choose any combination of speaker identity and accent, resulting in a wide range of personalized speech outputs. Current models struggle to disentangle speaker and accent representations, making it difficult to accurately imitate different accents while maintaining the same speaker characteristics. We propose a novel approach to disentangle speaker and accent representations using multi-level variational autoencoders (ML-VAE) and vector quantization (VQ) to improve flexibility and enhance personalization in speech synthesis. Our proposed method addresses the challenge of effectively separating speaker and accent ch
Related papers
- Multi-scale Accent Modeling And Disentangling For Multi-speaker Multi-accent Text-to-speech Synthesis (2024)
- Accent Conversion In Text-to-speech Using Multi-level VAE And Adversarial Training (2024)
- Accent And Speaker Disentanglement In Many-to-many Voice Conversion (2020)
- VANI: Very-lightweight Accent-controllable TTS For Native And Non-native Speakers With Identity Preservation (2023)
- Accented Text-to-speech Synthesis With A Conditional Variational Autoencoder (2022)
- Many-to-many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder (2021)
- Cross-dialect Text-to-speech In Pitch-accent Language Incorporating Multi-dialect Phoneme-level BERT (2024)
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)