Exploring Timbre Disentanglement In Non-autoregressive Cross-lingual Text-to-speech
2021 · Haoyue Zhan, Xinyuan Yu, Haitong Zhang, et al.
Abstract
In this paper, we study the disentanglement of speaker and language representations in non-autoregressive cross-lingual TTS models from various aspects. We propose a phoneme length regulator that solves the length mismatch problem between the IPA input sequence and monolingual alignment results. Using the phoneme length regulator, we present a FastPitch-based cross-lingual model with IPA symbols as input representations. Our experiments show that language-independent input representations (e.g., IPA symbols), an increasing number of training speakers, and explicit modeling of speech variance information all encourage non-autoregressive cross-lingual TTS models to disentangle speaker and language representations. The subjective evaluation shows that our proposed model achieves decent naturalness and speaker similarity in cross-language voice cloning.
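To make the length mismatch concrete: a monolingual aligner yields one duration per native phoneme, while the model input is an IPA sequence in which a single native phoneme may expand to several IPA symbols. The sketch below shows one illustrative way to reconcile the two lengths, by splitting each phoneme's frame count as evenly as possible across its IPA symbols. The abstract does not specify how the authors' regulator works, so the even-split rule, the function name `regulate_phoneme_lengths`, and the toy phoneme-to-IPA mapping are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def regulate_phoneme_lengths(
    phonemes: List[str],
    durations: List[int],
    phoneme_to_ipa: Dict[str, List[str]],
) -> Tuple[List[str], List[int]]:
    """Expand phoneme-level durations to IPA-symbol-level durations.

    Hypothetical helper: the even-split rule is an assumption, not the
    paper's stated method.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")

    ipa_symbols: List[str] = []
    ipa_durations: List[int] = []
    for phoneme, duration in zip(phonemes, durations):
        symbols = phoneme_to_ipa[phoneme]
        if not symbols:
            raise ValueError(f"no IPA symbols for phoneme {phoneme!r}")
        # Split the frame count evenly; the first `extra` symbols each
        # take one leftover frame so the total is preserved.
        base, extra = divmod(duration, len(symbols))
        for index, symbol in enumerate(symbols):
            ipa_symbols.append(symbol)
            ipa_durations.append(base + (1 if index < extra else 0))
    return ipa_symbols, ipa_durations


# Toy mapping (hypothetical, not a real phonological inventory).
toy_mapping = {"b": ["p"], "ai": ["a", "i"]}
symbols, frames = regulate_phoneme_lengths(["b", "ai"], [3, 7], toy_mapping)
```

Under this rule the durations always sum to the original frame count, so the expanded sequence stays aligned with the same number of spectrogram frames.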
Related papers
- CrossSpeech: Speaker-independent Acoustic Representation For Cross-lingual Speech Synthesis (2023)
- GenerTTS: Pronunciation Disentanglement For Timbre And Style Generalization In Cross-lingual Text-to-speech (2023)
- Cross-dialect Text-to-speech In Pitch-accent Language Incorporating Multi-dialect Phoneme-level BERT (2024)
- Multi-speaker Multi-style Speech Synthesis With Timbre And Style Disentanglement (2022)
- Disentangled Representation Learning For Multilingual Speaker Recognition (2022)
- AdaPitch: Adaption Multi-speaker Text-to-speech Conditioned On Pitch Disentangling With Untranscribed Data (2022)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features (2021)
- Speaker Disentanglement Of Speech Pre-trained Model Based On Interpretability (2025)