Incremental Disentanglement For Environment-aware Zero-shot Text-to-speech Synthesis
2024 · Ye-Xin Lu, Hui-Peng Du, Zheng-Yan Sheng, et al.
Abstract
This paper proposes an Incremental Disentanglement-based Environment-Aware zero-shot text-to-speech (TTS) method, dubbed IDEA-TTS, that can synthesize speech for unseen speakers while preserving the acoustic characteristics of a given environment reference speech. IDEA-TTS adopts VITS as the TTS backbone. To effectively disentangle the environment, speaker, and text factors, we propose an incremental disentanglement process, where an environment estimator is designed to first decompose the environmental spectrogram into an environment mask and an enhanced spectrogram. The environment mask is then processed by an environment encoder to extract environment embeddings, while the enhanced spectrogram facilitates the subsequent disentanglement of the speaker and text factors with the condition of the speaker embeddings, which are extracted from the environmental speech using a pretrained environment-robust speaker encoder. Finally, both the speaker and environment embeddings are conditioned
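The first stage of the incremental disentanglement described above can be sketched roughly as follows. This is a toy illustration, not the IDEA-TTS implementation: the abstract does not state how the mask relates to the two spectrograms, so the sketch assumes an element-wise multiplicative relation (environmental = mask × enhanced). The noise-floor heuristic, the mean-pooling "encoder", and the names `estimate_environment` and `encode_environment` are all hypothetical stand-ins for the learned modules. The pretrained speaker encoder and the VITS backbone are omitted.

```python
from typing import List, Tuple

# A magnitude spectrogram as [frames][frequency bins]; plain lists keep the sketch dependency-free.
Spectrogram = List[List[float]]


def estimate_environment(
    env_spec: Spectrogram, floor: float = 1e-3
) -> Tuple[Spectrogram, Spectrogram]:
    """Toy stand-in for the environment estimator.

    Splits an environmental spectrogram into (mask, enhanced) such that
    env_spec[t][k] == mask[t][k] * enhanced[t][k]  (assumed relation).
    """
    n_bins = len(env_spec[0])
    # Hypothetical heuristic: treat the per-bin minimum over time as the environment floor.
    noise = [min(frame[k] for frame in env_spec) for k in range(n_bins)]
    # "Enhanced" spectrogram: subtract the floor, clamped to stay strictly positive.
    enhanced = [
        [max(frame[k] - noise[k], floor) for k in range(n_bins)] for frame in env_spec
    ]
    # Environment mask: whatever multiplicative factor maps enhanced back to the input.
    mask = [
        [env_spec[t][k] / enhanced[t][k] for k in range(n_bins)]
        for t in range(len(env_spec))
    ]
    return mask, enhanced


def encode_environment(mask: Spectrogram) -> List[float]:
    """Toy stand-in for the environment encoder: average the mask over time."""
    n_frames = len(mask)
    n_bins = len(mask[0])
    return [sum(frame[k] for frame in mask) / n_frames for k in range(n_bins)]


if __name__ == "__main__":
    spec = [[1.0, 2.0], [3.0, 2.5], [2.0, 4.0]]
    mask, enhanced = estimate_environment(spec)
    env_embedding = encode_environment(mask)
    print(len(env_embedding), "environment embedding dimensions")
```

In this sketch the enhanced spectrogram would be passed on to the speaker and text disentanglement stage, while only the mask feeds the environment embedding; in the actual model both steps are learned networks rather than fixed heuristics.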
Related papers
- DAIEN-TTS: Disentangled Audio Infilling For Environment-aware Text-to-speech Synthesis (2025) · 0.00
- Environment Aware Text-to-speech Synthesis (2021) · 6.34
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022) · 10.97
- Unsupervised TTS Acoustic Modeling For TTS With Conditional Disentangled Sequential VAE (2022) · 5.84
- Disentangled Representation Learning For Environment-agnostic Speaker Recognition (2024) · 4.82
- Towards Zero-shot Text-based Voice Editing Using Acoustic Context Conditioning, Utterance Embeddings, And Reference Encoders (2022) · 0.00
- DART: Disentanglement Of Accent And Speaker Representation In Multispeaker Text-to-speech (2024) · 0.00
- Generalizable Zero-shot Speaker Adaptive Speech Synthesis With Disentangled Representations (2023) · 6.34