Environment Aware Text-to-speech Synthesis
2021 · Daxin Tan, Guangyan Zhang, Tan Lee
Abstract
This study aims at designing an environment-aware text-to-speech (TTS) system that can generate speech to suit specific acoustic environments. It is also motivated by the desire to leverage massive data of speech audio from heterogeneous sources in TTS system development. The key idea is to model the acoustic environment in speech audio as a factor of data variability and incorporate it as a condition in the process of neural network based speech synthesis. Two embedding extractors are trained with two purposely constructed datasets for characterization and disentanglement of speaker and environment factors in speech. A neural network model is trained to generate speech from extracted speaker and environment embeddings. Objective and subjective evaluation results demonstrate that the proposed TTS system is able to effectively disentangle speaker and environment factors and synthesize speech audio that carries designated speaker characteristics and environment attribute. Audio samples a
Related papers
- Incremental Disentanglement For Environment-aware Zero-shot Text-to-speech Synthesis (2024)
- DAIEN-TTS: Disentangled Audio Infilling For Environment-aware Text-to-speech Synthesis (2025)
- Describe Where You Are: Improving Noise-robustness For Speech Emotion Recognition With Text Description Of The Environment (2024)
- Sample Efficient Adaptive Text-to-speech (2018)
- DRSpeech: Degradation-robust Text-to-speech Synthesis With Frame-level And Utterance-level Acoustic Representation Learning (2022)
- Dynamic Prosody Generation For Speech Synthesis Using Linguistics-driven Acoustic Embedding Selection (2019)
- Semi-supervised Learning For Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation (2020)
- Investigating Context Features Hidden In End-to-end TTS (2018)