Content-dependent Fine-grained Speaker Embedding For Zero-shot Speaker Adaptation In Text-to-speech Synthesis
2022 · Yixuan Zhou, Changhe Song, Xiang Li, et al.
Abstract
Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time or additional parameters. Previous studies usually use a speaker encoder to extract a global, fixed speaker embedding from reference speech, and several have tried variable-length speaker embeddings. However, they do not transfer the personal pronunciation characteristics tied to phoneme content, which leads to poor speaker similarity in detailed speaking styles and pronunciation habits. To improve the speaker encoder's ability to model personal pronunciation characteristics, we propose content-dependent fine-grained speaker embedding for zero-shot speaker adaptation. Local content embeddings and the corresponding local speaker embeddings are extracted from a reference speech. Instead of modeling the temporal relations, a reference attention module is introduced to model the content relevance between the reference speech and the input text, and to generate the fine-
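The reference attention idea in the abstract can be illustrated with a toy scaled dot-product attention. This is only a sketch under assumptions, not the authors' implementation: the abstract does not spell out the query/key/value roles, so the code assumes that text-side content vectors act as queries, local content embeddings from the reference act as keys, and local speaker embeddings from the reference act as values. The function name `reference_attention` and all toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(text_queries, ref_content_keys, ref_speaker_values):
    """Scaled dot-product attention from text positions to reference positions.

    Assumed roles (an illustration, not confirmed by the abstract):
      text_queries       : one content vector per input phoneme
      ref_content_keys   : one local content embedding per reference segment
      ref_speaker_values : one local speaker embedding per reference segment
    Returns one speaker embedding per input phoneme and the attention weights.
    """
    scale = math.sqrt(len(ref_content_keys[0]))
    value_dim = len(ref_speaker_values[0])
    outputs, all_weights = [], []
    for query in text_queries:
        # Content relevance between this phoneme and every reference segment.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in ref_content_keys
        ]
        weights = softmax(scores)
        # Weighted mix of the reference's local speaker embeddings.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, ref_speaker_values))
            for d in range(value_dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy data: three input phonemes, two reference segments.
text_queries = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
ref_content_keys = [[4.0, 0.0], [0.0, 4.0]]
ref_speaker_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

embeddings, weights = reference_attention(
    text_queries, ref_content_keys, ref_speaker_values
)
```

In this toy setup the output has one embedding per input phoneme, regardless of how long the reference is. The first and third phonemes share the same content vector, so they receive the same speaker embedding whatever their position. The weights depend only on content similarity and not on where a segment occurs in time.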
Related papers
- Zero-shot Multi-speaker Text-to-speech With State-of-the-art Neural Speaker Embeddings (2019) 15.67
- nnSpeech: Speaker-guided Conditional Variational Autoencoder For Zero-shot Multi-speaker Text-to-speech (2022) 9.59
- Generalizable Zero-shot Speaker Adaptive Speech Synthesis With Disentangled Representations (2023) 6.34
- Zero-shot Personalized Lip-to-speech Synthesis With Face Image Based Voice Control (2023) 5.84
- SEF-VC: Speaker Embedding Free Zero-shot Voice Conversion With Cross Attention (2023) 0.00
- Towards Zero-shot Text-based Voice Editing Using Acoustic Context Conditioning, Utterance Embeddings, And Reference Encoders (2022) 0.00
- Meta-TTS: Meta-learning For Few-shot Speaker Adaptive Text-to-speech (2021) 12.74
- Noise-robust Zero-shot Text-to-speech Synthesis Conditioned On Self-supervised Speech-representation Model With Adapters (2024) 7.50