Zero-shot Multi-speaker Text-to-speech With State-of-the-art Neural Speaker Embeddings
2019 · Erica Cooper, Cheng-I Lai, Yusuke Yasuda, et al.
Abstract
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers. We investigate multi-speaker modeling for end-to-end text-to-speech synthesis and study the effects of different types of state-of-the-art neural speaker embeddings on speaker similarity for unseen speakers. Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task; these embeddings also improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.
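The abstract compares speaker embeddings by their equal error rate (EER) on a speaker verification task. As a rough illustration of that metric, and not the authors' evaluation code, the sketch below sweeps a decision threshold over the trial scores and reports the point where the false-accept and false-reject rates are closest. The function name `equal_error_rate` and the example scores are assumptions made up for this illustration; none of them come from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False reject: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False accept: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores for illustration only (not results from the paper).
same_speaker = [0.9, 0.8, 0.7, 0.3]
different_speaker = [0.6, 0.4, 0.2, 0.1]
print(equal_error_rate(same_speaker, different_speaker))  # 0.25
```

A lower EER means the embedding separates same-speaker from different-speaker trials more cleanly. Evaluation toolkits typically interpolate between thresholds rather than picking the nearest observed score, so this sketch is only an approximation on small trial lists.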
Related papers
- Content-dependent Fine-grained Speaker Embedding For Zero-shot Speaker Adaptation In Text-to-speech Synthesis (2022) · 10.07
- Adapting End-to-end Neural Speaker Verification To New Languages And Recording Conditions With Adversarial Training (2018) · 9.59
- nnSpeech: Speaker-guided Conditional Variational Autoencoder For Zero-shot Multi-speaker Text-to-speech (2022) · 9.59
- Learning Speaker Embedding From Text-to-speech (2020) · 5.84
- On Deep Speaker Embeddings For Text-independent Speaker Recognition (2018) · 11.93
- Generalizable Zero-shot Speaker Adaptive Speech Synthesis With Disentangled Representations (2023) · 6.34
- Sample Efficient Adaptive Text-to-speech (2018) · 0.00
- A Unified Speaker Adaptation Method For Speech Synthesis Using Transcribed And Untranscribed Speech With Backpropagation (2019) · 0.00