Attentron: Few-shot Text-to-speech Utilizing Attention-based Variable-length Embedding
2020 · Seungwoo Choi, Seungju Han, Dongyoung Kim, et al.
Abstract
With growing demand for personalization, there is an emerging need for few-shot TTS systems that clone a speaker's voice from only a small amount of data. To address this need, we propose Attentron, a few-shot TTS model that clones the voices of speakers unseen during training. It introduces two encoders, each serving a different purpose. A fine-grained encoder extracts variable-length style information via an attention mechanism, and a coarse-grained encoder greatly stabilizes speech synthesis, avoiding unintelligible gibberish even when synthesizing speech for unseen speakers. In addition, the model can scale to an arbitrary number of reference audio clips to improve the quality of the synthesized speech. According to our experiments, including a human evaluation, the proposed model significantly outperforms state-of-the-art models in speaker similarity and quality when generating speech for unseen speakers.
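As a rough illustration of how an attention mechanism can read style information from a variable number of reference frames, the sketch below uses generic scaled dot-product attention in plain Python. It is not the authors' implementation: the abstract does not specify the attention form, and the function names (`attend`, `pool_references`, `fake_clip`), the 4-dimensional frames, and the random toy data are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Generic scaled dot-product attention for a single query vector.

    query:  list of d floats (e.g. one decoder step)
    keys:   list of T lists of d floats (reference frames)
    values: list of T lists of d_v floats (style features per frame)
    Returns (context, weights); context has d_v floats, weights has T.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    d_v = len(values[0])
    context = [
        sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)
    ]
    return context, weights


def pool_references(references):
    """Concatenate the frame sequences of any number of reference clips."""
    return [frame for clip in references for frame in clip]


# Hypothetical toy data: three reference clips of different lengths.
rng = random.Random(0)


def fake_clip(num_frames, dim=4):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_frames)]


references = [fake_clip(5), fake_clip(8), fake_clip(3)]
frames = pool_references(references)  # 5 + 8 + 3 frames, no fixed length
query = [rng.uniform(-1, 1) for _ in range(4)]

# For simplicity the same frames serve as both keys and values here.
context, weights = attend(query, frames, frames)
```

Because the attention weights are normalized over however many frames are pooled, adding another reference clip only lengthens `frames`; nothing in `attend` depends on a fixed number or length of references, which is the property the abstract describes as scaling to an arbitrary number of reference audio clips.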
Related papers
- Small-E: Small Language Model With Linear Attention For Efficient Speech Synthesis (2024) 9.02
- Meta-TTS: Meta-learning For Few-shot Speaker Adaptive Text-to-speech (2021) 12.74
- Sample Efficient Adaptive Text-to-speech (2018) 0.00
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021) 8.82
- Low-resource Expressive Text-to-speech Using Data Augmentation (2020) 11.29
- Cross-lingual Multi-speaker Text-to-speech Synthesis For Voice Cloning Without Using Parallel Corpus For Unseen Speakers (2019) 0.00
- Neural Voice Cloning With A Few Samples (2018) 0.00
- Translatotron 2: High-quality Direct Speech-to-speech Translation With Voice Preservation (2021) 0.00