Text-to-speech For Unseen Speakers Via Low-complexity Discrete Unit-based Frame Selection
2024 · Ismail Rasim Ulgen, Shreeram Suresh Chandra, Junchen Lu, et al.
Abstract
Synthesizing the voices of unseen speakers remains a persisting challenge in multi-speaker text-to-speech (TTS). Existing methods model speaker characteristics through speaker conditioning during training, leading to increased model complexity and limiting reproducibility and accessibility. A low-complexity alternative would broaden the reach of speech synthesis research, particularly in settings with limited computational and data resources. To this end, we propose SelectTTS, a simple and effective alternative. SelectTTS selects appropriate frames from the target speaker and decodes them using frame-level self-supervised learning (SSL) features. We demonstrate that this approach can effectively capture speaker characteristics for unseen speakers and achieves performance comparable to state-of-the-art multi-speaker TTS frameworks on both objective and subjective metrics. By directly selecting frames from the target speaker's speech, SelectTTS enables generalization to unseen speakers w
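The frame-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that frame-level SSL features have already been extracted as NumPy arrays, that discrete units are defined by nearest-centroid assignment to a given codebook, and that a sequence of target units has already been predicted from text. The function names (assign_units, select_frames) and the fallback rule are hypothetical choices made for illustration only.

```python
import numpy as np


def assign_units(frames, centroids):
    """Label each frame with the index of its nearest centroid (its discrete unit)."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def select_frames(target_units, ref_frames, ref_units, centroids):
    """For each requested unit, pick one frame from the target speaker's reference speech.

    Assumption (illustrative only): among reference frames that carry the
    requested unit, take the one closest to that unit's centroid. If the
    reference speech has no frame with that unit, fall back to searching
    over all reference frames.
    """
    selected = np.empty((len(target_units), ref_frames.shape[1]), dtype=ref_frames.dtype)
    for i, unit in enumerate(target_units):
        candidates = np.flatnonzero(ref_units == unit)
        if candidates.size == 0:
            # Unit never occurs in the reference speech: consider every frame.
            candidates = np.arange(len(ref_frames))
        dists = ((ref_frames[candidates] - centroids[unit]) ** 2).sum(axis=-1)
        selected[i] = ref_frames[candidates[dists.argmin()]]
    return selected


if __name__ == "__main__":
    # Toy 2-D "features" standing in for real SSL frames.
    centroids = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
    ref_frames = np.array([[0.5, 0.0], [9.0, 10.0], [0.1, 0.1], [11.0, 12.0]])
    ref_units = assign_units(ref_frames, centroids)
    chosen = select_frames([0, 1, 2, 0], ref_frames, ref_units, centroids)
    print(ref_units)
    print(chosen)
```

Because every output frame is copied from the target speaker's own speech, speaker identity is carried by the selected frames rather than by a learned speaker embedding. Turning the selected frames into a waveform would require a separate decoder, which this sketch leaves out.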
Related papers
- Semi-supervised Learning For Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation (2020)
- Stable-TTS: Stable Speaker-adaptive Text-to-speech Synthesis Via Prosody Prompting (2024)
- Analyzing Speech Unit Selection For Textless Speech-to-speech Translation (2024)
- kNN Retrieval For Simple And Effective Zero-shot Multi-speaker Text-to-speech (2024)
- Low-resource Self-supervised Learning With SSL-enhanced TTS (2023)
- Speak, Read And Prompt: High-fidelity Text-to-speech With Minimal Supervision (2023)
- Sample Efficient Adaptive Text-to-speech (2018)
- TDASS: Target Domain Adaptation Speech Synthesis Framework For Multi-speaker Low-resource TTS (2022)