Improving Language Model-based Zero-shot Text-to-speech Synthesis With Multi-scale Acoustic Prompts
2023 · Shun Lei, Yixuan Zhou, Liyang Chen, et al.
Abstract
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveforms into discrete acoustic tokens and modeling these tokens with a language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt from an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone a personal speaking style. In this paper, we propose a novel zero-shot TTS model with multi-scale acoustic prompts, based on the neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme level from a style prompt consisting of multiple sentences. Following that, a VALL-E-based acoustic decoder is used to model the timbre from a timbre prompt at the frame level and generate speech. The experimental results show that our proposed method outperforms base
Related papers
- VALL-E R: Robust And Efficient Zero-shot Text-to-speech Synthesis Via Monotonic Alignment (2024)
- Speak Foreign Languages With Your Own Voice: Cross-lingual Neural Codec Language Modeling (2023)
- Improving Audio Codec-based Zero-shot Text-to-speech Synthesis With Multi-modal Context And Large Language Model (2024)
- VALL-E 2: Neural Codec Language Models Are Human Parity Zero-shot Text To Speech Synthesizers (2024)
- Expressive Prompting: Improving Emotion Intensity And Speaker Consistency In Zero-shot TTS (2024)
- ELLA-V: Stable Neural Codec Language Modeling With Alignment-guided Sequence Reordering (2024)
- HALL-E: Hierarchical Neural Codec Language Model For Minute-long Zero-shot Text-to-speech Synthesis (2024)
- CLaM-TTS: Improving Neural Codec Language Model For Zero-shot Text-to-speech (2024)