VALL-E 2: Neural Codec Language Models Are Human Parity Zero-shot Text To Speech Synthesizers
2024 · Sanyuan Chen, Shujie Liu, Long Zhou, et al.
Abstract
This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, ev
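The two enhancements named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says that Repetition Aware Sampling accounts for token repetition in the decoding history and that Grouped Code Modeling groups codec codes to shorten the sequence. The fallback to sampling from the full distribution, the parameters `window` and `max_ratio`, and all function names below are assumptions made for illustration.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample an index from the smallest top-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng, top_p=0.8, window=10, max_ratio=0.1):
    """Nucleus-sample a token; if it repeats too often in recent history, resample.

    Assumption: the fallback draws from the full distribution `probs`.
    `window` and `max_ratio` are hypothetical knobs, not values from the paper.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # The candidate dominates the recent history, so draw from the
            # untruncated distribution to give other tokens a chance.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Partition a code sequence into consecutive groups of `group_size`.

    The grouped sequence has ceil(len(codes) / group_size) steps; the last
    group may be shorter if the length is not a multiple of group_size.
    """
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.6, 0.3, 0.1]
    history = []
    for _ in range(20):
        history.append(repetition_aware_sample(probs, history, rng))
    print("sampled tokens:", history)
    print("grouped:", group_codes(history, group_size=2))
```

With `group_size=2`, a 20-token sequence is modeled as 10 grouped steps, which is the sense in which grouping shortens the sequence the language model has to handle.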
Related papers
- VALL-E R: Robust And Efficient Zero-shot Text-to-speech Synthesis Via Monotonic Alignment (2024)
- ELLA-V: Stable Neural Codec Language Modeling With Alignment-guided Sequence Reordering (2024)
- Speak Foreign Languages With Your Own Voice: Cross-lingual Neural Codec Language Modeling (2023)
- Improving Language Model-based Zero-shot Text-to-speech Synthesis With Multi-scale Acoustic Prompts (2023)
- VALL-T: Decoder-only Generative Transducer For Robust And Decoding-controllable Text-to-speech (2024)
- TacoLM: Gated Attention Equipped Codec Language Model Are Efficient Zero-shot Text To Speech Synthesizers (2024)
- HALL-E: Hierarchical Neural Codec Language Model For Minute-long Zero-shot Text-to-speech Synthesis (2024)
- CLaM-TTS: Improving Neural Codec Language Model For Zero-shot Text-to-speech (2024)