ELLA-V: Stable Neural Codec Language Modeling With Alignment-guided Sequence Reordering
2024 · Yakun Song, Zhuo Chen, Xiaofei Wang, et al.
Abstract
The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) difficulty in achieving fine-grained control over the synthesized speech with an autoregressive (AR) language model; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. The experimental findings reveal that our model outperforms VAL
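The interleaving idea stated in the abstract can be illustrated with a minimal sketch. It assumes that a forced alignment has already grouped the acoustic (codec) token ids by phoneme; the function name `interleave`, the phoneme labels, and the token ids are all hypothetical and chosen only for illustration. The sketch shows only the ordering property the abstract names (each phoneme token placed ahead of its own acoustic tokens) and is not the authors' implementation.

```python
from typing import List, Sequence, Union

# A mixed sequence holds phoneme labels (str) and codec token ids (int).
Token = Union[str, int]


def interleave(
    phonemes: Sequence[str],
    aligned_codes: Sequence[Sequence[int]],
) -> List[Token]:
    """Place each phoneme token directly ahead of its aligned acoustic tokens.

    Assumption: aligned_codes[i] contains the acoustic token ids that a
    forced aligner assigned to phonemes[i].
    """
    if len(phonemes) != len(aligned_codes):
        raise ValueError("each phoneme needs exactly one group of acoustic tokens")

    sequence: List[Token] = []
    for phoneme, codes in zip(phonemes, aligned_codes):
        sequence.append(phoneme)   # phoneme token comes first
        sequence.extend(codes)     # followed by its acoustic tokens
    return sequence


# Hypothetical toy input: four phonemes with made-up codec token ids.
phonemes = ["HH", "AH", "L", "OW"]
aligned_codes = [[101, 102], [203], [304, 305, 306], [407]]

mixed = interleave(phonemes, aligned_codes)
print(mixed)
# ['HH', 101, 102, 'AH', 203, 'L', 304, 305, 306, 'OW', 407]
```

Because every acoustic token in such a sequence follows the phoneme it belongs to, a decoder working left to right always knows which phoneme it is currently rendering, which is the kind of phoneme-level control the abstract refers to.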
Related papers
- VALL-E R: Robust And Efficient Zero-shot Text-to-speech Synthesis Via Monotonic Alignment (2024)
- Speak Foreign Languages With Your Own Voice: Cross-lingual Neural Codec Language Modeling (2023)
- VALL-E 2: Neural Codec Language Models Are Human Parity Zero-shot Text To Speech Synthesizers (2024)
- HALL-E: Hierarchical Neural Codec Language Model For Minute-long Zero-shot Text-to-speech Synthesis (2024)
- Improving Language Model-based Zero-shot Text-to-speech Synthesis With Multi-scale Acoustic Prompts (2023)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)
- VALL-T: Decoder-only Generative Transducer For Robust And Decoding-controllable Text-to-speech (2024)
- TacoLM: Gated Attention Equipped Codec Language Model Are Efficient Zero-shot Text To Speech Synthesizers (2024)