Speak Foreign Languages With Your Own Voice: Cross-lingual Neural Codec Language Modeling
2023 · Ziqiang Zhang, Long Zhou, Chengyi Wang, et al.
Abstract
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language from just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problem, which can be controlled by a language ID. Audio samples are available at https://aka.ms/vallex.
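The conditioning scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Prompt`, `build_prompt`, `generate`, and `LANG_IDS`, the integer token ids, and the greedy stopping rule are all assumptions made for illustration. The sketch only shows the shape of the idea: source-language acoustic tokens and target-language text form the prompt, a language ID is attached, and a model callback extends the acoustic sequence one token at a time.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical language-ID table; the abstract does not specify a vocabulary.
LANG_IDS = {"en": 0, "zh": 1}


@dataclass
class Prompt:
    text_tokens: List[int]      # target-language text, e.g. phoneme ids
    acoustic_tokens: List[int]  # codec tokens of the source-language utterance
    lang_id: int                # language ID used to steer the output accent


def build_prompt(src_acoustic: List[int], tgt_text: List[int], tgt_lang: str) -> Prompt:
    """Bundle the source speech tokens, target text, and language ID."""
    if tgt_lang not in LANG_IDS:
        raise ValueError(f"unknown language: {tgt_lang}")
    return Prompt(list(tgt_text), list(src_acoustic), LANG_IDS[tgt_lang])


def generate(
    prompt: Prompt,
    next_token: Callable[[Prompt, List[int]], int],
    eos: int,
    max_len: int = 1000,
) -> List[int]:
    """Extend the acoustic sequence token by token until eos or max_len.

    `next_token` stands in for the trained codec language model; here it is
    any callable that maps (prompt, tokens generated so far) to a token id.
    """
    generated: List[int] = []
    while len(generated) < max_len:
        token = next_token(prompt, generated)
        if token == eos:
            break
        generated.append(token)
    return generated
```

In a real system, `next_token` would wrap a trained model and the returned token ids would be passed to a codec decoder to produce a waveform; both steps are outside the scope of this sketch.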
Related papers
- ELLA-V: Stable Neural Codec Language Modeling With Alignment-guided Sequence Reordering (2024)
- Improving Language Model-based Zero-shot Text-to-speech Synthesis With Multi-scale Acoustic Prompts (2023)
- VALL-E R: Robust And Efficient Zero-shot Text-to-speech Synthesis Via Monotonic Alignment (2024)
- VALL-E 2: Neural Codec Language Models Are Human Parity Zero-shot Text To Speech Synthesizers (2024)
- VioLA: Unified Codec Language Models For Speech Recognition, Synthesis, And Translation (2023)
- VALL-T: Decoder-only Generative Transducer For Robust And Decoding-controllable Text-to-speech (2024)
- SpeechX: Neural Codec Language Model As A Versatile Speech Transformer (2023)
- HALL-E: Hierarchical Neural Codec Language Model For Minute-long Zero-shot Text-to-speech Synthesis (2024)