VoxInstruct: Expressive Human Instruction-to-speech Generation With Unified Multilingual Codec Language Modelling
2024 · Yixuan Zhou, Xiaoyu Qin, Zeyu Jin, et al.
Abstract
Recent AIGC systems can generate digital multimedia content, such as text, images and video, from human language instructions. However, when it comes to speech, existing methods for human instruction-to-speech generation exhibit two limitations. Firstly, they require inputs to be divided into a content prompt (transcript) and a description prompt (style and speaker), instead of directly supporting human instruction. This division is less natural in form and does not align with other AIGC models. Secondly, the practice of utilizing an independent description prompt to model speech style, without considering the transcript content, restricts the ability to control speech at a fine-grained level. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends traditional text-to-speech tasks into a general human instruction-to-speech task. Our approach enhances the expressiveness of human instru
Related papers
- Towards General-purpose Text-instruction-guided Voice Conversion (2023) · 0.00
- VoxGenesis: Unsupervised Discovery Of Latent Speaker Manifold For Speech Synthesis (2024) · 0.00
- AudioX: A Unified Framework For Anything-to-audio Generation (2025) · 0.00
- SpeechX: Neural Codec Language Model As A Versatile Speech Transformer (2023) · 11.85
- JavisDiT++: Unified Modeling And Optimization For Joint Audio-video Generation (2026) · 0.00
- Speak Foreign Languages With Your Own Voice: Cross-lingual Neural Codec Language Modeling (2023) · 0.00
- Seeing What You Say: Expressive Image Generation From Speech (2025) · 0.00
- UniCATS: A Unified Context-aware Text-to-speech Framework With Contextual VQ-diffusion And Vocoding (2023) · 10.35