SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
2023 · Xiaofei Wang, Manthan Thakker, Zhuo Chen, et al.
Abstract
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving
Related papers
- CLaM-TTS: Improving Neural Codec Language Model For Zero-shot Text-to-speech (2024) 0.00
- LiveSpeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes (2024) 5.84
- AudioX: A Unified Framework For Anything-to-audio Generation (2025) 0.00
- Improving Audio Codec-based Zero-shot Text-to-speech Synthesis With Multi-modal Context And Large Language Model (2024) 2.26
- SSR-Speech: Towards Stable, Safe And Robust Zero-shot Text-based Speech Editing And Synthesis (2024) 2.26
- Spark-TTS: An Efficient LLM-based Text-to-speech Model With Single-stream Decoupled Speech Tokens (2025) 8.08
- Speak Foreign Languages With Your Own Voice: Cross-lingual Neural Codec Language Modeling (2023) 0.00
- Improving Language Model-based Zero-shot Text-to-speech Synthesis With Multi-scale Acoustic Prompts (2023) 3.58