Get Large Language Models Ready To Speak: A Late-fusion Approach For Speech Generation
2024 · Maohao Shen, Shun Zhang, Jilong Wu, et al.
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, extending text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-experts architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating the catastrophic forgetting issue in either modality. Finally, we explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.
Related papers
- Boosting Large Language Model For Speech Synthesis: An Empirical Study (2023)
- AudioChatLlama: Towards General-purpose Speech Abilities For LLMs (2023)
- Paralinguistics-enhanced Large Language Modeling Of Spoken Dialogue (2023)
- Recent Advances In Speech Language Models: A Survey (2024)
- Llasa: Scaling Train-time And Inference-time Compute For Llama-based Speech Synthesis (2025)
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)
- LLaMA-Omni: Seamless Speech Interaction With Large Language Models (2024)
- Enhancing Code-switched Text-to-speech Synthesis Capability In Large Language Models With Only Monolingual Corpora (2024)