Boosting Large Language Model For Speech Synthesis: An Empirical Study
2023 · Hongkun Hao, Long Zhou, Shujie Liu, et al.
Abstract
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending their language ability to other modalities, such as speech and vision. Nevertheless, most previous work focuses on prompting LLMs with perception abilities like auditory comprehension, and an effective approach for augmenting LLMs with speech synthesis capabilities remains unclear. In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining the pre-trained LLMs LLaMA/OPT with the text-to-speech synthesis model VALL-E. We compare three methods of integrating LLMs with speech synthesis models: directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder. Experimental results show that directly fine-tuning LLMs with the LoRA method to boost speech synthesis capability does not work well, and superposed LLMs and V
Related papers
- Get Large Language Models Ready To Speak: A Late-fusion Approach For Speech Generation (2024) · 5.24
- Recent Advances In Speech Language Models: A Survey (2024) · 14.64
- Enhancing The Stability Of LLM-based Speech Generation Systems Through Self-supervised Representations (2024) · 0.00
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023) · 8.09
- Enhancing Code-switched Text-to-speech Synthesis Capability In Large Language Models With Only Monolingual Corpora (2024) · 0.00
- From Hype To Insight: Rethinking Large Language Model Integration In Visual Speech Recognition (2025) · 0.00
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024) · 0.00
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) · 9.59