Style-Talker: Finetuning Audio Language Model And Style-based Text-to-speech Model For Fast Spoken Dialogue Generation
2024 · Yinghao Aaron Li, Xilin Jiang, Jordan Darefsky, et al.
Abstract
The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to enable end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily due to the extensive dataset and computational resources required. The conventional approach of cascading automatic speech recognition (ASR), LLM, and text-to-speech (TTS) models in a pipeline, while effective, suffers from unnatural prosody because it lacks direct interactions between the input audio and its transcribed text and the output audio. These systems are also limited by their inherent latency from the ASR process for real-time applications. This paper introduces Style-Talker, an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation. Style-Talker takes user input audio an
Related papers
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) · 10.97
- StyleSpeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024) · 6.34
- End-to-end Text-to-speech Based On Latent Representation Of Speaking Styles Using Spontaneous Dialogue (2022) · 8.35
- STYLER: Style Factor Modeling With Rapidity And Robustness Via Speech Decomposition For Expressive And Controllable Neural Text To Speech (2021) · 9.23
- Generalized Multilingual Text-to-speech Generation With Language-aware Style Adaptation (2025) · 0.00
- StyleTTS 2: Towards Human-level Text-to-speech Through Style Diffusion And Adversarial Training With Large Speech Language Models (2023) · 8.09
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024) · 7.81
- AutoStyle-TTS: Retrieval-augmented Generation Based Automatic Style Matching Text-to-speech Synthesis (2025) · 4.52