Freeze-Omni: A Smart And Low Latency Speech-to-speech Dialogue Model With Frozen LLM
2024 · Xiong Wang, Yangze Li, Chaoyou Fu, et al.
Abstract
Rapidly developing large language models (LLMs) have enabled a wide range of intelligent applications. In particular, GPT-4o's excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multimodal LLMs in this direction that can achieve user-agent speech-to-speech conversations. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We design a three-stage training strategy for modeling both the speech input and output, enabling Freeze-Omni to obtain speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the
Related papers
- Llama-omni: Seamless Speech Interaction With Large Language Models (2024)
- Frozen Large Language Models Can Perceive Paralinguistic Aspects Of Speech (2024)
- Advancing Speech Language Models By Scaling Supervised Fine-tuning With Over 60,000 Hours Of Synthetic Speech Dialogue Data (2024)
- Llama-omni2: LLM-based Real-time Spoken Chatbot With Autoregressive Streaming Speech Synthesis (2025)
- Salmonn-omni: A Codec-free LLM For Full-duplex Speech Understanding And Generation (2024)
- M2-omni: Advancing Omni-MLLM For Comprehensive Modality Support With Competitive Performance (2025)
- Omni-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models (2025)
- PSLM: Parallel Generation Of Text And Speech With LLMs For Low-latency Spoken Dialogue Systems (2024)