LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot With Autoregressive Streaming Speech Synthesis
2025 · Qingkai Fang, Yan Zhou, Shoutao Guo, et al.
Abstract
Real-time, intelligent, and natural speech interaction is an essential part of next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.
Related papers
- LLaMA-Omni: Seamless Speech Interaction With Large Language Models (2024) 0.00
- AudioChatLlama: Towards General-purpose Speech Abilities For LLMs (2023) 9.41
- SALMONN-omni: A Codec-free LLM For Full-duplex Speech Understanding And Generation (2024) 0.00
- Freeze-Omni: A Smart And Low Latency Speech-to-speech Dialogue Model With Frozen LLM (2024) 0.00
- Advancing Speech Language Models By Scaling Supervised Fine-tuning With Over 60,000 Hours Of Synthetic Speech Dialogue Data (2024) 0.00
- GLM-4-Voice: Towards Intelligent And Human-like End-to-end Spoken Chatbot (2024) 7.00
- Recent Advances In Speech Language Models: A Survey (2024) 14.64
- PSLM: Parallel Generation Of Text And Speech With LLMs For Low-latency Spoken Dialogue Systems (2024) 2.26