GLM-4-Voice: Towards Intelligent And Human-like End-to-end Spoken Chatbot
2024 · Aohan Zeng, Zhengxiao Du, Mingdao Liu, et al.
Abstract
We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175 bps), single-codebook speech tokenizer with a 12.5 Hz frame rate, derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from the text modality to the speech modality, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering.
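The tokenizer figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is not taken from the paper's code. It assumes each frame emits exactly one token from the single codebook and that token indices are stored with a fixed-length code, so that bitrate = frame rate × log2(codebook size). The function names are illustrative only.

```python
import math


def bits_per_token(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Bits carried by each token, assuming one token per frame."""
    return bitrate_bps / frame_rate_hz


def implied_codebook_size(bitrate_bps: float, frame_rate_hz: float) -> int:
    """Codebook size implied by a fixed-length index code (assumption)."""
    return round(2 ** bits_per_token(bitrate_bps, frame_rate_hz))


def tokens_for_duration(seconds: float, frame_rate_hz: float) -> int:
    """Number of speech tokens needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    # Figures from the abstract: 175 bps at a 12.5 Hz frame rate.
    print(bits_per_token(175, 12.5))         # 14.0 bits per token
    print(implied_codebook_size(175, 12.5))  # 16384 entries
    print(tokens_for_duration(10.0, 12.5))   # 125 tokens for 10 s of speech
```

Under these assumptions, 175 bps at 12.5 Hz works out to 14 bits per token, which corresponds to a codebook of 2^14 = 16,384 entries, and ten seconds of speech occupies only 125 tokens.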
Related papers
- Advancing Speech Language Models By Scaling Supervised Fine-tuning With Over 60,000 Hours Of Synthetic Speech Dialogue Data (2024)
- LLaMA-Omni2: LLM-based Real-time Spoken Chatbot With Autoregressive Streaming Speech Synthesis (2025)
- Voila: Voice-language Foundation Models For Real-time Autonomous Interaction And Voice Role-play (2025)
- AudioChatLlama: Towards General-purpose Speech Abilities For LLMs (2023)
- VCB Bench: An Evaluation Benchmark For Audio-grounded Large Language Model Conversational Agents (2025)
- VocalBench: Benchmarking The Vocal Conversational Abilities For Speech Interaction Models (2025)
- Style-Talker: Finetuning Audio Language Model And Style-based Text-to-speech Model For Fast Spoken Dialogue Generation (2024)
- Spoken Conversational Agents With Large Language Models (2025)