Voila: Voice-language Foundation Models For Real-time Autonomous Interaction And Voice Role-play
2025 · Yemin Shi, Yu Shu, Siwei Dong, et al.
Abstract
A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the spea
Related papers
- FunAudioLLM: Voice Understanding And Generation Foundation Models For Natural Interaction Between Humans And LLMs (2024)
- IntrinsicVoice: Empowering LLMs With Intrinsic Real-time Voice Interaction Abilities (2024)
- GLM-4-Voice: Towards Intelligent And Human-like End-to-end Spoken Chatbot (2024)
- Spoken Conversational Agents With Large Language Models (2025)
- LLaMA-Omni2: LLM-based Real-time Spoken Chatbot With Autoregressive Streaming Speech Synthesis (2025)
- VocalBench: Benchmarking The Vocal Conversational Abilities For Speech Interaction Models (2025)
- VoiceLoop: Voice Fitting And Synthesis Via A Phonological Loop (2017)
- VoiceCraft-Dub: Automated Video Dubbing With Neural Codec Language Models (2025)