SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
2024 · Wenyi Yu, Siyin Wang, Xiaoyu Yang, et al.
Abstract
Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a "thinking" mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (qu
Related papers
- LLaMA-Omni: Seamless Speech Interaction with Large Language Models (2024)
- LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis (2025)
- Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM (2024)
- AudioChatLlama: Towards General-purpose Speech Abilities for LLMs (2023)
- FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs (2024)
- AudioLM: A Language Modeling Approach to Audio Generation (2022)
- Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing (2026)
- Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models (2025)