FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
2024 · Keyu An, Qian Chen, Chong Deng, et al.
Abstract
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on ModelScope and Hugging Face, along with the corresponding training, inference, and fine-tuning code released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applicati
Related papers
- Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play (2025)
- Recent Advances in Speech Language Models: A Survey (2024)
- AudioLM: A Language Modeling Approach to Audio Generation (2022)
- UniAudio 1.5: Large Language Model-Driven Audio Codec Is a Few-Shot Audio Task Learner (2024)
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023)
- Get Large Language Models Ready to Speak: A Late-Fusion Approach for Speech Generation (2024)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners (2024)
- AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs (2023)