VocalBench: Benchmarking The Vocal Conversational Abilities For Speech Interaction Models
2025 · Heyang Liu, Yuhao Wang, Ziyang Cheng, et al.
Abstract
Speech large language models (SpeechLLMs) have extended human-machine interactions from the text modality to the dynamic speech domain. Spoken dialogues convey diverse information, including semantic concepts, acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models lack instances mimicking real scenarios and predominantly focus on the performance of distinct aspects, lacking a comprehensive comparison of critical capabilities between current routines. To address this gap, we propose VocalBench to assess speech conversational abilities, comprising around 24k carefully curated instances in both English and Mandarin across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness, covering 14 user-oriented characters. Experiments on 27 mainstream models reveal the common challenges for current routes, and highlight the need for new insights into next-generation speech int
Related papers
- VCB Bench: An Evaluation Benchmark For Audio-grounded Large Language Model Conversational Agents (2025)
- AudioBench: A Universal Benchmark For Audio Large Language Models (2024)
- VoiceAgentBench: Are Voice Assistants Ready For Agentic Tasks? (2025)
- SD-Eval: A Benchmark Dataset For Spoken Dialogue Understanding Beyond Words (2024)
- Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark For Measuring The Capabilities Of Spoken Language Models With 180 Tasks (2024)
- SpeechLLM-as-Judges: Towards General And Interpretable Speech Quality Evaluation (2025)
- LAraBench: Benchmarking Arabic AI With Large Language Models (2023)
- ParaS2S: Benchmarking And Aligning Spoken Language Models For Paralinguistic-aware Speech-to-speech Interaction (2025)