VCB Bench: An Evaluation Benchmark For Audio-grounded Large Language Model Conversational Agents
2025 · Jiliang Hu, Wenfu Wang, Zuchao Li, et al.
Abstract
Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited -- they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) -- a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for adv