Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark For Measuring The Capabilities Of Spoken Language Models With 180 Tasks
2024 · Chien-Yu Huang, Wei-Chih Chen, Shu-Wen Yang, et al.
Abstract
Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introdu
Related papers
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, And Comprehensive Instruction-tuning Benchmark For Speech (2023) 0.00
- SUPERB-SG: Enhanced Speech Processing Universal Performance Benchmark For Semantic And Generative Capabilities (2022) 13.34
- VocalBench: Benchmarking The Vocal Conversational Abilities For Speech Interaction Models (2025) 0.00
- Findings Of The 2023 ML-SUPERB Challenge: Pre-training And Evaluation Over More Languages And Beyond (2023) 0.00
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, And Datasets (2024) 4.52
- ML-SUPERB: Multilingual Speech Universal Performance Benchmark (2023) 12.47
- Roadmap Towards Superhuman Speech Understanding Using Large Language Models (2024) 0.00
- MMSU: A Massive Multi-task Spoken Language Understanding And Reasoning Benchmark (2025) 2.29