SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents
2023 · Shuzheng Si, Wentao Ma, Haoyu Gao, et al.
Abstract
Task-oriented dialogue (TOD) models have made significant progress in recent years. However, previous studies primarily focus on datasets written by annotators, which has resulted in a gap between academic research and real-world spoken conversation scenarios. While several small-scale spoken TOD datasets have been proposed to address robustness issues such as ASR errors, they ignore the unique challenges of spoken conversation. To tackle these limitations, we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, containing 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio from human-to-human spoken conversations. SpokenWOZ further incorporates common spoken characteristics such as word-by-word processing and reasoning in spoken language. Based on these characteristics, we present cross-turn slot and reasoning slot detection as new challenges. We conduct experiments on various baselines, including text-modal models, newly proposed dual-modal models, and LLMs, e.g.,
Related papers
- SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words (2024)
- DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech (2022)
- Speculative End-Turn Detector for Efficient Speech Chatbot Assistant (2025)
- The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach (2025)
- VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models (2025)
- Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling (2024)
- SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents (2025)
- WenetSpeech4TTS: A 12,800-Hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark (2024)