Speculative End-turn Detector For Efficient Speech Chatbot Assistant
2025 · Hyunjong Ok, Suho Yoo, Jaeho Lee
Abstract
Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD) -- the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD Dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects non-speaking units in real time on local devices, and a high-performance Wav2vec-b
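The collaborative setup described above can be pictured as a two-stage cascade. The following is a minimal sketch, assuming that the lightweight model acts as a per-frame gate and that the larger model is consulted only when a non-speaking unit is flagged; that escalation rule is an assumption made for illustration. The names `speculative_etd`, `light_is_silence`, `heavy_is_end_of_turn`, and `context_frames` are hypothetical, and the two toy functions are simple stand-ins, not the GRU-based or Wav2vec-based models.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List

Frame = List[float]  # one short chunk of audio samples


@dataclass
class Decision:
    frame_index: int
    escalated: bool   # True if the heavy model was consulted for this frame
    turn_ended: bool  # True if the heavy model judged the turn complete


def speculative_etd(
    frames: Iterable[Frame],
    light_is_silence: Callable[[Frame], bool],
    heavy_is_end_of_turn: Callable[[List[Frame]], bool],
    context_frames: int = 50,
) -> Iterator[Decision]:
    """Run the cheap gate on every frame; escalate only on flagged silence."""
    history: deque = deque(maxlen=context_frames)  # recent audio context
    for i, frame in enumerate(frames):
        history.append(frame)
        if not light_is_silence(frame):
            # Speech detected: no need to pay for the expensive model.
            yield Decision(i, escalated=False, turn_ended=False)
            continue
        # Assumption: the heavy model decides hesitation vs. end of turn.
        ended = heavy_is_end_of_turn(list(history))
        yield Decision(i, escalated=True, turn_ended=ended)
        if ended:
            return  # stop listening once the turn is judged complete


def toy_light(frame: Frame) -> bool:
    """Hypothetical stand-in for the on-device model: an amplitude threshold."""
    return max((abs(s) for s in frame), default=0.0) < 0.01


def toy_heavy(context: List[Frame]) -> bool:
    """Hypothetical stand-in for the large model: three silent frames in a row."""
    tail = context[-3:]
    return len(tail) == 3 and all(toy_light(f) for f in tail)


if __name__ == "__main__":
    speech, silent = [0.5, -0.4], [0.0, 0.001]
    stream = [speech, silent, speech, silent, silent, silent, speech]
    for decision in speculative_etd(stream, toy_light, toy_heavy):
        print(decision)
```

In this arrangement the expensive call happens only on frames that the cheap gate flags, which is where the efficiency gain of such a cascade would come from; how often escalation occurs depends on how conservative the gate is.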
Related papers
- SD-Eval: A Benchmark Dataset For Spoken Dialogue Understanding Beyond Words (2024)
- SpokenWOZ: A Large-scale Speech-text Benchmark For Spoken Task-oriented Dialogue Agents (2023)
- Exploring The Viability Of Synthetic Audio Data For Audio-based Dialogue State Tracking (2023)
- Attentive Contextual Carryover For Multi-turn End-to-end Spoken Language Understanding (2021)
- HMM-based Data Augmentation For E2E Systems For Building Conversational Speech Synthesis Systems (2022)
- Modality Confidence Aware Training For Robust End-to-end Spoken Language Understanding (2023)
- E-chat: Emotion-sensitive Spoken Dialogue System With Large Language Models (2023)
- Using Speech Synthesis To Train End-to-end Spoken Language Understanding Models (2019)