Teaching BERT To Wait: Balancing Accuracy And Latency For Streaming Disfluency Detection
2022 · Angelica Chen, Vicky Zayats, Daniel D. Walker, et al.
Abstract
In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. This post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation). However, most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays. We propose a streaming BERT-based sequence tagging model that, combined with a novel training objective, is capable of detecting disfluencies in real-time while balancing accuracy and latency. This is accomplished by training the model to decide whether to immediately output a prediction for the current input or to wait for further context. Essentially, the model learns to dynamically size its lookahead window. Our results demonstrate that our model produces comparably accurate predictions and does so sooner than our baselines, with lower flicker. Furthermore, the model attains stat
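The wait-or-predict idea described in the abstract can be illustrated with a small sketch. This is not the authors' model or training objective: `toy_tagger`, the `AMBIGUOUS` word set, the `max_lookahead` cap, and the fallback to a fluent label when waiting is no longer possible are all assumptions made for illustration. A rule-based function stands in for the BERT tagger, and the loop shows only how a per-token "wait" label turns into a lookahead window whose size varies from token to token.

```python
from typing import Callable, List, Sequence, Tuple

FLUENT, DISFLUENT, WAIT = "F", "D", "W"

# Hypothetical set of words that are often repeated in disfluent speech.
AMBIGUOUS = {"i", "the", "a"}


def toy_tagger(prefix: Sequence[str]) -> List[str]:
    """Stand-in for a learned tagger: one label per token of the prefix.

    A token is DISFLUENT if the next token repeats it. The newest token
    has no right context yet, so the tagger asks to WAIT when that token
    is in AMBIGUOUS and commits to FLUENT otherwise.
    """
    labels = []
    for i, tok in enumerate(prefix):
        if i + 1 < len(prefix):
            labels.append(DISFLUENT if prefix[i + 1] == tok else FLUENT)
        else:
            labels.append(WAIT if tok in AMBIGUOUS else FLUENT)
    return labels


def stream_tags(
    tokens: Sequence[str],
    tagger: Callable[[Sequence[str]], List[str]],
    max_lookahead: int = 2,
) -> List[Tuple[int, str, int]]:
    """Return (token_index, label, lookahead_used) in commit order."""
    emitted: List[Tuple[int, str, int]] = []
    for seen in range(1, len(tokens) + 1):
        labels = tagger(tokens[:seen])
        is_final = seen == len(tokens)
        # Commit pending tokens left to right until one asks to wait.
        while len(emitted) < seen:
            i = len(emitted)
            label = labels[i]
            lookahead_used = seen - 1 - i
            if label == WAIT and lookahead_used < max_lookahead and not is_final:
                break  # defer this token until more context arrives
            if label == WAIT:
                label = FLUENT  # assumed fallback when waiting is not allowed
            emitted.append((i, label, lookahead_used))
    return emitted


result = stream_tags(["i", "i", "want", "a", "a", "flight"], toy_tagger)
for index, label, lookahead in result:
    print(index, label, lookahead)
```

In this toy run, "want" and "flight" are labeled as soon as they arrive, while the ambiguous tokens are held back for one further token. Committed labels are never revised here, so the sketch sidesteps flicker entirely, which is a simplification relative to the streaming setting the abstract describes.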
Related papers
- Streaming Joint Speech Recognition And Disfluency Detection (2022) 0.00
- Dynamic Latency For CTC-based Streaming Automatic Speech Recognition With Emformer (2022) 0.00
- Adapting Offline Speech Translation Models For Streaming With Future-aware Distillation And Inference (2023) 3.58
- High Performance Sequence-to-sequence Model For Streaming Speech Recognition (2020) 3.58
- Minimum Latency Training Strategies For Streaming Sequence-to-sequence ASR (2020) 10.07
- Minimum Latency Training Of Sequence Transducers For Streaming End-to-end Speech Recognition (2022) 0.00
- Controllable Time-delay Transformer For Real-time Punctuation Prediction And Disfluency Detection (2020) 10.48
- Mask-CTC-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023) 0.00