Smart Speech Segmentation Using Acousto-linguistic Features With Look-ahead
2022 · Piyush Behre, Naveen Parihar, Sharman Tan, et al.
Abstract
Segmentation for continuous Automatic Speech Recognition (ASR) has traditionally used silence timeouts or voice activity detectors (VADs), which are both limited to acoustic features. This segmentation is often overly aggressive, given that people naturally pause to think as they speak. Consequently, segmentation happens mid-sentence, hindering both punctuation and downstream tasks like machine translation for which high-quality segmentation is critical. Model-based segmentation methods that leverage acoustic features are powerful, but without an understanding of the language itself, these approaches are limited. We present a hybrid approach that leverages both acoustic and language information to improve segmentation. Furthermore, we show that including one word as a look-ahead boosts segmentation quality. On average, our models improve segmentation-F0.5 score by 9.8% over baseline. We show that this approach works for multiple languages. For the downstream task of machine translation
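The F0.5 metric named in the abstract weights precision more heavily than recall, so it penalizes spurious mid-sentence breaks more than missed ones. The abstract does not give the paper's exact scoring procedure, so the following is only a minimal sketch of the standard F-beta formula. It assumes that segment boundaries are represented as sets of word indices and that a predicted boundary counts as correct only on an exact match. The function names `f_beta` and `boundary_scores` are hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta: weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily than recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def boundary_scores(reference: set[int], predicted: set[int], beta: float = 0.5):
    """Score predicted segment boundaries against reference boundaries.

    Assumption (not from the paper): boundaries are word indices and a
    prediction is correct only if it matches a reference index exactly.
    """
    true_pos = len(reference & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall, f_beta(precision, recall, beta)


# An over-eager segmenter: finds every true boundary but adds two extra ones.
reference = {5, 12, 20}
predicted = {5, 9, 12, 16, 20}
precision, recall, f_half = boundary_scores(reference, predicted, beta=0.5)
_, _, f_one = boundary_scores(reference, predicted, beta=1.0)
print(f"precision={precision:.2f} recall={recall:.2f} F0.5={f_half:.3f} F1={f_one:.3f}")
```

In this toy case recall is perfect but precision is 0.6, and F0.5 comes out lower than F1, which is the behavior one would want from a metric meant to discourage overly aggressive segmentation.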
Related papers
- Speech Segmentation Optimization Using Segmented Bilingual Speech Corpus For End-to-end Speech Translation (2022)
- Beyond Voice Activity Detection: Hybrid Audio Segmentation For Direct Speech Translation (2021)
- Don't Discard Fixed-window Audio Segmentation In Speech-to-text Translation (2022)
- Unsupervised Speech Segmentation: A General Approach Using Speech Language Models (2025)
- Speech Decomposition Based On A Hybrid Speech Model And Optimal Segmentation (2021)
- Subtitles To Segmentation: Improving Low-resource Speech-to-text Translation Pipelines (2020)
- Reading Between The Waves: Robust Topic Segmentation Using Inter-sentence Audio Features (2026)
- Long-form Speech Translation Through Segmentation With Finite-state Decoding Constraints On Large Language Models (2023)