End-to-end Automatic Speech Recognition Integrated With CTC-based Voice Activity Detection
2020 · Takenori Yoshimura, Tomoki Hayashi, Kazuya Takeda, et al.
Abstract
This paper integrates a voice activity detection (VAD) function into end-to-end automatic speech recognition, with two targets: online speech interfaces and transcription of very long audio recordings. We focus on connectionist temporal classification (CTC) and its extension, the CTC/attention architecture. Unlike an attention-based architecture, CTC allows input-synchronous label prediction through a greedy search over the CTC (pre-)softmax output. This prediction includes long runs of consecutive blank labels, which can be regarded as non-speech regions. We use these labels as a cue for detecting speech segments with simple thresholding. The threshold value is directly related to the length of a non-speech region, which makes it more intuitive and easier to control than conventional VAD hyperparameters. Experimental results on unsegmented data show that the proposed method outperformed baseline methods using conventional energy-based and neural-network-based VAD and achieved an RTF l
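The thresholding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the blank index `0`, and the parameter `min_blank_run` (the number of consecutive blank frames treated as a non-speech region) are assumptions made for illustration, and any margins or post-processing the paper may apply are omitted.

```python
from typing import List, Sequence, Tuple


def greedy_labels(frame_scores: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring label for each frame (greedy search).

    `frame_scores` holds one score vector per input frame; pre-softmax
    scores are enough because argmax is unchanged by softmax.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in frame_scores]


def detect_speech_segments(
    labels: Sequence[int],
    blank_id: int = 0,        # assumed blank index (hypothetical)
    min_blank_run: int = 3,   # assumed threshold, in frames (hypothetical)
) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges separated by long blank runs.

    A segment opens at the first non-blank frame and closes once
    `min_blank_run` consecutive blank frames have been seen.
    """
    segments: List[Tuple[int, int]] = []
    start = None          # first frame of the open segment, if any
    last_non_blank = -1   # most recent non-blank frame
    blank_run = 0         # length of the current run of blanks
    for t, label in enumerate(labels):
        if label != blank_id:
            if start is None:
                start = t
            last_non_blank = t
            blank_run = 0
        else:
            blank_run += 1
            if start is not None and blank_run >= min_blank_run:
                segments.append((start, last_non_blank + 1))
                start = None
    if start is not None:  # close a segment still open at the end
        segments.append((start, last_non_blank + 1))
    return segments


if __name__ == "__main__":
    toy_labels = [0, 0, 5, 5, 0, 7, 0, 0, 0, 0, 3, 0, 0]
    print(detect_speech_segments(toy_labels, min_blank_run=3))
```

In this sketch the threshold is expressed in frames, so raising `min_blank_run` merges speech separated by short pauses into one segment, while lowering it splits the audio more aggressively. That is the sense in which the threshold maps directly onto the length of a non-speech region.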
Related papers
- Incorporating VAD Into ASR System By Multi-task Learning (2021) 4.52
- Semantic VAD: Low-latency Voice Activity Detection For Speech Interaction (2023) 6.34
- Speech Enhancement Aided End-to-end Multi-task Learning For Voice Activity Detection (2020) 11.49
- Advancing VAD Systems Based On Multi-task Learning With Improved Model Structures (2023) 0.00
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023) 10.97
- Transformer-based Online CTC/attention End-to-end Speech Recognition Architecture (2020) 14.06
- Robust End-to-end Deep Audiovisual Speech Recognition (2016) 0.00
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016) 20.43