X-vector Based Voice Activity Detection For Multi-genre Broadcast Speech-to-text
2021 · Misa Ogura, Matt Haynes
Abstract
Voice Activity Detection (VAD) is a fundamental preprocessing step in automatic speech recognition. This is especially true within the broadcast industry, where a wide variety of audio materials and recording conditions are encountered. Based on previous studies which indicate that x-vector embeddings can be applied to a diverse set of audio classification tasks, we investigate the suitability of x-vectors for discriminating speech from noise. We find that the proposed x-vector based VAD system achieves the best reported score in detecting clean speech on AVA-Speech, whilst retaining robust VAD performance in the presence of noise and music. Furthermore, we integrate the x-vector based VAD system into an existing speech-to-text (STT) pipeline and compare its performance on multiple broadcast datasets against a baseline system with WebRTC VAD. Crucially, our proposed x-vector based VAD improves the accuracy of STT transcription on real-world broadcast audio.