End-to-end Audiovisual Speech Activity Detection With Bimodal Recurrent Neural Models
2018 · Fei Tao, Carlos Busso
Abstract
Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, which increases the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whispered speech) and to background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture the temporal relationships between acoustic and visual features in a principled way. This study explores this idea by proposing a *bimodal recurrent neural network* (BRNN) framework for SAD. The approach models the temporal dynamics of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visu
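The idea of a bimodal recurrent model for SAD can be illustrated with a small sketch. The abstract as shown here does not specify the layer types, sizes, or fusion scheme, so everything below is an assumption made for illustration: one plain (Elman-style) recurrent branch per modality, frame-wise concatenation of the two hidden states, a third recurrent layer over the fused sequence, and a sigmoid that yields a per-frame speech probability. The helper names (`make_rnn`, `run_rnn`, `bimodal_sad`) are hypothetical, the weights are random rather than trained, and the inputs are random stand-ins for time-aligned acoustic and visual frames.

```python
import math
import random


def make_rnn(input_dim, hidden_dim, rng):
    """Create small random weights for a simple Elman RNN layer (hypothetical helper)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W_in": mat(hidden_dim, input_dim),
            "W_rec": mat(hidden_dim, hidden_dim),
            "b": [0.0] * hidden_dim}


def run_rnn(layer, sequence):
    """Run the layer over a sequence of frames; return one hidden state per frame."""
    hidden = [0.0] * len(layer["b"])
    outputs = []
    for frame in sequence:
        new_hidden = []
        for i in range(len(hidden)):
            z = layer["b"][i]
            z += sum(w * x for w, x in zip(layer["W_in"][i], frame))
            z += sum(w * h for w, h in zip(layer["W_rec"][i], hidden))
            new_hidden.append(math.tanh(z))
        hidden = new_hidden
        outputs.append(hidden)
    return outputs


def bimodal_sad(audio_seq, video_seq, audio_rnn, video_rnn, fusion_rnn, out_w, out_b):
    """Map time-aligned audio and video frames to per-frame speech probabilities."""
    audio_h = run_rnn(audio_rnn, audio_seq)   # temporal model of the acoustic stream
    video_h = run_rnn(video_rnn, video_seq)   # temporal model of the visual stream
    # Assumed fusion: concatenate the two hidden states frame by frame.
    fused_in = [a + v for a, v in zip(audio_h, video_h)]
    fused_h = run_rnn(fusion_rnn, fused_in)   # temporal model across modalities
    probs = []
    for h in fused_h:
        z = out_b + sum(w * x for w, x in zip(out_w, h))
        probs.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid: speech vs. non-speech
    return probs


# Illustrative dimensions only; none of these values come from the paper.
rng = random.Random(0)
AUDIO_DIM, VIDEO_DIM, HIDDEN, T = 5, 4, 3, 6
audio_rnn = make_rnn(AUDIO_DIM, HIDDEN, rng)
video_rnn = make_rnn(VIDEO_DIM, HIDDEN, rng)
fusion_rnn = make_rnn(2 * HIDDEN, HIDDEN, rng)
out_w = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
out_b = 0.0

audio_seq = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(T)]
video_seq = [[rng.random() for _ in range(VIDEO_DIM)] for _ in range(T)]
probs = bimodal_sad(audio_seq, video_seq, audio_rnn, video_rnn, fusion_rnn, out_w, out_b)
```

Thresholding `probs` (for example at 0.5) would give a frame-level speech/non-speech decision. In an end-to-end setting the branch inputs would be learned from the raw signals and all weights trained jointly; this sketch only shows the data flow.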
Related papers
- Temporarily-aware Context Modelling Using Generative Adversarial Networks For Speech Activity Detection (2020)
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)
- Multimodal Grounding For Sequence-to-sequence Speech Recognition (2018)
- Audio Visual Speech Recognition Using Deep Recurrent Neural Networks (2016)
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017)
- Speech Activity Detection Based On Multilingual Speech Recognition System (2020)
- Audio-visual Approach For Multimodal Concurrent Speaker Detection (2024)
- Speech Enhancement Aided End-to-end Multi-task Learning For Voice Activity Detection (2020)