Audio Visual Speech Recognition Using Deep Recurrent Neural Networks
2016 · Abhinav Thanda, Shankar M Venkatesan
Abstract
In this work, we propose a training algorithm for an audio-visual automatic speech recognition (AV-ASR) system using deep recurrent neural networks (RNNs). First, we train a deep RNN acoustic model with a Connectionist Temporal Classification (CTC) objective function. The frame labels obtained from the acoustic model are then used to perform a non-linear dimensionality reduction of the visual features using a deep bottleneck network. The audio and visual features are then fused and used to train a fusion RNN. Using bottleneck features for the visual modality helps the model converge properly during training. Our system is evaluated on the GRID corpus. Our results show that the presence of the visual modality gives a significant improvement in character error rate (CER) at various noise levels, even when the model is trained without noisy data. We also compare two fusion methods: feature fusion and decision fusion.
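The abstract names two fusion methods but does not describe how they work. As a rough illustration only, the sketch below contrasts the general idea of each: feature fusion joins the two streams frame by frame before a single model sees them, while decision fusion combines the outputs of separate audio and visual models. The function names, the feature dimensions (40 audio, 20 visual bottleneck), the number of output labels, and the `audio_weight` parameter are all assumptions made for this example. The log-linear weighting is one common form of decision fusion and is not necessarily the scheme the authors use.

```python
import numpy as np


def log_softmax(scores):
    """Convert per-frame scores to log-probabilities (numerically stable)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def feature_fusion(audio_feats, visual_feats):
    """Concatenate per-frame audio and visual features along the feature axis.

    Both inputs have shape (num_frames, feature_dim) and are assumed to be
    aligned to the same frame rate already.
    """
    if audio_feats.shape[0] != visual_feats.shape[0]:
        raise ValueError("audio and visual streams must have the same number of frames")
    return np.concatenate([audio_feats, visual_feats], axis=1)


def decision_fusion(audio_log_probs, visual_log_probs, audio_weight=0.7):
    """Combine per-frame log-posteriors from two separate models.

    Assumption: a log-linear (weighted sum of log-probabilities) combination,
    renormalized so each frame is again a valid distribution over labels.
    """
    combined = audio_weight * audio_log_probs + (1.0 - audio_weight) * visual_log_probs
    return log_softmax(combined)


# Hypothetical shapes: 5 frames, 40-dim audio, 20-dim visual, 28 output labels.
rng = np.random.default_rng(0)
audio = rng.normal(size=(5, 40))
visual = rng.normal(size=(5, 20))
fused_features = feature_fusion(audio, visual)  # input to a single fusion model

audio_post = log_softmax(rng.normal(size=(5, 28)))   # stand-in for audio model output
visual_post = log_softmax(rng.normal(size=(5, 28)))  # stand-in for visual model output
fused_post = decision_fusion(audio_post, visual_post)
```

In this sketch, setting `audio_weight` to 1.0 reduces decision fusion to the audio-only posteriors, which is one way to sanity-check such a combiner before tuning the weight for noisy conditions.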
Related papers
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)
- Recurrent Neural Network Transducer For Audio-visual Speech Recognition (2019)
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024)
- Audio-visual Speech Separation In Noisy Environments With A Lightweight Iterative Model (2023)
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018)
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017)
- How To Teach DNNs To Pay Attention To The Visual Modality In Speech Recognition (2020)
- Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021)