An Empirical Analysis Of Deep Audio-visual Models For Speech Recognition
2018 · Devesh Walawalkar, Yihui He, Rohit Pillai
Abstract
In this project, we address speech recognition, specifically predicting individual words from both video frames and audio. Powered by convolutional neural networks, recent speech recognition and lip-reading models achieve performance comparable to that of humans. We re-implemented the state-of-the-art model and derived variants of it. We then conducted extensive experiments covering the effectiveness of an attention mechanism, the use of a more accurate residual network with pre-trained weights as the backbone, and the sensitivity of our model to audio input with and without noise.
Related papers
- Robust End-to-end Deep Audiovisual Speech Recognition (2016) 0.00
- Evaluating Raw Waveforms With Deep Learning Frameworks For Speech Emotion Recognition (2023) 0.00
- An Overview Of Deep-learning-based Audio-visual Speech Enhancement And Separation (2020) 18.31
- LA-VoCE: Low-SNR Audio-visual Speech Enhancement Using Neural Vocoders (2022) 0.00
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) 6.34
- Multi-grained Spatio-temporal Modeling For Lip-reading (2019) 0.00
- Analyzing Hidden Representations In End-to-end Automatic Speech Recognition Systems (2017) 0.00
- An Empirical Study Of Visual Features For DNN Based Audio-visual Speech Enhancement In Multi-talker Environments (2020) 3.58