Visual Context-driven Audio Feature Enhancement For Robust End-to-end Audio-visual Speech Recognition
2022 · Joanna Hong, Minsu Kim, Daehun Yoo, et al.
Abstract
This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech Recognition (AVSR) system. To this end, we propose a Visual Context-driven Audio Feature Enhancement module (V-CAFE) that enhances the noisy input audio with the help of audio-visual correspondence. The proposed V-CAFE is designed to capture the transition of lip movements, namely the visual context, and to generate a noise reduction mask by considering the obtained visual context. Through context-dependent modeling, the ambiguity in viseme-to-phoneme mapping can be refined for mask generation. The noisy representations are masked with the noise reduction mask, resulting in enhanced audio features. The enhanced audio features are fused with the visual features and fed to an encoder-decoder model composed of a Conformer and a Transformer for speech recognition. We show that the proposed end-to-end AVSR with V-CAFE can further improve the noise-robustness of AVSR. The effectiveness of the proposed method is evaluate
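The data flow described in the abstract (visual context, then a noise reduction mask, then masked audio features fused with visual features) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the moving-average `visual_context`, the scalar weights `w_audio` and `w_context`, the element-wise sigmoid mask, and fusion by concatenation are hypothetical stand-ins for the learned components, and the Conformer/Transformer recognizer is omitted entirely.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def visual_context(visual, window=3):
    """Average each frame with its neighbours.

    Hypothetical stand-in for a learned temporal model of lip movement.
    """
    half = window // 2
    num_frames = len(visual)
    context = []
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        frames = visual[lo:hi]
        context.append([sum(col) / len(frames) for col in zip(*frames)])
    return context


def noise_reduction_mask(audio, context, w_audio, w_context):
    """Element-wise mask in (0, 1) computed from audio and visual context.

    The scalar weights are assumptions; a real model would learn projections.
    """
    return [
        [sigmoid(w_audio * a + w_context * c) for a, c in zip(a_row, c_row)]
        for a_row, c_row in zip(audio, context)
    ]


def enhance_and_fuse(audio, visual, w_audio=1.0, w_context=1.0):
    context = visual_context(visual)
    mask = noise_reduction_mask(audio, context, w_audio, w_context)
    # Apply the mask to the noisy audio features.
    enhanced = [
        [a * m for a, m in zip(a_row, m_row)]
        for a_row, m_row in zip(audio, mask)
    ]
    # Fuse by concatenating along the feature axis (an assumption).
    fused = [e_row + v_row for e_row, v_row in zip(enhanced, visual)]
    return mask, enhanced, fused


# Toy inputs: 5 frames, 4 features per modality.
random.seed(0)
T, D = 5, 4
audio = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
visual = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
mask, enhanced, fused = enhance_and_fuse(audio, visual)
```

In this sketch `fused` has one row per frame and `2 * D` features per row; in a full system those fused features would be passed on to the recognition encoder-decoder.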
Related papers
- Robust End-to-end Deep Audiovisual Speech Recognition (2016) 0.00
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024) 10.07
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024) 8.60
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022) 6.77
- Audio-visual Speech Enhancement Using Conditional Variational Auto-encoders (2019) 13.65
- End-to-end Multi-talker Audio-visual ASR Using An Active Speaker Attention Module (2022) 0.00
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) 6.34
- Robust Unsupervised Audio-visual Speech Enhancement Using A Mixture Of Variational Autoencoders (2019) 9.23