MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition
2024 · He Wang, Pengcheng Guo, Pan Zhou, et al.
Abstract
Automatic speech recognition (ASR) systems degrade significantly in noisy environments. Audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and thereby improve robustness. However, current studies mainly focus on fusing well-learned modality features, such as the outputs of modality-specific encoders, without considering contextual relationships during modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of the audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of the proposed system, which achieves a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yields up to 3.17% relative improvement over our previous system, which ranked second in the challenge.
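The cpCER metric quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual concatenated minimum-permutation definition: each speaker's utterances are concatenated, every assignment of hypothesis streams to reference speakers is scored by character edit distance, and the lowest total is divided by the number of reference characters. The function names `edit_distance` and `cp_cer` are hypothetical, and this is not the challenge's official scoring tool.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cp_cer(ref_by_speaker, hyp_by_speaker):
    """Assumed cpCER: inputs are lists (one per speaker) of utterance strings."""
    # Concatenate all utterances of each speaker into one stream.
    refs = ["".join(utts) for utts in ref_by_speaker]
    hyps = ["".join(utts) for utts in hyp_by_speaker]

    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))

    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference contains no characters")

    # Try every speaker permutation and keep the lowest total error count.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars


if __name__ == "__main__":
    refs = [["ab", "cd"], ["xyz"]]
    hyps = [["xyw"], ["abcd"]]  # speakers listed in swapped order
    print(f"cpCER = {cp_cer(refs, hyps):.4f}")
```

The exhaustive search over permutations grows factorially with the number of speakers, which is tolerable for small meetings; a larger speaker count would call for an assignment solver instead.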
Related papers
- Cross-modal Global Interaction And Local Alignment For Audio-visual Speech Recognition (2023) · 7.50
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024) · 8.60
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) · 6.34
- Robust End-to-end Deep Audiovisual Speech Recognition (2016) · 0.00
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018) · 16.67
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022) · 6.77
- AlignVSR: Audio-visual Cross-modal Alignment For Visual Speech Recognition (2024) · 0.00
- Visual Context-driven Audio Feature Enhancement For Robust End-to-end Audio-visual Speech Recognition (2022) · 10.07