DCIM-AVSR: Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module
2024 · Xinyu Wang, Haotian Jiang, Haolin Huang, et al.
Abstract
Speech recognition is the technology that enables machines to interpret and process human speech, converting spoken language into text or commands. This technology is essential for applications such as virtual assistants, transcription services, and communication tools. The Audio-Visual Speech Recognition (AVSR) model enhances traditional speech recognition, particularly in noisy environments, by incorporating visual modalities like lip movements and facial expressions. While traditional AVSR models trained on large-scale datasets with numerous parameters can achieve remarkable accuracy, often surpassing human performance, they also come with high training costs and deployment challenges. To address these issues, we introduce an efficient AVSR model that reduces the number of parameters through the integration of a Dual Conformer Interaction Module (DCIM). In addition, we propose a pre-training method that further optimizes model performance by selectively updating parameters, leading
Related papers
- Multilingual Audio-Visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024) · 8.60
- MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition (2024) · 10.07
- How To Teach DNNs To Pay Attention To The Visual Modality In Speech Recognition (2020) · 10.97
- Robust End-to-End Deep Audiovisual Speech Recognition (2016) · 0.00
- Tailored Design Of Audio-Visual Speech Recognition Models Using Branchformers (2024) · 2.35
- Hourglass-AVSR: Down-Up Sampling-Based Computational Efficiency Model For Audio-Visual Speech Recognition (2023) · 0.00
- AKVSR: Audio Knowledge Empowered Visual Speech Recognition By Compressing Audio Knowledge Of A Pretrained Model (2023) · 8.35
- Practice Of The Conformer Enhanced Audio-Visual HuBERT On Mandarin And English (2023) · 4.52