mWhisper-Flamingo For Multilingual Audio-visual Noise-robust Speech Recognition
2025 · Andrew Rouditchenko, Samuel Thomas, Hilde Kuehne, et al.
Abstract
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
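The abstract describes decoder modality dropout only at a high level: training sees paired audio-visual inputs as well as audio-only and video-only inputs. A minimal sketch of that sampling idea follows. The function names, the use of `None` to mark a dropped modality, and the probabilities `p_av` and `p_audio_only` are illustrative assumptions, not values or code from the paper.

```python
import random


def choose_modalities(rng, p_av=0.5, p_audio_only=0.25):
    """Pick which inputs one training step sees.

    Returns (use_audio, use_video). The probabilities are hypothetical
    defaults chosen for illustration; the remaining probability mass
    (1 - p_av - p_audio_only) goes to the video-only case.
    """
    r = rng.random()
    if r < p_av:
        return True, True      # paired audio-visual input
    if r < p_av + p_audio_only:
        return True, False     # audio only: visual features dropped
    return False, True         # video only: audio features dropped


def apply_modality_dropout(audio_feats, video_feats, rng,
                           p_av=0.5, p_audio_only=0.25):
    """Return the features the decoder would receive for this step.

    A dropped modality is represented as None (an assumption of this
    sketch); at least one modality is always kept.
    """
    use_audio, use_video = choose_modalities(rng, p_av, p_audio_only)
    return (audio_feats if use_audio else None,
            video_feats if use_video else None)


# Example: sample the input configuration for a few training steps.
rng = random.Random(0)
for step in range(5):
    audio, video = apply_modality_dropout([0.1, 0.2], [0.3, 0.4], rng)
    print(step, "audio" if audio is not None else "-",
          "video" if video is not None else "-")
```

In a real model the dropped modality would more likely be masked or skipped inside the decoder's attention layers than passed as `None`; the sketch only shows the per-step choice among the three input configurations.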
Related papers
- XLAVS-R: Cross-lingual Audio-visual Speech Representation Learning For Noise-robust Speech Perception (2024)
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024)
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024)
- Visual Speech Recognition For Languages With Limited Labeled Data Using Automatic Labels From Whisper (2023)
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023)
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)
- Robust Wake Word Spotting With Frame-level Cross-modal Attention Based Audio-visual Conformer (2024)