Visual Speech Recognition For Languages With Limited Labeled Data Using Automatic Labels From Whisper
2023 · Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, et al.
Abstract
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited amount of labeled data. Unlike previous methods that tried to improve VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention. To this end, we employ a Whisper model, which can conduct both language identification and audio-based speech recognition. It serves to filter data of the desired languages and transcribe labels from the unannotated, multilingual audio-visual data pool. By comparing the performance of VSR models trained on automatic labels and on human-annotated labels, we show that we can achieve VSR performance similar to that of human-annotated labels even without utilizing human annotations. Through the automated labeling process, we label large-scale unlab
Related papers
- LiteVSR: Efficient Visual Speech Recognition By Learning From Speech Representations Of Unlabeled Data (2023)
- Whisper-LM: Improving ASR Models With Language Models For Low-resource Languages (2025)
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)
- mWhisper-Flamingo For Multilingual Audio-visual Noise-robust Speech Recognition (2025)
- Whisper Turns Stronger: Augmenting wav2vec 2.0 For Superior ASR In Low-resource Languages (2024)
- Whisper Speaker Identification: Leveraging Pre-trained Multilingual Transformers For Robust Speaker Embeddings (2025)
- SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision (2023)
- Whistle: Data-efficient Multilingual And Crosslingual Speech Recognition Via Weakly Phonetic Supervision (2024)