Learning Audio-visual Speech Representation By Masked Multimodal Cluster Prediction
2022 · Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, et al.
Abstract
Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation
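The results in the abstract are reported as word error rate (WER). As a minimal sketch, assuming the standard definition (word-level edit distance divided by the number of reference words) and not the authors' own scoring script, WER can be computed as follows. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition a WER of 32.5% corresponds to roughly one word-level error for every three reference words. Because insertions count as errors, WER can exceed 100%.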
Related papers
- HuBERT: Self-supervised Speech Representation Learning By Masked Prediction Of Hidden Units (2021) · 25.30
- Audio-visual Speech Enhancement And Separation By Utilizing Multi-modal Self-supervised Embeddings (2022) · 8.60
- Target Speech Extraction With Pre-trained AV-HuBERT And Mask-and-recover Strategy (2024) · 4.52
- u-HuBERT: Unified Mixed-modal Speech Pretraining And Zero-shot Transfer To Unlabeled Modality (2022) · 5.99
- Spatial HuBERT: Self-supervised Spatial Speech Representation Learning For A Single Talker From Multi-channel Audio (2023) · 0.00
- Self-supervised Audio-visual Speech Representations Learning By Multimodal Self-distillation (2022) · 0.00
- MS-HuBERT: Mitigating Pre-training And Inference Mismatch In Masked Language Modelling Methods For Learning Speech Representations (2024) · 4.52
- Practice Of The Conformer Enhanced Audio-visual HuBERT On Mandarin And English (2023) · 4.52