Self-supervised Audio-visual Speech Representations Learning By Multimodal Self-distillation
2022 · Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, et al.
Abstract
In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations by multimodal self-distillation. AV2vec has a student and a teacher module, in which the student performs a masked latent feature regression task using the multimodal target features generated online by the teacher. The parameters of the teacher model are a momentum update of the student. Since our target features are generated online, AV2vec needs no iteration step like AV-HuBERT and the total training time cost is reduced to less than one-fifth. We further propose AV2vec-MLM in this study, which augments AV2vec with a masked language model (MLM)-style loss using multitask learning. Our experimental results show that AV2vec achieved comparable performance to the AV-HuBERT baseline. When combined with an MLM-style loss, AV2vec-MLM outperformed baselines and achieved the best performance on the downstream tasks.
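The two mechanisms named in the abstract, a teacher whose parameters are a momentum update of the student and a regression loss computed on masked positions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scalar "parameters", the value of `tau`, and the use of mean squared error as the regression loss are all assumptions made for illustration only.

```python
from typing import Dict, List


def momentum_update(teacher: Dict[str, float],
                    student: Dict[str, float],
                    tau: float) -> Dict[str, float]:
    """Exponential moving average step (hypothetical helper).

    Each teacher parameter moves toward the student parameter:
    teacher <- tau * teacher + (1 - tau) * student
    """
    return {name: tau * teacher[name] + (1.0 - tau) * student[name]
            for name in teacher}


def masked_regression_loss(pred: List[float],
                           target: List[float],
                           mask: List[bool]) -> float:
    """Mean squared error over masked positions only.

    MSE is an assumption here; the abstract only says "regression".
    """
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


# Toy values: one scalar parameter and three "frames".
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = momentum_update(teacher, student, tau=0.9)  # tau chosen arbitrarily

student_pred = [1.0, 2.0, 3.0]      # student outputs per frame
teacher_target = [1.0, 0.0, 1.0]    # targets generated online by the teacher
mask = [False, True, True]          # only masked frames contribute to the loss
loss = masked_regression_loss(student_pred, teacher_target, mask)
```

Because the targets come from the current teacher at every step, a scheme of this shape needs no separate offline target-generation pass, which is the property the abstract credits for the reduced training time.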
Related papers
- AV-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations (2023)
- Multichannel AV-wav2vec2: A Framework For Learning Multichannel Multi-modal Speech Representation (2024)
- Learning Audio-visual Speech Representation By Masked Multimodal Cluster Prediction (2022)
- Audio-visual Representation Learning Via Knowledge Distillation From Speech Foundation Models (2025)
- Audio-visual Speech Enhancement And Separation By Utilizing Multi-modal Self-supervised Embeddings (2022)
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025)
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022)
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022)