Audio-visual Representation Learning Via Knowledge Distillation From Speech Foundation Models
2025 · Jing-Xuan Zhang, Genshun Wan, Jianqing Gao, et al.
Abstract
Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowledge distillation from SFMs. In our method, SFMs serve as teachers, from which multi-layer hidden representations are extracted using clean audio inputs. We also introduce a multi-teacher ensemble method to distill the student, which receives audio-visual data as inputs. A novel representational knowledge distillation loss is employed to train the student during pretraining, which is also applied during finetuning to further enhance the performance on downstream tasks. Our experiments utilized both a self-supervised SFM, WavLM, and a supervised SFM, iFLYTEK-speech. The results demonstrated that
Related papers
- Lip-listening: Mixing Senses To Understand Lips Using Cross Modality Knowledge Distillation For Word-based Models (2022)
- Self-supervised Audio-visual Speech Representations Learning By Multimodal Self-distillation (2022)
- Spatio-temporal Attention Mechanism And Knowledge Distillation For Lip Reading (2021)
- S-SONDO: Self-supervised Knowledge Distillation For General Audio Foundation Models (2026)
- Application Of Knowledge Distillation To Multi-task Speech Representation Learning (2022)
- Integrated Multi-level Knowledge Distillation For Enhanced Speaker Verification (2024)
- Adapting Speech Foundation Models For Unified Multimodal Speech Recognition With Large Language Models (2025)
- Audio Representation Learning By Distilling Video As Privileged Information (2023)