Av-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations
2023 Β· Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, et al.
Abstract
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec which addresses these challenges and builds audio-visual representations based on predicting contextualized representations which has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.
Authors
(none)
Tags
Stats
Related papers
- Data2vec: A General Framework For Self-supervised Learning In Speech, Vision And Language (2022)0.00
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022)6.77
- Multichannel Av-wav2vec2: A Framework For Learning Multichannel Multi-modal Speech Representation (2024)7.16
- Self-supervised Audio-visual Speech Representations Learning By Multimodal Self-distillation (2022)0.00
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022)11.29
- Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision (2020)0.00
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)0.00
- Audio-visual Speech Enhancement And Separation By Utilizing Multi-modal Self-supervised Embeddings (2022)8.60