Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision
2020 · Abhinav Shukla, Stavros Petridis, Maja Pantic
Abstract
The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). The visual pretext task drives the audio representations to capture information related to lip movements. This enriches the audio encoder with visual information and the encoder can be used for evaluation without the visual modality. Our method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and s
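The joint objective described in the abstract can be illustrated with a toy sketch: one shared encoder feeds two prediction heads, and the training loss is a weighted sum of an audio-only term and a visual term. Everything below is an assumption made for illustration, including the linear encoder, the scalar targets, the squared-error losses, and the names `encode`, `head`, `joint_loss`, `audio_weight`, and `visual_weight`. The paper's visual task generates talking faces from audio; a single scalar target stands in for that here.

```python
# Toy sketch of a joint audio + visual self-supervised objective.
# All names, weights, and targets are illustrative assumptions,
# not the architecture or losses used in the paper.

def encode(frame, weights):
    """Hypothetical raw-audio encoder: one linear unit per output feature."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

def head(features, weights):
    """Hypothetical linear prediction head producing a single scalar."""
    return sum(w * f for w, f in zip(weights, features))

def mse(pred, target):
    """Squared error between a scalar prediction and its target."""
    return (pred - target) ** 2

def joint_loss(frame, audio_target, visual_target, params,
               audio_weight=1.0, visual_weight=1.0):
    """Weighted sum of an audio-only loss and a visual pretext loss.

    Both heads read the same encoder output, so minimizing the sum
    pushes the shared features to serve both tasks.
    """
    features = encode(frame, params["encoder"])
    audio_loss = mse(head(features, params["audio_head"]), audio_target)
    visual_loss = mse(head(features, params["visual_head"]), visual_target)
    return audio_weight * audio_loss + visual_weight * visual_loss

params = {
    "encoder": [[0.5, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, -1.0]],
    "audio_head": [1.0, 0.0],   # reads only the first feature
    "visual_head": [0.0, 1.0],  # reads only the second feature
}
frame = [0.5, 1.0, 0.5, 0.0]    # four made-up raw waveform samples
loss = joint_loss(frame, audio_target=0.5, visual_target=0.0, params=params)
audio_only = joint_loss(frame, 0.5, 0.0, params, visual_weight=0.0)
```

Setting `visual_weight=0.0` reduces the sketch to audio-only self-supervision, which is one way to picture the comparison between audio-only and joint audiovisual training. At evaluation time only `encode` would be needed, consistent with the abstract's point that the encoder can be used without the visual modality.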
Related papers
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022) · 6.77
- Learning Self-supervised Audio-visual Representations For Sound Recommendations (2024) · 2.26
- AV-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations (2023) · 0.00
- Investigating Self-supervised Representations For Audio-visual Deepfake Detection (2025) · 0.00
- Multichannel AV-wav2vec2: A Framework For Learning Multichannel Multi-modal Speech Representation (2024) · 7.16
- LiRA: Learning Visual Speech Representations From Audio Through Self-supervision (2021) · 11.58
- Conformer-based Self-supervised Learning For Non-speech Audio Tasks (2021) · 7.50
- Learning Problem-agnostic Speech Representations From Multiple Self-supervised Tasks (2019) · 15.54