Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition
2022 · Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, et al.
Abstract
With advances in self-supervised learning for the audio and visual modalities, it has become possible to learn a robust audio-visual speech representation. Such a representation should improve audio-visual speech recognition (AVSR) performance, since multi-modal inputs in principle contain richer information. In this paper, we therefore propose an audio-visual representation learning approach that builds on existing self-supervised representation learning methods for the audio modality. The proposed approach explores both the complementarity of the audio and visual modalities and long-term context dependency, using a transformer-based fusion module and a flexible masking strategy. After pre-training, the model can extract the fused representations required by AVSR. It can also be applied to single-modal tasks, e.g. audio-only or visual-only speech recognition, by simply masking out one modality in the fusion module. The proposed pre-trained model is evaluated on speech recogniti
Related papers
- Av-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations (2023) · 0.00
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020) · 0.00
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022) · 11.29
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024) · 10.07
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024) · 0.00
- Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision (2020) · 0.00
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) · 6.34
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018) · 16.67