Multichannel Av-wav2vec2: A Framework For Learning Multichannel Multi-modal Speech Representation
2024 Β· Qiushi Zhu, Jie Zhang, Yu Gu, et al.
Abstract
Self-supervised speech pre-training methods have developed rapidly in recent years, which show to be very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing is suffering from the scarcity of labeled multichannel data and complex ambient noises. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose a multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. First, we propose a multi-path structure to process multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the spatiotemporal information in multichannel speech data. Second, based on contrastive learning, we
Authors
(none)
Tags
Stats
Related papers
- Av-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations (2023)0.00
- Multi-task Voice Activated Framework Using Self-supervised Learning (2021)6.34
- Self-supervised Audio-visual Speech Representations Learning By Multimodal Self-distillation (2022)0.00
- A Noise-robust Self-supervised Pre-training Model Based Speech Representation Learning For Automatic Speech Recognition (2022)11.19
- Ccc-wav2vec 2.0: Clustering Aided Cross Contrastive Self-supervised Learning Of Speech Representations (2022)7.81
- Av2wav: Diffusion-based Re-synthesis From Continuous Self-supervised Features For Audio-visual Speech Enhancement (2023)0.00
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022)6.77
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022)11.29