Audio-visual Speech Enhancement And Separation By Utilizing Multi-modal Self-supervised Embeddings
2022 · I-Chun Chern, Kuo-Hsuan Hung, Yi-Ting Chen, et al.
Abstract
AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained via utilizing multi-modal self-supervised embeddings. Nevertheless, it is unclear if such representations can be generalized to solve real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to
Related papers
- Target Speech Extraction With Pre-trained AV-HuBERT And Mask-and-recover Strategy (2024) 4.52
- Learning Audio-visual Speech Representation By Masked Multimodal Cluster Prediction (2022) 5.99
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017) 17.39
- Enhancing Real-world Active Speaker Detection With Multi-modal Extraction Pre-training (2024) 5.24
- AV2Wav: Diffusion-based Re-synthesis From Continuous Self-supervised Features For Audio-visual Speech Enhancement (2023) 0.00
- Multichannel AV-wav2vec2: A Framework For Learning Multichannel Multi-modal Speech Representation (2024) 7.16
- Self-supervised Audio-visual Speech Representations Learning By Multimodal Self-distillation (2022) 0.00
- AV-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations (2023) 0.00