Target Speech Extraction With Pre-trained Av-hubert And Mask-and-recover Strategy
2024 Β· Wenxuan Wu, Xueyuan Chen, Xixin Wu, et al.
Abstract
Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the process. AV-HuBERT can be a useful pre-trained model for lip-reading, which has not been adopted by AV-TSE. In this paper, we would like to explore the way to integrate a pre-trained AV-HuBERT into our AV-TSE system. We have good reasons to expect an improved performance. To benefit from the inter and intra-modality correlations, we also propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. The experimental results on the VoxCeleb2 dataset show that our proposed model outperforms the baselines both in terms of subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. Furthermore, through a comparative study, we confirm that the propose
Authors
(none)
Tags
Stats
Related papers
- Audio-visual Speech Enhancement And Separation By Utilizing Multi-modal Self-supervised Embeddings (2022)8.60
- Learning Audio-visual Speech Representation By Masked Multimodal Cluster Prediction (2022)5.99
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025)5.24
- Robust Audio-visual Target Speaker Extraction With Emotion-aware Multiple Enrollment Fusion (2025)0.00
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)0.00
- Ms-hubert: Mitigating Pre-training And Inference Mismatch In Masked Language Modelling Methods For Learning Speech Representations (2024)4.52
- Incorporating Ultrasound Tongue Images For Audio-visual Speech Enhancement (2023)0.00
- Separate In The Speech Chain: Cross-modal Conditional Audio-visual Target Speech Extraction (2024)0.00