Multi-modal Multi-correlation Learning For Audio-visual Speech Separation
2022 · Xiaoyu Wang, Xiangyu Kong, Xiulian Peng, et al.
Abstract
In this paper we propose a multi-modal multi-correlation learning framework for audio-visual speech separation. Although considerable effort has gone into combining the audio and visual modalities, most previous work simply concatenates audio and visual features. To exploit the genuinely useful information in the two modalities, we define two key correlations: (1) identity correlation (between timbre and facial attributes); (2) phonetic correlation (between phonemes and lip motion). Together, these two correlations comprise the complete information, which proves advantageous for separating the target speaker's voice, especially in hard cases such as speakers of the same gender or similar speech content. For implementation, either contrastive learning or adversarial training is applied to maximize these two correlations. Both work well, while adversarial training shows its advantage by avoiding some limitations of contrastive le
Related papers
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022) 0.00
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023) 9.23
- Looking Into Your Speech: Learning Cross-modal Affinity For Audio-visual Speech Separation (2021) 11.67
- Audio-visual Speech Separation And Dereverberation With A Two-stage Multimodal Network (2019) 12.47
- Time Domain Audio Visual Speech Separation (2019) 14.62
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) 6.34
- Audio-visual Multi-channel Speech Separation, Dereverberation And Recognition (2022) 6.77
- Separate In The Speech Chain: Cross-modal Conditional Audio-visual Target Speech Extraction (2024) 0.00