Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention
2022 · Junwen Xiong, Peng Zhang, Lei Xie, et al.
Abstract
Multi-modal speech separation has shown a clear advantage in isolating the target speaker in noisy multi-talker environments. Unfortunately, most current separation strategies rely on a straightforward fusion of features learned separately for each modality, which gives insufficient consideration to the inter-relationships between modalities. Inspired by learning joint feature representations from audio and visual streams with an attention mechanism, this study proposes a novel cross-modal fusion strategy that benefits the whole framework through semantic correlations between the modalities. To further improve audio-visual speech separation, the dense optical flow of lip motion is incorporated to strengthen the robustness of the visual representation. The proposed method is evaluated on two public audio-visual speech separation benchmark datasets. The overall improvement in performance demonstrates that the additional motion network effectively
Related papers
- Looking Into Your Speech: Learning Cross-modal Affinity For Audio-visual Speech Separation (2021)
- Multi-modal Multi-correlation Learning For Audio-visual Speech Separation (2022)
- Seeing Through The Conversation: Audio-visual Speech Separation Based On Diffusion Model (2023)
- Joint Speaker Features Learning For Audio-visual Multichannel Speech Separation And Recognition (2024)
- Dual-path Cross-modal Attention For Better Audio-visual Speech Extraction (2022)
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018)
- Audio-visual Speech Separation And Dereverberation With A Two-stage Multimodal Network (2019)
- Time Domain Audio Visual Speech Separation (2019)