Separate In The Speech Chain: Cross-modal Conditional Audio-visual Target Speech Extraction
2024 · Zhaoxi Mu, Xinyu Yang
Abstract
The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semanti
Related papers
- AV-SepFormer: Cross-attention SepFormer For Audio-visual Target Speaker Extraction (2023) 0.00
- Listening While Speaking And Visualizing: Improving ASR Through Multimodal Chain (2019) 4.52
- Looking Into Your Speech: Learning Cross-modal Affinity For Audio-visual Speech Separation (2021) 11.67
- Dual-path Cross-modal Attention For Better Audio-visual Speech Extraction (2022) 0.00
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022) 0.00
- Multi-modal Multi-correlation Learning For Audio-visual Speech Separation (2022) 5.84
- Robust Audio-visual Target Speaker Extraction With Emotion-aware Multiple Enrollment Fusion (2025) 0.00
- An Overview Of Deep-learning-based Audio-visual Speech Enhancement And Separation (2020) 18.31