Looking Into Your Speech: Learning Cross-modal Affinity For Audio-visual Speech Separation
2021 · Jiyoung Lee, Soo-Whan Chung, Sunok Kim, et al.
Abstract
In this paper, we address the problem of separating individual speech signals from videos using audio-visual neural processing. Most conventional approaches use frame-wise matching criteria to extract shared information between co-occurring audio and video, so their performance depends heavily on the accuracy of audio-visual synchronization and the effectiveness of their representations. To overcome the frame discontinuity problem between the two modalities caused by transmission delay mismatch or jitter, we propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams. Because the global term provides stability over a temporal sequence at the utterance level, it resolves the label permutation problem, which is characterized by inconsistent assignments. By extending the proposed cross-modal affinity to a complex-valued network, we further improve the separation performance in the complex spectral domain.
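To make the idea of a cross-modal affinity concrete, here is a minimal sketch in plain Python. It is not CaffNet and does not reproduce the authors' architecture; it only illustrates the general technique of scoring every audio frame against every visual frame, rather than matching frames index by index. The function names, the use of cosine similarity, and the softmax normalization are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-8)

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_affinity(audio, visual):
    """Hypothetical helper: returns a (num_audio_frames x num_visual_frames)
    matrix in which row i holds the normalized affinity of audio frame i
    to every visual frame. Assumption: both streams share a feature size."""
    return [softmax([cosine(a, v) for v in visual]) for a in audio]

def align_visual_to_audio(audio, visual):
    """Hypothetical helper: re-samples the visual stream onto the audio
    timeline as an affinity-weighted average of visual frames."""
    affinity = cross_modal_affinity(audio, visual)
    dim = len(visual[0])
    return [
        [sum(w * v[d] for w, v in zip(row, visual)) for d in range(dim)]
        for row in affinity
    ]

# Toy data: 4 audio frames and 2 visual frames with 2-dimensional features.
audio = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0]]

affinity = cross_modal_affinity(audio, visual)
aligned = align_visual_to_audio(audio, visual)
```

Because each audio frame attends over all visual frames, the two streams need not have the same frame rate or be exactly synchronized, which is the situation the abstract describes as delay mismatch or jitter. A learned model would replace the fixed cosine score with trained projections.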
Related papers
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022) 0.00
- Multi-modal Multi-correlation Learning For Audio-visual Speech Separation (2022) 5.84
- Time Domain Audio Visual Speech Separation (2019) 14.62
- Separate In The Speech Chain: Cross-modal Conditional Audio-visual Target Speech Extraction (2024) 0.00
- Audio-visual Approach For Multimodal Concurrent Speaker Detection (2024) 0.00
- Audio-visual Speech Separation And Dereverberation With A Two-stage Multimodal Network (2019) 12.47
- Seeing Through The Conversation: Audio-visual Speech Separation Based On Diffusion Model (2023) 7.50
- Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation (2018) 14.19