AV-SepFormer: Cross-attention SepFormer For Audio-visual Target Speaker Extraction
2023 · Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, et al.
Abstract
Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based dual-scale attention model that uses cross- and self-attention to fuse and model audio and visual features. AV-SepFormer splits the audio feature into chunks whose number equals the length of the visual feature. Self- and cross-attention are then employed to model and fuse the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces positional information between and within chunks and provides significant gains over the traditional positional encoding. Our model has two key advantages: the time granularity of the chunked audio feature is synchronized to the visual feature, which alleviates the harm caused by the inconsistency between the audio and video sampling rates; by combining self- and cross-attention, feature fusion and speech extraction processes are
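The chunking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `chunk_audio` and `positions_2d`, the use of non-overlapping chunks, and the padding of the last chunk are all assumptions made for illustration. The sketch only shows how an audio frame sequence can be split into as many chunks as there are visual frames, and how each audio frame then gets a two-part index (chunk index, offset within the chunk) of the kind a 2D positional encoding could be built on.

```python
import math


def chunk_audio(frames, num_chunks, pad=None):
    """Split audio frames into `num_chunks` equal-length chunks.

    `num_chunks` is meant to be the number of visual frames, so that
    chunk i lines up in time with visual frame i. Non-overlapping chunks
    and tail padding are assumptions of this sketch.
    """
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    # Smallest chunk length that covers every audio frame.
    chunk_len = max(1, math.ceil(len(frames) / num_chunks))
    # Pad the tail so the sequence divides evenly into chunks.
    padded = list(frames) + [pad] * (chunk_len * num_chunks - len(frames))
    return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(num_chunks)]


def positions_2d(num_chunks, chunk_len):
    """Return (chunk index, offset within chunk) for every audio frame.

    These are plain integer indices; turning them into embeddings is
    left out because the abstract does not specify how that is done.
    """
    return [[(c, k) for k in range(chunk_len)] for c in range(num_chunks)]


# Hypothetical sizes: 10 audio frames aligned to 4 visual frames.
audio_frames = list(range(10))
chunks = chunk_audio(audio_frames, num_chunks=4)
positions = positions_2d(len(chunks), len(chunks[0]))
```

With these hypothetical sizes each of the 4 chunks holds 3 audio frames, so attention within a chunk runs at the audio frame rate while attention across chunks runs at the visual frame rate.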
Related papers
- Separate In The Speech Chain: Cross-modal Conditional Audio-visual Target Speech Extraction (2024) 0.00
- Dual-path Cross-modal Attention For Better Audio-visual Speech Extraction (2022) 0.00
- Robust Audio-visual Target Speaker Extraction With Emotion-aware Multiple Enrollment Fusion (2025) 0.00
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022) 0.00
- Seeing Through The Conversation: Audio-visual Speech Separation Based On Diffusion Model (2023) 7.50
- Target Speech Extraction With Pre-trained AV-HuBERT And Mask-and-recover Strategy (2024) 4.52
- Audio Visual Segmentation Through Text Embeddings (2025) 1.81
- X-CrossNet: A Complex Spectral Mapping Approach To Target Speaker Extraction With Cross Attention Speaker Embedding Fusion (2024) 0.00