Dual-path Cross-modal Attention For Better Audio-visual Speech Extraction
2022 Β· Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson
Abstract
Audio-visual target speech extraction, which aims to extract a certain speaker's speech from the noisy mixture by looking at lip movements, has made significant progress combining time-domain speech separation models and visual feature extractors (CNN). One problem of fusing audio and video information is that they have different time resolutions. Most current research upsamples the visual features along the time dimension so that audio and video features are able to align in time. However, we believe that lip movement should mostly contain long-term, or phone-level information. Based on this assumption, we propose a new way to fuse audio-visual features. We observe that for DPRNN \cite\{dprnn\}, the interchunk dimension's time resolution could be very close to the time resolution of video frames. Like \cite\{sepformer\}, the LSTM in DPRNN is replaced by intra-chunk and inter-chunk self-attention, but in the proposed algorithm, inter-chunk attention incorporates the visual features as
Authors
(none)
Tags
Stats
Related papers
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022)0.00
- AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021)0.00
- Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021)0.00
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018)16.67
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020)0.00
- Separate In The Speech Chain: Cross-modal Conditional Audio-visual Target Speech Extraction (2024)0.00
- Av-sepformer: Cross-attention Sepformer For Audio-visual Target Speaker Extraction (2023)0.00
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023)6.34