Improving Dual-microphone Speech Enhancement By Learning Cross-channel Features With Multi-head Attention
2022 Β· Xinmeng Xu, Rongzhi Gu, Yuexian Zou
Abstract
Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems. However, learning the mutual relationship between artificially designed spatial and spectral features is hard in the end-to-end DMSE. In this work, a novel architecture for DMSE using a multi-head cross-attention based convolutional recurrent network (MHCA-CRN) is presented. The proposed MHCA-CRN model includes a channel-wise encoding structure for preserving intra-channel features and a multi-head cross-attention mechanism for fully exploiting cross-channel features. In addition, the proposed approach specifically formulates the decoder with an extra SNR estimator to estimate frame-level SNR under a multi-task learning framework, which is expected to avoid speech distortion led by end-to-end DMSE module. Finally, a spectral gain function is adopted to further su
Authors
(none)
Tags
Stats
Related papers
- Enhancing End-to-end Multi-channel Speech Separation Via Spatial Feature Learning (2020)12.47
- Mfcca:multi-frame Cross-channel Attention For Multi-speaker ASR In Multi-party Meeting Scenario (2022)7.81
- Spatial-dccrn: Dccrn Equipped With Frame-level Angle Feature And Hybrid Filtering For Multi-channel Speech Enhancement (2022)5.84
- Lmfca-net: A Lightweight Model For Multi-channel Speech Enhancement With Efficient Narrow-band And Cross-band Attention (2025)3.58
- Efficient Multi-channel Speech Enhancement With Spherical Harmonics Injection For Directional Encoding (2023)3.58
- Leveraging Joint Spectral And Spatial Learning With MAMBA For Multichannel Speech Enhancement (2024)0.00
- Automatic Channel Selection And Spatial Feature Integration For Multi-channel Speech Recognition Across Various Array Topologies (2023)8.09
- BSS-CFFMA: Cross-domain Feature Fusion And Multi-attention Speech Enhancement Network Based On Self-supervised Embedding (2024)4.52