VSANet: Real-time Speech Enhancement Based On Voice Activity Detection And Causal Spatial Attention
2023 · Yuewei Zhang, Huanbin Zou, Jie Zhu
Abstract
Deep learning-based speech enhancement (SE) methods typically take the clean speech waveform or its time-frequency spectrum as the learning target and train a deep neural network (DNN) by minimizing the loss between the DNN's output and that target. This conventional single-task learning paradigm has proven effective, but we find that a multi-task learning framework can improve SE performance. Specifically, we design a framework containing an SE module and a voice activity detection (VAD) module that share the same encoder, and the whole network is optimized with a weighted combination of the two modules' losses. Moreover, we design a causal spatial attention (CSA) block to strengthen the representation capability of the DNN. Combining the VAD-aided multi-task learning framework with the CSA block, we name our SE network VSANet. Experimental results demonstrate the benefits of both multi-task learning and the CSA block, which together give VSANet strong SE performance.
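As a rough illustration of the weighted two-task objective described in the abstract, the sketch below combines an SE loss and a VAD loss into one scalar. The abstract states only that the two module losses are weighted; the specific choices here (mean squared error for SE, binary cross-entropy for VAD, a single weight `alpha`, and the `(1 - alpha)` / `alpha` split) are assumptions made for illustration and are not taken from the paper.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length sequences (assumed SE loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce(probs, labels, eps=1e-7):
    """Binary cross-entropy between per-frame speech probabilities
    and 0/1 speech-presence labels (assumed VAD loss)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def multitask_loss(se_out, se_target, vad_probs, vad_labels, alpha=0.5):
    """Weighted sum of the SE and VAD losses.

    alpha is a hypothetical weighting hyperparameter: alpha=0 trains
    on the SE loss alone, alpha=1 on the VAD loss alone.
    """
    se_loss = mse(se_out, se_target)
    vad_loss = bce(vad_probs, vad_labels)
    return (1.0 - alpha) * se_loss + alpha * vad_loss
```

In an actual network both losses would be computed from the outputs of two heads that sit on the shared encoder, so the gradient of this single scalar updates the encoder with information from both tasks.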
Related papers
- Speech Enhancement Aided End-to-end Multi-task Learning For Voice Activity Detection (2020)
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017)
- Advancing VAD Systems Based On Multi-task Learning With Improved Model Structures (2023)
- Efficient Encoder-decoder And Dual-path Conformer For Comprehensive Feature Learning In Speech Enhancement (2023)
- LSTMSE-Net: Long Short Term Speech Enhancement Network For Audio-visual Speech Enhancement (2024)
- Speech Enhancement Using Multi-stage Self-attentive Temporal Convolutional Networks (2021)
- LiSenNet: Lightweight Sub-band And Dual-path Modeling For Real-time Speech Enhancement (2024)
- SE Territory: Monaural Speech Enhancement Meets The Fixed Virtual Perceptual Space Mapping (2023)