Interactive Feature Fusion For End-to-end Noise-robust Speech Recognition
2021 · Yuchen Hu, Nana Hou, Chen Chen, et al.
Abstract
Speech enhancement (SE) aims to suppress the additive noise from a noisy speech signal to improve the speech's perceptual quality and intelligibility. However, the over-suppression phenomenon in the enhanced speech might degrade the performance of the downstream automatic speech recognition (ASR) task due to the missing latent information. To alleviate this problem, we propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition that learns complementary information from the enhanced feature and the original noisy feature. Experimental results show that the proposed method achieves an absolute word error rate (WER) reduction of 4.1% over the best baseline on the RATS Channel-A corpus. Our further analysis indicates that the proposed IFF-Net can complement some missing information in the over-suppressed enhanced feature.
Related papers
- Gated Recurrent Fusion With Joint Training Framework For Robust End-to-end Speech Recognition (2020) 14.55
- Speech And Noise Dual-stream Spectrogram Refine Network With Speech Distortion Loss For Robust Speech Recognition (2023) 5.24
- An Enhanced Res2net With Local And Global Feature Fusion For Speaker Verification (2023) 19.74
- Bridging The Gap: Integrating Pre-trained Speech Enhancement And Recognition Models For Robust Speech Recognition (2024) 7.50
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018) 16.67
- Interactive Speech And Noise Modeling For Speech Enhancement (2020) 13.74
- Fusion Of Embeddings Networks For Robust Combination Of Text Dependent And Independent Speaker Recognition (2021) 4.52
- Qifusion-net: Layer-adapted Stream/non-stream Model For End-to-end Multi-accent Speech Recognition (2024) 3.58