Human Listening And Live Captioning: Multi-task Training For Speech Enhancement
2021 · Sefik Emre Eskimez, Xiaofei Wang, Min Tang, et al.
Abstract
With the surge of online meetings, it has become more critical than ever to provide high-quality speech audio and live captioning under various noise conditions. However, most monaural speech enhancement (SE) models introduce processing artifacts and thus degrade the performance of downstream tasks, including automatic speech recognition (ASR). This paper proposes a multi-task training framework to make the SE models unharmful to ASR. Because most ASR training samples do not have corresponding clean signal references, we alternately perform two model update steps called SE-step and ASR-step. The SE-step uses clean and noisy signal pairs and a signal-based loss function. The ASR-step applies a pre-trained ASR model to training signals enhanced with the SE model. A cross-entropy loss between the ASR output and reference transcriptions is calculated to update the SE model parameters. Experimental results with realistic large-scale settings using ASR models trained on 75,000-hour data show
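The alternating update schedule described in the abstract can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration and is not taken from the paper: the "SE model" is a single scalar gain, the frozen "ASR model" is a fixed logistic classifier over the mean of the enhanced signal, and the function names (se_step, asr_step, train) are hypothetical. The only thing the sketch tries to mirror is the structure: an SE-step uses noisy/clean pairs with a signal-based loss, an ASR-step backpropagates a cross-entropy loss through a frozen recognizer, and only the SE parameter is ever updated.

```python
import math


def se_step(gain, noisy, clean, lr=0.1):
    """SE-step (toy): one gradient step on the MSE between gain*noisy and clean."""
    n = len(noisy)
    grad = sum(2.0 * (gain * x - c) * x for x, c in zip(noisy, clean)) / n
    return gain - lr * grad


def asr_loss(gain, noisy, label, w, b):
    """Cross-entropy of a frozen toy classifier (w, b) on the enhanced signal."""
    feature = gain * sum(noisy) / len(noisy)  # mean of the enhanced signal
    prob = 1.0 / (1.0 + math.exp(-(w * feature + b)))
    loss = -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
    return loss, prob


def asr_step(gain, noisy, label, w, b, lr=0.1):
    """ASR-step (toy): the classifier stays frozen; only the SE gain is updated."""
    _, prob = asr_loss(gain, noisy, label, w, b)
    # Chain rule: dCE/dlogit = prob - label, dlogit/dgain = w * mean(noisy).
    grad = (prob - label) * w * sum(noisy) / len(noisy)
    return gain - lr * grad


def train(paired, transcribed, w, b, num_steps=200, gain=0.0):
    """Alternate SE-steps (paired data) and ASR-steps (transcribed-only data)."""
    for step in range(num_steps):
        if step % 2 == 0:
            noisy, clean = paired[(step // 2) % len(paired)]
            gain = se_step(gain, noisy, clean)
        else:
            noisy, label = transcribed[(step // 2) % len(transcribed)]
            gain = asr_step(gain, noisy, label, w, b)
    return gain


if __name__ == "__main__":
    noisy = [1.0, 2.0, 3.0]
    clean = [0.5, 1.0, 1.5]
    final_gain = train([(noisy, clean)], [(noisy, 1)], w=1.0, b=0.0)
    print(f"final gain: {final_gain:.3f}")
```

In this toy setup the two objectives pull the gain in slightly different directions, so the result settles between them. The two data sources are kept separate because, as the abstract notes, most ASR training samples have no clean reference signal.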
Related papers
- Joint Training Of Speech Enhancement And Self-supervised Model For Noise-robust ASR (2022)
- How Does End-to-end Speech Recognition Training Impact Speech Enhancement Artifacts? (2023)
- Bridging The Gap: Integrating Pre-trained Speech Enhancement And Recognition Models For Robust Speech Recognition (2024)
- Toward Universal Speech Enhancement For Diverse Input Conditions (2023)
- On The Efficacy And Noise-robustness Of Jointly Learned Speech Emotion And Automatic Speech Recognition (2023)
- Reinforcement Learning Based Speech Enhancement For Robust Speech Recognition (2018)
- Towards Decoupling Frontend Enhancement And Backend Recognition In Monaural Robust ASR (2024)
- Lisennet: Lightweight Sub-band And Dual-path Modeling For Real-time Speech Enhancement (2024)