Rethinking Processing Distortions: Disentangling The Impact Of Speech Enhancement Errors On Speech Recognition Performance
2024 · Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, et al.
Abstract
It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact err
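The decomposition and ratio-modification idea described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a BSS-Eval-style construction with zero-delay orthogonal projections and separately available reference signals for speech, interference, and noise; the paper's exact formulation may differ (for example, by using time-shifted references). The function names `project`, `decompose`, and `remix`, and the weights `w_interf`, `w_noise`, `w_artif`, are hypothetical and chosen only for illustration.

```python
import numpy as np

def project(y, basis):
    """Orthogonal (least-squares) projection of y onto the column span of basis."""
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return basis @ coef

def decompose(enhanced, speech, interference, noise):
    """Split an enhanced signal into target, interference, noise, and artifact parts.

    Assumption: nested subspaces span{speech}, span{speech, interference},
    span{speech, interference, noise}; the artifact is whatever lies outside them.
    """
    p_s = project(enhanced, speech[:, None])
    p_si = project(enhanced, np.stack([speech, interference], axis=1))
    p_sin = project(enhanced, np.stack([speech, interference, noise], axis=1))
    target = p_s
    e_interf = p_si - p_s       # explained by interference beyond the speech
    e_noise = p_sin - p_si      # explained by noise beyond speech + interference
    e_artif = enhanced - p_sin  # explained by none of the reference signals
    return target, e_interf, e_noise, e_artif

def remix(target, e_interf, e_noise, e_artif, w_interf=1.0, w_noise=1.0, w_artif=1.0):
    """Rebuild a signal with manually rescaled error components (hypothetical weights)."""
    return target + w_interf * e_interf + w_noise * e_noise + w_artif * e_artif

# Synthetic stand-ins for real recordings (random signals, illustration only).
rng = np.random.default_rng(0)
n = 16000
speech, interference, noise, distortion = rng.standard_normal((4, n))
enhanced = 0.9 * speech + 0.2 * interference + 0.1 * noise + 0.05 * distortion

target, e_interf, e_noise, e_artif = decompose(enhanced, speech, interference, noise)

# Variant with the artifact error removed and the other errors left untouched.
no_artifact = remix(target, e_interf, e_noise, e_artif, w_artif=0.0)
```

Feeding variants such as `no_artifact`, or versions with only `w_noise` or `w_interf` changed, to a fixed ASR back-end is one way to compare the effect of each error type in isolation, which is the kind of analysis the abstract describes.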
Related papers
- How Bad Are Artifacts?: Analyzing The Impact Of Speech Enhancement Errors On ASR (2022)
- How Does End-to-end Speech Recognition Training Impact Speech Enhancement Artifacts? (2023)
- Bridging The Gap: Integrating Pre-trained Speech Enhancement And Recognition Models For Robust Speech Recognition (2024)
- Investigation Of Practical Aspects Of Single Channel Speech Separation For ASR (2021)
- Improving Noise Robust Automatic Speech Recognition With Single-channel Time-domain Enhancement Network (2020)
- Bridging The Gap Between Monaural Speech Enhancement And Recognition With Distortion-independent Acoustic Modeling (2019)
- Speaker Reinforcement Using Target Source Extraction For Robust Automatic Speech Recognition (2022)
- Reducing The Gap Between Pretrained Speech Enhancement And Recognition Models Using A Real Speech-trained Bridging Module (2025)