Analysis Of Deep Clustering As Preprocessing For Automatic Speech Recognition Of Sparsely Overlapping Speech
2019 · Tobias Menne, Ilya Sklyar, Ralf Schlüter, et al.
Abstract
Automatic speech recognition (ASR) systems degrade significantly when the audio signal contains cross-talk. Deep clustering (DPCL) is one of the recently proposed approaches to multi-speaker ASR. Combining DPCL with a state-of-the-art hybrid acoustic model, we obtain a word error rate (WER) of 16.5% on the commonly used wsj0-2mix dataset, which is, to the best of our knowledge, the best performance reported thus far. The wsj0-2mix dataset contains simulated cross-talk in which the speech of multiple speakers overlaps for almost the entire utterance. In a more realistic ASR scenario, the audio signal contains significant portions of single-speaker speech, and only part of the signal contains speech of multiple competing speakers. This paper investigates obstacles to applying DPCL as a preprocessing method for ASR in such a scenario of sparsely overlapping speech. To this end we present a data simulation approach, closely related t
Related papers
- Single-channel Multi-speaker Separation Using Deep Clustering (2016)
- Multi-channel Speech Separation Using Deep Embedding Model With Multilayer Bootstrap Networks (2019)
- Low-latency Deep Clustering For Speech Separation (2019)
- Elevating Robust Multi-talker ASR By Decoupling Speaker Separation And Speech Recognition (2025)
- Assessing The Robustness Of Spectral Clustering For Deep Speaker Diarization (2024)
- Enhancements For Audio-only Diarization Systems (2019)
- Highly Efficient Real-time Streaming And Fully On-device Speaker Diarization With Multi-stage Clustering (2022)
- Directed Speech Separation For Automatic Speech Recognition Of Long Form Conversational Speech (2021)