Bridging the Reality Gap in ASR for Low-Resource Dysarthric Speech: Evaluating Performance on Synthetic and Real Data

Abstract

Automatic speech recognition (ASR) for dysarthric speech is challenging because of atypical articulation patterns, scarce annotated corpora, and the difficulty of collecting large-scale real-world data, especially in low-resource languages such as Polish. To compensate for this scarcity, synthetic dysarthric speech is often used to train ASR models. In this study, we introduce a novel annotated real-world test set of Polish dysarthric speech, curated by neuroscientific experts and designed for deep learning applications. We also introduce three synthetic Polish dysarthric datasets generated using text-to-speech and voice-cloning techniques. We perform a comprehensive statistical and acoustic analysis comparing natural and synthetic dysarthric speech, demonstrating clear separability in their acoustic feature distributions. A transformer-based ASR model fine-tuned on a mix of natural and synthetic data achieves a word error rate (WER) of 20.54 ± 0.72% on Polish synthetic samples, but a much higher WER of 71.91 ± 1.84% on real Polish dysarthric speech. The same model achieves a WER of 22.87 ± 0.95% on the UaSpeech dataset, close to the best reported result of 20.61% in the literature. The gap between performance on synthetic and real Polish speech underscores a critical limitation: models trained on synthetic data may overfit to artifacts of the generation process, reducing their generalizability to real-world scenarios. We discuss the implications of this gap and propose steps toward building more robust and inclusive ASR systems for dysarthric speakers in low-resource contexts.
