Abstract
Automatic speech recognition (ASR) for dysarthric speech is challenging because of atypical articulation patterns, limited annotated corpora, and the difficulty of collecting large-scale real-world data, especially in low-resource languages such as Polish. To compensate for this scarcity, synthetic dysarthric speech is often used to train ASR models. In this study, we introduce a novel annotated real-world test set of Polish dysarthric speech, curated by neuroscientific experts and designed for deep learning applications. We also introduce three synthetic Polish dysarthric datasets generated using text-to-speech and voice-cloning techniques. We perform a comprehensive statistical and acoustic analysis comparing natural and synthetic dysarthric speech, demonstrating clear separability in their acoustic feature distributions. A transformer-based ASR model fine-tuned on a mix of natural and synthetic data achieves a word error rate (WER) of 20.54 ± 0.72% on Polish synthetic samples, but a much higher WER of 71.91 ± 1.84% on Polish real dysarthric speech. On the UaSpeech dataset, our model achieves a WER of 22.87 ± 0.95%, which is close to the best reported result of 20.61% in the literature. The gap between performance on synthetic and real Polish speech underscores a critical limitation: models trained on synthetic data may overfit to artifacts of the generation process, which reduces their generalizability to real-world scenarios. We discuss the implications of this gap and propose steps toward building more robust and inclusive ASR systems for dysarthric speakers in low-resource contexts.
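For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, and it is not necessarily the scoring pipeline used in this study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: transcripts are tokenized by whitespace only; any text
    normalization (casing, punctuation) is left to the caller.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution out of three reference words.
    print(word_error_rate("ala ma kota", "ala ma psa"))
```

Because insertions count as errors while the denominator is fixed at the reference length, WER computed this way can exceed 100% for hypotheses much longer than the reference.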