Abstract
Automatic Speech Recognition (ASR) is increasingly adopted in healthcare for automating medical documentation, supporting telemedicine services, and developing assistive interfaces for patients with speech impairments. However, the practical deployment of ASR in clinical environments is hindered by a significant decline in recognition performance compared to controlled laboratory settings. The primary limitation is not model architecture but the scarcity, imbalance, and low representativeness of available clinical speech datasets, which often fail to capture the acoustic variability and pathological speech patterns observed in real-world clinical scenarios. This study presents a formalized methodology for constructing a robust combined medical speech dataset that incorporates both normative and pathological speech, augmented with analytically controlled transformations to emulate clinical acoustic conditions. Pathological speech is modeled through a composition of monotonic temporal, formant, and phonatory perturbations constrained within physiologically plausible limits, ensuring the preservation of linguistic content and diagnostically relevant acoustic features. The proposed dataset is validated through acoustic-spectral analysis, statistical evaluation of self-supervised embeddings, and ASR-based functional testing. Experimental results demonstrate that the augmented dataset enhances ASR robustness, reduces sensitivity to pathological variability, and mitigates domain shift, without compromising critical diagnostic cues. The methodology provides a reproducible framework for dataset engineering, establishing a foundation for scalable, reliable, and ethically compliant clinical speech technologies.