Systematic Annotation Framework for Robust Speech Recognition

Abstract

This study proposes a systematic annotation framework to improve the robustness of end-to-end automatic speech recognition (ASR) in a complex low-resource dialect setting, using Hainan Lingao dialect as a case study. The framework consists of three components: semantically complete utterance segmentation instead of fixed-duration clipping; structured annotation at the lexical, sentence, and pragmatic-behavior levels, including explicit tags for dialectal variation, environmental noise, and unintelligible speech as well as rules for handling overlapping speech; and a three-stage quality-assurance workflow with iterative guideline refinement. The framework was implemented in the construction of a Hainan Lingao dialect corpus from 16 speakers and evaluated using 80 h/10 h/10 h training, validation, and test splits under an identical Conformer-based ASR configuration. Compared with a plain-transcription baseline using no special tags and fixed 3 s segmentation, the full specification reduced character error rate (CER) from 8.7% to 7.9%, 24.3% to 18.5%, 19.5% to 15.2%, and 15.2% to 13.1% on clean, noisy, dialogue, and dialect-variation test sets, respectively. The corresponding sentence error rate (SER) decreased from 17.5% to 15.2%, 39.6% to 32.1%, 34.2% to 27.8%, and 28.3% to 24.5%. Ablation experiments further examined the individual contributions of pragmatic-behavior tags, noise tags, semantic segmentation, and dialect-feature annotation. Paired bootstrap testing with 10,000 resamples showed that all baseline-to-full-specification improvements were statistically significant (p < 0.01). These results indicate that systematic annotation can improve ASR robustness in this Lingao low-resource dialect setting, with the largest relative CER reductions observed in the noisy (23.7%) and dialogue (22.1%) scenarios.
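The paired bootstrap comparison mentioned above can be illustrated with a minimal sketch. The abstract does not describe the exact procedure, so the details here are assumptions: utterances are resampled with replacement, CER is recomputed on each resample for both systems from per-utterance character edit counts, and the one-sided p-value is the fraction of resamples in which the full-specification system is not better than the baseline. The function name, argument names, and toy numbers are hypothetical and are not taken from the study.

```python
import random


def paired_bootstrap_p(errs_a, errs_b, ref_lens, n_resamples=10_000, seed=0):
    """Estimate a one-sided p-value for 'system B has lower CER than system A'.

    errs_a, errs_b: per-utterance character edit counts for systems A and B
                    on the same test utterances (hypothetical inputs).
    ref_lens:       per-utterance reference lengths in characters (all > 0).
    Returns the fraction of resamples where B's CER is NOT lower than A's.
    """
    assert len(errs_a) == len(errs_b) == len(ref_lens)
    rng = random.Random(seed)
    n = len(ref_lens)
    not_better = 0
    for _ in range(n_resamples):
        # Draw the same utterance indices for both systems (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        total_len = sum(ref_lens[i] for i in idx)
        cer_a = sum(errs_a[i] for i in idx) / total_len
        cer_b = sum(errs_b[i] for i in idx) / total_len
        if cer_b >= cer_a:
            not_better += 1
    return not_better / n_resamples


# Toy data, invented for illustration only: B makes fewer errors on every utterance.
baseline_errs = [3, 5, 2, 4]
full_spec_errs = [2, 4, 1, 3]
lengths = [20, 30, 25, 25]
p_value = paired_bootstrap_p(baseline_errs, full_spec_errs, lengths)
```

Under this sketch, a result below 0.01 would correspond to the significance threshold reported in the abstract. Pairing matters because both systems are scored on the same resampled utterances, so utterance difficulty cancels out of the comparison.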
