Abstract
Multi-dialect speech recognition is essential for the widespread adoption of intelligent interactive systems and for the preservation of dialectal language resources. However, existing approaches often tightly couple acoustic and language modeling, and the prevailing one-dialect-one-model paradigm introduces parameter redundancy while failing to exploit characteristics shared across dialects. In this paper, we propose an exploratory unified cross-dialect speech recognition framework based on a tonal Pinyin intermediate representation. The framework decouples the original speech-to-Chinese-character task into two stages: cross-dialect acoustic modeling from speech to Pinyin, followed by Pinyin-to-Chinese-character decoding with a language model. The study focuses primarily on Mandarin dialects while exploring the feasibility of extending the framework to more distant, low-resource varieties (Cantonese and Shanghainese). A shared Pinyin intermediate layer enables unified modeling of 11 language varieties and allows a systematic investigation of cross-dialect feature sharing in the Pinyin representation space. The main contribution is efficiency and unified modeling across dialects rather than state-of-the-art performance on any single dialect. Experimental results show that the proposed Pinyin-based two-stage system achieves an average Chinese character error rate (CER) of 20.96% across all 11 dialects, compared with 22.97% for a direct Chinese character baseline, an average relative improvement of 11.03%. While performance on Mandarin dialects is practically usable, performance on the non-Mandarin varieties remains preliminary, underscoring the need for dialect-specific adaptation and larger training corpora in future work.
On the AISHELL-1 standard Mandarin dataset, the first-stage acoustic model attains a Pinyin CER of 1.57% (Pinyin WER of 3.42%), surpassing previously reported state-of-the-art methods trained on AISHELL-1 under comparable data conditions. Furthermore, analysis of the Pinyin intermediate representation reveals systematic phonetic correspondences across dialects at the level of basic pronunciation units. These findings provide empirical evidence in support of unified and generalizable multi-dialect speech recognition architectures, particularly for the Mandarin dialect groups.
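For readers who want to reproduce the kind of figures quoted above, the following is a minimal sketch of how an error rate and a relative improvement are commonly computed. It assumes the conventional definition of error rate as Levenshtein edit distance divided by reference length; the function names `edit_distance`, `error_rate`, and `relative_improvement` are illustrative and are not taken from the paper, and the exact token unit behind each reported metric (Chinese characters, Pinyin letters, or Pinyin syllables) is an assumption left to the caller.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length (assumed convention)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline, system):
    """Relative error reduction of `system` with respect to `baseline`."""
    return (baseline - system) / baseline


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(error_rate("今天天气好", "今天天汽好"))
    # Relative reduction computed directly from the two averaged CERs.
    print(f"{relative_improvement(22.97, 20.96):.2%}")
```

Applied directly to the two averaged CERs (22.97% and 20.96%), the relative-reduction formula gives roughly 8.75%. The reported 11.03% is therefore presumably the mean of the per-dialect relative improvements rather than the relative change between the two averages; this reading is an assumption, since the abstract does not state how the average was taken.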