Abstract
Intelligent speech recognition algorithms provide a technological foundation for oral training in educational settings. To address the complex noise types and wide accent variation found in vocational English speech across multiple scenarios, this study constructs a cross-scenario speech recognition model based on transfer learning. The model uses a pre-trained Wav2Vec2.0 as its acoustic encoder and transfers features through hierarchical layer freezing and fine-tuning of the remaining trainable layers. It further incorporates maximum mean discrepancy (MMD) feature-domain alignment and Conformer temporal modeling to improve stability across environments. Experiments show that the model reduces the average word error rate (WER) from 19.7% to 13.6% across classroom, dormitory, and corridor scenarios, and from 24.7% to 17.6% in unseen scenarios, a relative WER reduction of about 28%. These results indicate that the model generalizes well and has practical value.
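The improvement figure is easiest to interpret as a relative WER reduction, that is, the drop in WER divided by the baseline WER. The short sketch below recomputes that quantity from the four WER values reported in the abstract; the function name `relative_reduction` is a hypothetical helper introduced here for illustration, and reading the figure as a relative (rather than absolute) reduction is an assumption.

```python
def relative_reduction(baseline_wer: float, model_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


# WER values (in percent) as reported in the abstract.
seen = relative_reduction(19.7, 13.6)    # classroom, dormitory, corridor
unseen = relative_reduction(24.7, 17.6)  # unseen scenarios

print(f"seen scenarios:   {seen:.1%}")
print(f"unseen scenarios: {unseen:.1%}")
```

Under this reading, the reported figure of about 28% appears to correspond most closely to the unseen-scenario result, while the reduction on the three seen scenarios comes out somewhat larger.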