Abstract
This paper introduces a unified approach to multilingual visual speech recognition (VSR) that combines cross-modal phonetic modeling with large-scale language decoding to enable robust generalization across low-resource and previously unseen languages. At its core is a Cross-Modal Transcriber that encodes synchronized audio-visual speech inputs into a language-agnostic phoneme space via a fine-grained cross-attention mechanism. To bridge perception and language understanding, two decoding pathways are explored: (1) a modular configuration that maps phonetic sequences to text using a pretrained large language model (LLM), and (2) an end-to-end formulation in which fused visual features are projected into the LLM's embedding space via a lightweight adapter for direct transcription. Experimental evaluations on the mTEDx multilingual corpus show that the architecture surpasses state-of-the-art VSR models, reducing word error rate (WER) by up to 6% absolute across Latin-derived languages.
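For readers unfamiliar with the metric, the following minimal sketch illustrates how WER and an absolute WER difference are conventionally computed. It assumes whitespace tokenization and the standard word-level edit distance; the function names `word_error_rate` and `absolute_wer_reduction` are hypothetical and are not taken from the evaluation code used in this work.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are obtained by whitespace splitting, with no
    text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Absolute (not relative) difference between two WER values."""
    return baseline_wer - new_wer


if __name__ == "__main__":
    # Illustrative sentences only, not drawn from mTEDx.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")
    # Made-up WER values, used only to show the arithmetic.
    print(f"Absolute reduction: {absolute_wer_reduction(0.46, 0.40):.2f}")
```

Under this convention, a drop from a WER of 46% to 40% is a 6% absolute reduction, which corresponds to roughly a 13% relative reduction; the two should not be conflated when comparing systems.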