Cross-Modal Meta-Generalization for Silent Speech Recognition via Ultrasound Tongue Images

Abstract

Silent Speech Recognition (SSR) technologies use assistive devices to support human-computer interaction by decoding speech from non-acoustic biosignals generated during speech production. Current SSR technologies based on Ultrasound Tongue Images (UTIs) fail to address the domain-shift problem that arises from sensor displacement artifacts, articulatory patterns, and speaker deviations. Moreover, existing solutions seldom use cross-modal knowledge to enhance uni-modal generalization. This paper proposes the Cross-Modal Meta-Generalization (CMMG) framework, a novel Domain Generalization (DG) method that learns domain-invariant representations from synchronized audio and lip video encodings. We propose a multi-modal co-learning pipeline that randomly combines the tri-modal streams into a modality-agnostic feature space, constructing expanded domain-shift episodes from multi-modal source domains. In addition, we design several meta-tasks that classify various speech attributes (e.g., tongue height) to implement cross-modal semantic regularization. The model is then trained episodically, with each task or episode simulating a domain shift, to seek optimal generalization in unseen uni-modal domains. Experimental results across various cross-domain sets demonstrate that the proposed method outperforms both uni-modal baselines and existing domain-shift solutions. Additionally, we explore the coupling information of multi-modal heterogeneous features through both segmental and sub-segmental level meta-tasks, finding that the proposed CMMG effectively enhances the generalization and discriminative capability of uni-modal representations.
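
The episode construction described in the abstract can be illustrated with a minimal sketch. It is not the CMMG implementation: the speaker names, the rule for mixing modalities (a random non-empty subset), and the choice to hold out exactly one ultrasound-only domain per episode are all assumptions made for illustration, and the functions `sample_modality_mix` and `make_episode` are hypothetical.

```python
import random

# The three streams named in the abstract; the string labels are illustrative.
MODALITIES = ("ultrasound", "audio", "lip")


def sample_modality_mix(rng):
    """Pick a random non-empty subset of the modalities (assumed mixing rule)."""
    k = rng.randint(1, len(MODALITIES))
    return tuple(sorted(rng.sample(MODALITIES, k)))


def make_episode(source_domains, rng):
    """Split source domains into meta-train and meta-test to simulate a shift.

    Meta-train domains receive a random modality mix. The held-out meta-test
    domain stays ultrasound-only, mimicking an unseen uni-modal target.
    """
    domains = list(source_domains)
    rng.shuffle(domains)
    held_out, rest = domains[0], domains[1:]
    meta_train = [(d, sample_modality_mix(rng)) for d in rest]
    meta_test = (held_out, ("ultrasound",))
    return meta_train, meta_test


rng = random.Random(0)
speakers = ["spk1", "spk2", "spk3", "spk4"]  # hypothetical source domains
episodes = [make_episode(speakers, rng) for _ in range(5)]
```

Each element of `episodes` pairs a list of (domain, modality mix) meta-train entries with one ultrasound-only meta-test domain. In an episodic training loop, a model would be updated on the meta-train entries and evaluated on the held-out domain within the same episode.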
