Silent Versus Modal Multi-speaker Speech Recognition From Ultrasound And Video
2021 · Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, et al.
Abstract
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech.
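The convex-hull estimate of articulatory space mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' exact procedure, so everything below is an assumption: tongue splines are taken to be lists of 2D (x, y) points, all frames are pooled into one point set, and the size of the space is measured as the area of the hull. The function names (`convex_hull`, `polygon_area`, `articulatory_space_area`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices: Sequence[Point]) -> float:
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def articulatory_space_area(splines: Sequence[Sequence[Point]]) -> float:
    """Assumed measure: hull area of all spline points pooled across frames."""
    pooled = [(float(x), float(y)) for spline in splines for (x, y) in spline]
    return polygon_area(convex_hull(pooled))
```

Under these assumptions, comparing `articulatory_space_area` for a speaker's silent and modal utterances would give one number per speaking mode; a smaller value for silent speech would correspond to the smaller articulatory space reported in the abstract.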
Related papers
- Speech Reconstruction From Silent Tongue And Lip Articulation By Pseudo Target Generation And Domain Adversarial Training (2023)
- Speech Synthesis From Text And Ultrasound Tongue Image-based Articulatory Input (2021)
- Let There Be Sound: Reconstructing High Quality Speech From Silent Videos (2023)
- Incorporating Ultrasound Tongue Images For Audio-visual Speech Enhancement (2023)
- Lipper: Synthesizing Thy Speech Using Multi-view Lipreading (2019)
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023)
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018)
- Multimodal Audio-textual Architecture For Robust Spoken Language Understanding (2023)