Speech Reconstruction From Silent Tongue And Lip Articulation By Pseudo Target Generation And Domain Adversarial Training
2023 · Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
Abstract
This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be referred to as a silent speech interface. We propose a method built on pseudo target generation and domain adversarial training, with an iterative training strategy, to improve the intelligibility and naturalness of the speech recovered from silent tongue and lip articulation. Experiments show that the proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in silent speaking mode compared to the baseline TaLNet model. When an automatic speech recognition (ASR) model is used to measure intelligibility, the word error rate (WER) of the proposed method decreases by over 15% compared to the baseline. In addition,
Related papers
- Improved Speech Reconstruction From Silent Video (2017) 13.34
- Let There Be Sound: Reconstructing High Quality Speech From Silent Videos (2023) 6.34
- Speech Synthesis From Text And Ultrasound Tongue Image-based Articulatory Input (2021) 0.00
- Silent Versus Modal Multi-speaker Speech Recognition From Ultrasound And Video (2021) 6.77
- LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021) 11.39
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023) 3.89
- Extending Text-to-speech Synthesis With Articulatory Movement Prediction Using Ultrasound Tongue Imaging (2021) 3.58
- Whispered-to-voiced Alaryngeal Speech Conversion With Generative Adversarial Networks (2018) 9.41