Speaker-independent Speech-driven Visual Speech Synthesis Using Domain-adapted Acoustic Models
2019 · Ahmed Hussen Abdelaziz, Barry-John Theobald, Justin Binder, et al.
Abstract
Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). However, a limitation is the lack of synchronized audio, video, and depth data required to reliably train the DNNs, especially for speaker-independent models. In this paper, we investigate adapting an automatic speech recognition (ASR) acoustic model (AM) for the visual speech synthesis problem. We train the AM on ten thousand hours of audio-only data. The AM is then adapted to the visual speech synthesis domain using ninety hours of synchronized audio-visual speech. Using a subjective assessment test, we compared the performance of the AM-initialized DNN to one with a random initialization. The results show that viewers significantly prefer animations generated from the AM-initialized DNN over the ones generated using the randomly init
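The adaptation idea in the abstract (reuse a network trained for ASR as the starting point for a network that predicts animation controls) can be sketched as follows. This is a minimal illustration, not the authors' architecture: the layer sizes, the plain feed-forward structure, and the helper names `init_mlp`, `forward`, and `adapt_from_acoustic_model` are all assumptions, and the "acoustic model" here is a randomly initialized stand-in rather than a model trained on real audio.

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    """Return a list of (W, b) pairs with small random weights."""
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def forward(params, x):
    """ReLU hidden layers followed by a linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

def adapt_from_acoustic_model(am_params, n_controls, rng):
    """Copy the AM hidden layers and attach a fresh regression head.

    The AM's classification output layer is dropped; only the hidden
    layers are reused as the initialization (hypothetical scheme).
    """
    hidden = [(W.copy(), b.copy()) for W, b in am_params[:-1]]
    n_hidden = hidden[-1][0].shape[1]
    head = (rng.normal(0.0, 0.1, size=(n_hidden, n_controls)), np.zeros(n_controls))
    return hidden + [head]

rng = np.random.default_rng(0)

# Hypothetical sizes: acoustic features per frame, hidden width,
# ASR output classes, and animation controls per frame.
N_FEATS, N_HIDDEN, N_CLASSES, N_CONTROLS = 40, 64, 100, 32

# Stand-in for an acoustic model pretrained on audio-only data.
am = init_mlp([N_FEATS, N_HIDDEN, N_HIDDEN, N_CLASSES], rng)

# AM-initialized synthesis network versus a randomly initialized baseline.
synth = adapt_from_acoustic_model(am, N_CONTROLS, rng)
baseline = init_mlp([N_FEATS, N_HIDDEN, N_HIDDEN, N_CONTROLS], rng)

frames = rng.normal(size=(5, N_FEATS))   # five frames of fake acoustic features
controls = forward(synth, frames)        # one row of animation controls per frame
```

Both `synth` and `baseline` would then be trained on the synchronized audio-visual data with a regression loss; that fine-tuning step is omitted here. The only difference between the two networks is where their hidden-layer weights start from, which is the comparison the abstract describes.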
Related papers
- AVFormer: Injecting Vision Into Frozen Speech Models For Zero-shot AV-ASR (2023)
- SAiD: Speech-driven Blendshape Facial Animation With Diffusion (2023)
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)
- Video-driven Speech Reconstruction Using Generative Adversarial Networks (2019)
- Prompt Tuning Of Deep Neural Networks For Speaker-adaptive Visual Speech Recognition (2023)
- Audio2Face: Generating Speech/face Animation From Single Audio With Attention-based Bidirectional LSTM Networks (2019)
- A Highly Adaptive Acoustic Model For Accurate Multi-dialect Speech Recognition (2022)
- SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision (2023)