Transformer-S2A: Robust and Efficient Speech-to-Animation
2021 · Liyang Chen, Zhiyong Wu, Jun Ling, et al.
Abstract
We propose a novel, robust, and efficient Speech-to-Animation (S2A) approach for generating synchronized facial animation in human-computer interaction. Unlike conventional approaches, the proposed approach takes phonetic posteriorgrams (PPGs) of spoken phonemes as input to ensure cross-language and cross-speaker generalization, and introduces the corresponding prosody features (i.e., pitch and energy) to further enhance the expressiveness of the generated animation. A mixture-of-experts (MoE)-based Transformer is employed to better model contextual information while providing a significant improvement in computational efficiency. Experiments demonstrate the effectiveness of the proposed approach in both objective and subjective evaluations, with a 17x inference speedup over the state-of-the-art approach.
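The abstract names the ingredients (PPG input, pitch and energy as prosody features, an MoE-based Transformer) without giving details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes per-frame features are formed by concatenating PPG, pitch, and energy, and that the MoE layer uses top-1 routing, where each frame is processed by a single expert; this is one common way MoE layers reduce computation. All function names, shapes, and the routing rule are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def build_input(ppg, pitch, energy):
    """Concatenate per-frame PPG with prosody features (assumed layout).

    ppg:    (T, n_phonemes) phoneme posteriors per frame
    pitch:  (T,) per-frame pitch
    energy: (T,) per-frame energy
    returns (T, n_phonemes + 2)
    """
    return np.concatenate([ppg, pitch[:, None], energy[:, None]], axis=-1)


def moe_feed_forward(frames, gate_w, experts_w1, experts_w2):
    """Top-1 mixture-of-experts feed-forward layer (assumed routing rule).

    frames:     (T, d_model)
    gate_w:     (d_model, n_experts)
    experts_w1: (n_experts, d_model, d_hidden)
    experts_w2: (n_experts, d_hidden, d_model)
    returns (output of shape (T, d_model), chosen expert index per frame)
    """
    gate_probs = softmax(frames @ gate_w)   # (T, n_experts)
    chosen = gate_probs.argmax(axis=-1)     # one expert per frame
    out = np.zeros_like(frames)
    for e in range(gate_w.shape[1]):
        idx = np.nonzero(chosen == e)[0]
        if idx.size == 0:
            continue  # no frame was routed to this expert
        # Only the frames routed to expert e pass through its weights.
        hidden = np.maximum(frames[idx] @ experts_w1[e], 0.0)  # ReLU
        out[idx] = gate_probs[idx, e][:, None] * (hidden @ experts_w2[e])
    return out, chosen
```

With top-1 routing, the per-frame cost of the feed-forward layer stays that of a single expert regardless of how many experts exist. The 17x speedup reported in the abstract refers to the full system and is not something this sketch reproduces.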
Related papers
- Audio2face: Generating Speech/face Animation From Single Audio With Attention-based Bidirectional LSTM Networks (2019)
- Daspeech: Directed Acyclic Transformer For Fast And High-quality Speech-to-speech Translation (2023)
- S-transformer: Segment-transformer For Robust Neural Speech Synthesis (2020)
- Said: Speech-driven Blendshape Facial Animation With Diffusion (2023)
- Pmmtalk: Speech-driven 3D Facial Animation From Complementary Pseudo Multi-modal Features (2023)
- Diffspeaker: Speech-driven 3D Facial Animation With Diffusion Transformer (2024)
- Fastspeech: Fast, Robust And Controllable Text To Speech (2019)
- Recom: Realistic Co-speech Motion Generation With Recurrent Embedded Transformer (2025)