FaceXHuBERT: Text-less Speech-driven E(x)pressive 3D Facial Animation Synthesis Using Self-supervised Speech Representation Learning
2023 · Kazi Injamamul Haque, Zerrin Yumak
Abstract
This paper presents FaceXHuBERT, a text-less speech-driven 3D facial animation generation method that captures personalized and subtle cues in speech (e.g. identity, emotion and hesitation). It is also very robust to background noise and can handle audio recorded in a variety of situations (e.g. multiple people speaking). Recent approaches employ end-to-end deep learning that takes both audio and text as input to generate facial animation for the whole face. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck. The resulting animations still have issues regarding accurate lip-synching, expressivity, person-specific information and generalizability. We effectively employ the self-supervised pretrained HuBERT model in the training process, which allows us to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Additionally, guiding the training with a binary emotion condition
Related papers
- FaceDiffuser: Speech-driven 3D Facial Animation Synthesis Using Diffusion (2023) 13.79
- CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation (2024) 3.58
- Probabilistic Speech-driven 3D Facial Motion Synthesis: New Benchmarks, Methods, And Applications (2023) 9.23
- Audio2Face: Generating Speech/face Animation From Single Audio With Attention-based Bidirectional LSTM Networks (2019) 12.10
- ProbTalk3D: Non-deterministic Emotion Controllable Speech-driven 3D Facial Animation Synthesis Using VQ-VAE (2024) 11.53
- ESARM: 3D Emotional Speech-to-animation Via Reward Model From Automatically-ranked Demonstrations (2024) 0.00
- FaceSpeak: Expressive And High-quality Speech Synthesis From Human Portraits Of Different Styles (2025) 0.00
- Fast-HuBERT: An Efficient Training Framework For Self-supervised Speech Representation Learning (2023) 0.00