ESARM: 3D Emotional Speech-to-animation Via Reward Model From Automatically-ranked Demonstrations
2024 · Xulong Zhang, Xiaoyang Qu, Haoxiang Shi, et al.
Abstract
This paper proposes a novel 3D speech-to-animation (STA) generation framework designed to address the shortcomings of existing models in producing diverse and emotionally resonant animations. Current STA models often generate animations that lack emotional depth and variety, failing to align with human expectations. To overcome these limitations, we introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions through a cross-coupling training approach. Additionally, we develop a training methodology that leverages automatic quality evaluation of generated facial animations to guide the reinforcement learning process. This methodology encourages the STA model to explore a broader range of possibilities, resulting in the generation of diverse and emotionally expressive facial animations of superior quality. We conduct extensive empirical experiments on a benchmark dataset, and the results validate the effec
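The abstract does not say how the reward model is fitted to the automatically ranked animations. As a minimal sketch, assuming a pairwise (Bradley-Terry style) ranking objective, which is a common choice for learning rewards from ranked demonstrations and not necessarily the authors' loss, the idea can be illustrated as follows. The function names and the scalar scores are hypothetical.

```python
import math


def pairwise_ranking_loss(reward_better: float, reward_worse: float) -> float:
    """Assumed objective: -log(sigmoid(reward_better - reward_worse)).

    The loss is small when the reward model scores the higher-ranked
    animation above the lower-ranked one, and large otherwise.
    """
    margin = reward_better - reward_worse
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_ranking_loss(rewards_best_first: list[float]) -> float:
    """Average the pairwise loss over every pair in a ranked list.

    rewards_best_first holds the reward model's scores for animations
    that an automatic quality metric has ordered from best to worst.
    """
    n = len(rewards_best_first)
    if n < 2:
        raise ValueError("need at least two ranked items")
    losses = [
        pairwise_ranking_loss(rewards_best_first[i], rewards_best_first[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Scores that agree with the ranking give a lower loss than scores that contradict it.
    print(mean_ranking_loss([3.0, 2.0, 1.0]))
    print(mean_ranking_loss([1.0, 2.0, 3.0]))
```

Under this assumption, a reward model trained to minimize the loss would then supply the scalar signal that guides the reinforcement learning stage described in the abstract.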
Related papers
- CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation (2024)
- ProbTalk3D: Non-deterministic Emotion Controllable Speech-driven 3D Facial Animation Synthesis Using VQ-VAE (2024)
- EmotionGesture: Audio-driven Diverse Emotional Co-speech 3D Gesture Generation (2023)
- FaceXHuBERT: Text-less Speech-driven E(x)pressive 3D Facial Animation Synthesis Using Self-supervised Speech Representation Learning (2023)
- EmoGene: Audio-driven Emotional 3D Talking-head Generation (2024)
- EmoSphere-TTS: Emotional Style And Intensity Modeling Via Spherical Emotion Vector For Controllable Emotional Text-to-speech (2024)
- Reinforcement Learning For Emotional Text-to-speech Synthesis With Improved Emotion Discriminability (2021)
- Controllable Expressive 3D Facial Animation Via Diffusion In A Unified Multimodal Space (2025)