Fake It To Make It: Using Synthetic Data To Remedy The Data Shortage In Joint Multimodal Speech-and-gesture Synthesis
2024 · Shivam Mehta, Anna Deichler, Jim O'Regan, et al.
Abstract
Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the
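The data-creation step described in the abstract can be pictured as a small pipeline: text goes through a unimodal speech model and a unimodal gesture model, and the outputs are bundled into parallel (text, audio, motion) examples for pre-training a joint model. The sketch below is only an illustration under stated assumptions. `ParallelExample`, `build_synthetic_corpus`, `toy_tts`, and `toy_gesture` are hypothetical names, the toy functions stand in for real pretrained models, and conditioning the gesture model on both text and audio is an assumption that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ParallelExample:
    """One synthetic training item with all three modalities aligned."""
    text: str
    audio: List[float]         # stand-in for a waveform or acoustic features
    motion: List[List[float]]  # stand-in for per-frame 3D pose vectors
    synthetic: bool = True     # marks the item as machine-generated


def build_synthetic_corpus(
    texts: Sequence[str],
    tts: Callable[[str], List[float]],
    gesture: Callable[[str, List[float]], List[List[float]]],
) -> List[ParallelExample]:
    """Run each text through two unimodal models and pair up the outputs."""
    corpus = []
    for text in texts:
        audio = tts(text)              # unimodal speech synthesis
        motion = gesture(text, audio)  # unimodal gesture synthesis (assumed inputs)
        corpus.append(ParallelExample(text, audio, motion))
    return corpus


# Toy stand-ins so the sketch runs; real systems would call pretrained models.
def toy_tts(text: str) -> List[float]:
    return [float(len(word)) for word in text.split()]


def toy_gesture(text: str, audio: List[float]) -> List[List[float]]:
    return [[value, 0.0, 0.0] for value in audio]


corpus = build_synthetic_corpus(
    ["hello there", "nice to meet you"], toy_tts, toy_gesture
)
```

A joint speech-and-gesture model would then be pre-trained on `corpus`; that training step is outside the scope of this sketch.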
Related papers
- Instruction Data Generation And Unsupervised Adaptation For Speech Language Models (2024)
- MMAudio: Taming Multimodal Joint Training For High-quality Video-to-audio Synthesis (2024)
- Probabilistic Speech-driven 3D Facial Motion Synthesis: New Benchmarks, Methods, And Applications (2023)
- Efficient Neural Speech Synthesis For Low-resource Languages Through Multilingual Modeling (2020)
- Unified Speech And Gesture Synthesis Using Flow Matching (2023)
- Property-aware Multi-speaker Data Simulation: A Probabilistic Modelling Technique For Synthetic Data Generation (2023)
- Generating Data With Text-to-speech And Large-language Models For Conversational Speech Recognition (2024)
- Diffusion-based Co-speech Gesture Generation Using Joint Text And Audio Representation (2023)