Unsupervised Audiovisual Synthesis Via Exemplar Autoencoders
2020 · Kangle Deng, Aayush Bansal, Deva Ramanan
Abstract
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target exemplar speech. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers and styles using only 3 minutes of target audio-video data, without requiring any training data for the input speaker. To do so, we learn audiovisual bottleneck representations that capture the structured linguistic content of speech. We outperform prior approaches on both audio and video synthesis, and provide extensive qualitative analysis on our project page: https://www.cs.cmu.edu/~exemplar-ae/.
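The abstract's claim that an autoencoder projects out-of-sample data onto the training distribution can be illustrated with a toy linear analogue. The sketch below is not the authors' model, which operates on audio and video with deep networks; it assumes a closed-form linear autoencoder (equivalent to PCA) fitted with NumPy on hypothetical 2-D points, and all function and variable names are illustrative.

```python
import numpy as np


def fit_linear_autoencoder(train, bottleneck_dim):
    """Fit a linear autoencoder in closed form (equivalent to PCA)."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions of the centered training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:bottleneck_dim]
    return mean, basis


def reconstruct(x, mean, basis):
    """Encode x into the bottleneck, then decode back to input space."""
    code = (x - mean) @ basis.T   # encoder: project onto the bottleneck
    return mean + code @ basis    # decoder: map back to input space


# Hypothetical training set: points lying on the line y = 2x.
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=(200, 1))
train = np.hstack([t, 2.0 * t])

mean, basis = fit_linear_autoencoder(train, bottleneck_dim=1)

# An out-of-sample input that does not lie on the training line.
out_of_sample = np.array([1.0, 0.0])
projected = reconstruct(out_of_sample, mean, basis)
print(projected)
```

In this toy setting the reconstruction lands on the line spanned by the training data, which mirrors, in a much simpler form, how an autoencoder trained on one target speaker would map any input toward that speaker's data.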
Related papers
- Diffv2s: Diffusion-based Video-to-speech Synthesis With Vision-guided Speaker Embedding (2023)
- Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision (2020)
- Audio-visual Speech Enhancement Using Conditional Variational Auto-encoders (2019)
- Audio-visual Speech Codecs: Rethinking Audio-visual Speech Enhancement By Re-synthesis (2022)
- Robust Unsupervised Audio-visual Speech Enhancement Using A Mixture Of Variational Autoencoders (2019)
- Enhanced Exemplar Autoencoder With Cycle Consistency Loss In Any-to-one Voice Conversion (2022)
- Video-to-audio Generation With Hidden Alignment (2024)
- Autoencoder Based Architecture For Fast & Real Time Audio Style Transfer (2018)