Audiovisual Speaker Conversion: Jointly And Simultaneously Transforming Facial Expression And Acoustic Characteristics
2018 · Fuming Fang, Xin Wang, Junichi Yamagishi, et al.
Abstract
An audiovisual speaker conversion method is presented for simultaneously transforming the facial expressions and voice of a source speaker into those of a target speaker. Transforming the facial and acoustic features together makes it possible for the converted voice and facial expressions to be highly correlated and for the generated target speaker to appear and sound natural. The method uses three neural networks: a conversion network that fuses and transforms the facial and acoustic features, a waveform generation network that produces the waveform from both the converted facial and acoustic features, and an image reconstruction network that outputs an RGB facial image, also based on both sets of converted features. Experiments using an emotional audiovisual database showed that the proposed method achieved significantly higher naturalness than a method that transformed the acoustic and facial features separately.
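The data flow through the three networks can be sketched roughly as follows. This is only an illustration of the wiring described in the abstract, not the authors' implementation: each network is replaced by a random affine map, fusion is assumed to be frame-wise concatenation, and every name and size (`N_FRAMES`, `ACOUSTIC_DIM`, `FACIAL_DIM`, `HOP`, `IMG_H`, `IMG_W`, `convert`) is a hypothetical choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not values from the paper.
N_FRAMES, ACOUSTIC_DIM, FACIAL_DIM = 5, 80, 20
HOP, IMG_H, IMG_W = 4, 8, 8
FUSED_DIM = ACOUSTIC_DIM + FACIAL_DIM


def linear(in_dim, out_dim):
    """Random affine map standing in for a trained neural network."""
    w = rng.standard_normal((in_dim, out_dim)) * 0.1
    b = np.zeros(out_dim)
    return lambda x: x @ w + b


# Placeholders for the three networks named in the abstract.
conversion_net = linear(FUSED_DIM, FUSED_DIM)    # source -> target features
waveform_net = linear(FUSED_DIM, HOP)            # features -> audio samples
image_net = linear(FUSED_DIM, IMG_H * IMG_W * 3) # features -> RGB pixels


def convert(src_acoustic, src_facial):
    """Jointly convert one utterance; inputs are (frames, dim) arrays."""
    # Assumption: fusion is a simple per-frame concatenation.
    fused = np.concatenate([src_acoustic, src_facial], axis=1)
    converted = conversion_net(fused)
    # Both output networks see the full converted (acoustic + facial) vector.
    waveform = np.tanh(waveform_net(converted)).reshape(-1)
    logits = image_net(converted)
    images = (1.0 / (1.0 + np.exp(-logits))).reshape(-1, IMG_H, IMG_W, 3)
    return converted, waveform, images


acoustic = rng.standard_normal((N_FRAMES, ACOUSTIC_DIM))
facial = rng.standard_normal((N_FRAMES, FACIAL_DIM))
converted, waveform, images = convert(acoustic, facial)
```

The point of the sketch is that the waveform and image generators both consume the same jointly converted feature vector, which is how the abstract describes keeping the voice and facial expression correlated.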
Related papers
- Expressive Voice Conversion: A Joint Framework For Speaker Identity And Emotional Style Transfer (2021) 9.03
- Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021) 0.00
- Audio-visual Speech Codecs: Rethinking Audio-visual Speech Enhancement By Re-synthesis (2022) 15.58
- Using Joint Training Speaker Encoder With Consistency Loss To Achieve Cross-lingual Voice Conversion And Expressive Voice Conversion (2023) 0.00
- Expressive-VC: Highly Expressive Voice Conversion With Attention Fusion Of Bottleneck And Perturbation Features (2022) 9.03
- Accent And Speaker Disentanglement In Many-to-many Voice Conversion (2020) 10.35
- LA-VocE: Low-SNR Audio-visual Speech Enhancement Using Neural Vocoders (2022) 0.00
- Assem-VC: Realistic Voice Conversion By Assembling Modern Speech Synthesis Techniques (2021) 11.64