Seeing Your Speech Style: A Novel Zero-shot Identity-disentanglement Face-based Voice Conversion
2024 · Yan Rong, Li Liu
Abstract
Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) difficulty obtaining facial embeddings that are well aligned with the speaker's voice identity information, and (2) inadequate decoupling of content and speaker identity information in the audio input. To address these issues, we present Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), a novel FVC method. Specifically, we propose an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion. In addition, unlike prior work, our method accepts either audio or text input, offering controllable speech generation with adjustable emotional tone and speed. Extensive e
Related papers
- Face-driven Zero-shot Voice Conversion With Memory-based Face-voice Alignment (2023) 5.84
- ZSVC: Zero-shot Style Voice Conversion With Disentangled Latent Diffusion Models And Adversarial Training (2025) 0.00
- SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System For Both Human Beings And Machines (2021) 8.09
- Disentanglement Of Emotional Style And Speaker Identity For Expressive Voice Conversion (2021) 10.97
- Beyond Voice Identity Conversion: Manipulating Voice Attributes By Adversarial Learning Of Structured Disentangled Representations (2021) 0.00
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022) 10.97
- Improving Zero-shot Voice Style Transfer Via Disentangled Representation Learning (2021) 0.00
- Zero-shot Personalized Lip-to-speech Synthesis With Face Image Based Voice Control (2023) 5.84