Takin-VC: Expressive Zero-shot Voice Conversion Via Adaptive Hybrid Content Encoding And Enhanced Timbre Modeling
2024 · Yuguang Yang, Yu Pan, Jixun Yao, et al.
Abstract
Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the source timbre into an arbitrary unseen speaker while preserving the original content and expressive qualities. Despite recent progress in zero-shot VC, there remains considerable potential for improvements in speaker similarity and speech naturalness. Moreover, existing zero-shot VC systems struggle to fully reproduce paralinguistic information in highly expressive speech, such as breathing, crying, and emotional nuances, limiting their practical applicability. To address these issues, we propose Takin-VC, a novel expressive zero-shot VC framework via adaptive hybrid content encoding and memory-augmented context-aware timbre modeling. Specifically, we introduce an innovative hybrid content encoder that incorporates an adaptive fusion module, capable of effectively integrating quantized features of the pre-trained WavLM and HybridFormer in an implicit manner, so as to extract precise
Related papers
- Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion With Progressive Constraints In A Dual-mode Training Strategy (2024) 3.58
- Zero-shot Voice Conversion Via Content-aware Timbre Ensemble And Conditional Flow Matching (2024) 0.00
- CoDiff-VC: A Codec-assisted Diffusion Model For Zero-shot Voice Conversion (2024) 0.00
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023) 0.00
- MaskVCT: Masked Voice Codec Transformer For Zero-shot Voice Conversion With Increased Controllability Via Multiple Guidances (2025) 0.00
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) 6.34
- Expressive-VC: Highly Expressive Voice Conversion With Attention Fusion Of Bottleneck And Perturbation Features (2022) 9.03
- SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System For Both Human Beings And Machines (2021) 8.09