Zero-shot Voice Conversion Via Content-aware Timbre Ensemble And Conditional Flow Matching
2024 · Yu Pan, Yuguang Yang, Jixun Yao, et al.
Abstract
Despite recent advances in zero-shot voice conversion (VC), achieving speaker similarity and naturalness comparable to ground-truth recordings remains a significant challenge. In this letter, we propose CTEFM-VC, a zero-shot VC framework that integrates content-aware timbre ensemble modeling with conditional flow matching. Specifically, CTEFM-VC decouples utterances into content and timbre representations and leverages a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. To enhance its timbre modeling capability and the naturalness of the generated speech, we first introduce a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the effective utilization of source content and target timbre elements through a cross-attention module. Furthermore, a structural similarity-based timbre loss is presented to jointly train CTEFM-VC end-to-end. Experiments show that CTEFM-VC consistently achi…
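The abstract says a conditional flow matching model reconstructs the Mel-spectrogram but does not give its formulation. The sketch below assumes the common optimal-transport conditional path, x_t = (1 - (1 - sigma_min) t) x_0 + t x_1 with target velocity u_t = x_1 - (1 - sigma_min) x_0, an MSE training objective, and a fixed-step Euler sampler. The `oracle_field` function is a hypothetical stand-in for the learned network, which in CTEFM-VC would be conditioned on content and timbre features; none of these names come from the paper.

```python
import numpy as np


def cfm_path(x0, x1, t, sigma_min=1e-4):
    """Point on the (assumed) OT conditional path and its regression target.

    x0: noise sample, x1: target Mel-spectrogram, t: scalar in [0, 1].
    """
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    ut = x1 - (1.0 - sigma_min) * x0
    return xt, ut


def cfm_loss(v_pred, ut):
    """Mean-squared error between predicted and target velocities."""
    return float(np.mean((v_pred - ut) ** 2))


def euler_sample(vector_field, x0, cond, n_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * vector_field(x, i * dt, cond)
    return x


def oracle_field(x, t, cond):
    """Hypothetical stand-in for the network: straight-line velocity to cond."""
    return (cond - x) / (1.0 - t)


rng = np.random.default_rng(0)
mel_target = rng.standard_normal((80, 50))  # 80 Mel bins x 50 frames (illustrative)
noise = rng.standard_normal((80, 50))
xt, ut = cfm_path(noise, mel_target, t=0.3, sigma_min=0.0)
mel_hat = euler_sample(oracle_field, noise, mel_target, n_steps=10)
```

With a perfect velocity field the straight path is integrated exactly, so `mel_hat` recovers `mel_target`; a trained network only approximates this field, which is why the MSE objective in `cfm_loss` is minimized during training.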
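The abstract describes adaptively integrating several speaker verification (SV) embeddings and letting source content attend to target timbre through cross-attention, without specifying either mechanism. A minimal sketch, assuming each SV embedding is linearly projected to a shared size d, fused with softmax gate weights, and used as key/value tokens for scaled dot-product attention from the content frames. The projection matrices, gate logits, embedding sizes, and residual connection are all hypothetical illustration choices.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def ensemble_timbre(embeddings, projections, gate_logits):
    """Project K SV embeddings to size d and fuse them with softmax weights.

    embeddings: list of K 1-D arrays (dimensions may differ per SV model).
    projections: list of K (d, d_k) matrices (hypothetical learned weights).
    gate_logits: (K,) logits for the adaptive fusion weights (hypothetical).
    """
    tokens = np.stack([w @ e for w, e in zip(projections, embeddings)])  # (K, d)
    weights = softmax(gate_logits)                                       # (K,)
    return weights @ tokens, tokens


def cross_attention(content, timbre_tokens, wq, wk, wv):
    """Content frames (queries) attend to timbre tokens (keys and values)."""
    q = content @ wq                          # (T, d)
    k = timbre_tokens @ wk                    # (K, d)
    v = timbre_tokens @ wv                    # (K, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, K)
    attn = softmax(scores, axis=-1)
    return content + attn @ v, attn           # residual connection


rng = np.random.default_rng(0)
d, T = 16, 40
sv_dims = [192, 256, 512]  # illustrative sizes for three SV encoders
embs = [rng.standard_normal(n) for n in sv_dims]
projs = [rng.standard_normal((d, n)) / np.sqrt(n) for n in sv_dims]
fused, tokens = ensemble_timbre(embs, projs, gate_logits=np.zeros(3))
content = rng.standard_normal((T, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out, attn = cross_attention(content, tokens, wq, wk, wv)
```

Here the fused vector could serve as a global timbre condition, while the per-encoder tokens let each content frame weight the SV models differently; whether CTEFM-VC uses either in this exact way is not stated in the abstract.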
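The structural similarity-based timbre loss is named but not defined in the abstract. A minimal sketch, assuming a global (unwindowed) SSIM between two timbre representations, for example speaker embeddings of the generated and the target speech, with the usual constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for dynamic range L = 1, and the loss taken as 1 - SSIM. The function names and the choice of inputs are hypothetical.

```python
import numpy as np


def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """SSIM computed over the whole array (no sliding window)."""
    mx, my = x.mean(), y.mean()
    vx = np.mean((x - mx) ** 2)
    vy = np.mean((y - my) ** 2)
    cov = np.mean((x - mx) * (y - my))
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)


def ssim_timbre_loss(emb_generated, emb_reference):
    """1 - SSIM: zero for identical inputs, larger as their structure diverges."""
    return 1.0 - global_ssim(emb_generated, emb_reference)


rng = np.random.default_rng(0)
ref = rng.standard_normal(256)                 # e.g. embedding of target speech
gen = ref + 0.5 * rng.standard_normal(256)     # e.g. embedding of converted speech
loss_same = ssim_timbre_loss(ref, ref)         # close to 0.0
loss_noisy = ssim_timbre_loss(gen, ref)        # positive
```

Since SSIM lies in [-1, 1], this loss lies in [0, 2]; unlike a plain cosine distance, it also penalizes mismatched means and variances of the two representations.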
Related papers
- Takin-VC: Expressive Zero-shot Voice Conversion Via Adaptive Hybrid Content Encoding And Enhanced Timbre Modeling (2024)
- StableVC: Style Controllable Zero-shot Voice Conversion With Conditional Flow Matching (2024)
- Enhancing Expressive Voice Conversion With Discrete Pitch-conditioned Flow Matching Model (2025)
- SEF-VC: Speaker Embedding Free Zero-shot Voice Conversion With Cross Attention (2023)
- CycleFlow: Leveraging Cycle Consistency In Flow Matching For Speaker Style Adaptation (2025)
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021)
- Disentangling The Prosody And Semantic Information With Pre-trained Model For In-context Learning Based Zero-shot Voice Conversion (2024)
- CoDiff-VC: A Codec-assisted Diffusion Model For Zero-shot Voice Conversion (2024)