QR-VC: Leveraging Quantization Residuals For Linear Disentanglement In Zero-shot Voice Conversion
2024 Β· Youngjun Sim, Jinsung Yoon, Wooyeol Jeong, et al.
Abstract
Zero-shot voice conversion is a technique that alters the speaker identity of an input speech to match a target speaker using only a single reference utterance, without requiring additional training. Recent approaches extensively utilize self-supervised learning features with K-means quantization to extract high-quality content representations while removing speaker identity. However, this quantization process also eliminates fine-grained phonetic and prosodic variations, degrading intelligibility and prosody preservation. While prior works have primarily focused on quantized representations, quantization residuals remain underutilized and deserve further exploration. In this paper, we introduce a novel approach that fully utilizes quantization residuals by leveraging temporal properties of speech components. This facilitates the disentanglement of speaker identity and the recovery of phonetic and prosodic details lost during quantization. By applying only K-means quantization and line
Authors
(none)
Tags
Stats
Related papers
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)10.97
- VQMIVC: Vector Quantization And Mutual Information-based Unsupervised Speech Representation Disentanglement For One-shot Voice Conversion (2021)20.31
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021)6.34
- AVQVC: One-shot Voice Conversion By Vector Quantization With Applying Contrastive Learning (2022)12.40
- VQVC+: One-shot Voice Conversion By Vector Quantization And U-net Architecture (2020)13.34
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024)4.52
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)0.00
- Vec-tok-vc+: Residual-enhanced Robust Zero-shot Voice Conversion With Progressive Constraints In A Dual-mode Training Strategy (2024)3.58