Learning Disentangled Speech Representations With Contrastive Learning And Time-invariant Retrieval
2024 · Yimin Deng, Huaizhen Tang, Xulong Zhang, et al.
Abstract
Voice conversion refers to transferring speaker identity while preserving content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from the input audio can represent content well. Besides, speaker-style modeling with pre-trained models makes the process more complex. To tackle these issues, we introduce a new method named "CTVC", which uses disentangled speech representations with contrastive learning and time-invariant retrieval. Specifically, a similarity-based compression module is used to tighten the connection between frame-level hidden features and phoneme-level linguistic information. Additionally, a time-invariant retrieval is proposed for timbre extraction based on multiple segmentations and mutual information. Experimental results demonstrate that "CTVC" outperforms previous studies and improves the sound quality and similarity of conver
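The abstract does not spell out how the similarity-based compression module works. As a rough illustration only, and not the authors' implementation, the sketch below merges runs of consecutive frame-level vectors whose cosine similarity to the running segment mean reaches a threshold, so that each merged vector loosely stands in for one phoneme-level unit. The function names, the `threshold` parameter, and the averaging rule are all assumptions made for this example.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compress_frames(frames: List[Vector], threshold: float = 0.9) -> List[Vector]:
    """Merge runs of consecutive similar frames into one averaged vector each.

    Hypothetical stand-in for a similarity-based compression step: a frame
    joins the current segment when its cosine similarity to the segment mean
    is at least `threshold`; otherwise it starts a new segment.
    """
    segments: List[Vector] = []
    current_sum: Vector = []
    count = 0
    for frame in frames:
        if count > 0:
            mean = [s / count for s in current_sum]
            if cosine_similarity(mean, frame) >= threshold:
                # Similar enough: fold the frame into the current segment.
                current_sum = [s + x for s, x in zip(current_sum, frame)]
                count += 1
                continue
            # Dissimilar: close the current segment before starting a new one.
            segments.append(mean)
        current_sum = list(frame)
        count = 1
    if count > 0:
        segments.append([s / count for s in current_sum])
    return segments


if __name__ == "__main__":
    # Five toy frames that fall into two runs of near-identical vectors.
    toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
    print(compress_frames(toy))
```

In this toy setting, the five frames collapse into two segment vectors, one per run. A learned module would presumably operate on encoder outputs and be trained jointly with the rest of the model, which this sketch does not attempt.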
Related papers
- Speech Representation Disentanglement With Adversarial Mutual Information Learning For One-shot Voice Conversion (2022) · 11.08
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022) · 10.97
- Improving Zero-shot Voice Style Transfer Via Disentangled Representation Learning (2021) · 0.00
- AVQVC: One-shot Voice Conversion By Vector Quantization With Applying Contrastive Learning (2022) · 12.40
- Learning In Your Voice: Non-parallel Voice Conversion Based On Speaker Consistency Loss (2020) · 0.00
- VQMIVC: Vector Quantization And Mutual Information-based Unsupervised Speech Representation Disentanglement For One-shot Voice Conversion (2021) · 20.31
- Discrete Unit Based Masking For Improving Disentanglement In Voice Conversion (2024) · 0.00
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023) · 0.00