Adversarial Speaker Disentanglement Using Unannotated External Data For Self-supervised Representation Based Voice Conversion
2023 · Xintao Zhao, Shuai Wang, Yang Chao, et al.
Abstract
Nowadays, recognition-synthesis-based methods are quite popular for voice conversion (VC). By introducing linguistic features with good disentangling properties, extracted from an automatic speech recognition (ASR) model, VC performance has achieved considerable breakthroughs. Recently, self-supervised learning (SSL) methods trained on large-scale unannotated speech corpora have been applied to downstream tasks that focus on content information, which makes them suitable for VC tasks. However, the large amount of speaker information in SSL representations significantly degrades the timbre similarity and quality of the converted speech. To address this problem, we propose a high-similarity any-to-one voice conversion method that takes SSL representations as input. We incorporate adversarial training mechanisms in the synthesis module using external unannotated corpora. Two auxiliary discriminators are trained to distinguish whether a sequence of mel-spectrograms has been converted by the
Related papers
- SelfVC: Voice Conversion With Iterative Refinement Using Self Transformations (2023) 0.00
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023) 0.00
- S2VC: A Framework For Any-to-any Voice Conversion With Self-supervised Pretrained Representations (2021) 12.25
- Non-parallel Sequence-to-sequence Voice Conversion With Disentangled Linguistic And Speaker Representations (2019) 14.02
- Recognition-synthesis Based Non-parallel Voice Conversion With Adversarial Learning (2020) 0.00
- Beyond Voice Identity Conversion: Manipulating Voice Attributes By Adversarial Learning Of Structured Disentangled Representations (2021) 0.00
- Cross-domain Voice Activity Detection With Self-supervised Representations (2022) 0.00
- A Comparative Study Of Self-supervised Speech Representation Based Voice Conversion (2022) 9.76