A Comparative Study Of Self-supervised Speech Representation Based Voice Conversion
2022 · Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, et al.
Abstract
We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In recognition-synthesis VC, S3Rs are attractive because they can potentially replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC toolkit we previously developed, we provide a series of in-depth objective and subjective analyses under three VC settings: intra-/cross-lingual any-to-one (A2O) VC and any-to-any (A2A) VC, using the Voice Conversion Challenge 2020 (VCC2020) dataset. We investigate S3R-based VC from several aspects, including model type, multilinguality, and supervision. We also study the effect of a post-discretization process with k-means clustering and show how it improves results in the A2A setting. Finally, the comparison with state-of-the-art VC systems demonstrates the competitiveness of S3R-based VC and also sheds light
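The post-discretization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes frame-level features are plain lists of floats and replaces each frame with its nearest k-means centroid. The function names, the toy data, the cluster count, and the evenly spaced centroid initialization are all hypothetical choices made for illustration; the abstract does not state which features, cluster count, or k-means variant the study uses.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to the given frame."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(frame, centroids[c]))


def kmeans(frames, k, iterations=20):
    """Plain Lloyd's algorithm.

    Assumption: centroids are initialized from evenly spaced frames,
    which keeps this toy example deterministic. Real systems usually
    use k-means++ or random restarts.
    """
    centroids = [list(frames[i * len(frames) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for frame in frames:
            clusters[nearest(frame, centroids)].append(frame)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids


def discretize(frames, centroids):
    """Map each frame to a discrete unit id and its centroid vector."""
    ids = [nearest(frame, centroids) for frame in frames]
    return ids, [centroids[i] for i in ids]


# Toy 2-dimensional "features" forming two well-separated groups.
frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
centroids = kmeans(frames, k=2)
unit_ids, quantized = discretize(frames, centroids)
print(unit_ids)  # [0, 0, 1, 1]
```

In this sketch the quantized frames, rather than the continuous ones, would be passed on to the synthesis model, so that frames in the same cluster become indistinguishable to it.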
Related papers
- S3PRL-VC: Open-source Voice Conversion Framework With Self-supervised Speech Representations (2021) · 10.97
- S2VC: A Framework For Any-to-any Voice Conversion With Self-supervised Pretrained Representations (2021) · 12.25
- SelfVC: Voice Conversion With Iterative Refinement Using Self Transformations (2023) · 0.00
- A Comparative Study Of Voice Conversion Models With Large-scale Speech And Singing Data: The T13 Systems For The Singing Voice Conversion Challenge 2023 (2023) · 6.77
- A Comparison Of Discrete And Soft Speech Units For Improved Voice Conversion (2021) · 20.25
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) · 6.34
- Measuring The Effectiveness Of Voice Conversion On Speaker Identification And Automatic Speech Recognition Systems (2019) · 0.00
- Adversarial Speaker Disentanglement Using Unannotated External Data For Self-supervised Representation Based Voice Conversion (2023) · 6.34