S3PRL-VC: Open-source Voice Conversion Framework With Self-supervised Speech Representations
2021 · Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, et al.
Abstract
This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable for its potential to replace the expensive supervised representations adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting. We also compare not only different S3Rs but also the top systems in VCC2020 that use supervised representations. Systematic objective and subjective evaluations were conducted, and we show that S3R is comparable with VCC2020 top systems in the A2O setting in terms of similarity, and achieves state-of-the-art performance in S3R-based A2A VC. We believe the extensive analysis, as well as the toolkit itself, contribute to not on
Related papers
- A Comparative Study Of Self-supervised Speech Representation Based Voice Conversion (2022) 9.76
- S2VC: A Framework For Any-to-any Voice Conversion With Self-supervised Pretrained Representations (2021) 12.25
- SelfVC: Voice Conversion With Iterative Refinement Using Self Transformations (2023) 0.00
- ConvS2S-VC: Fully Convolutional Sequence-to-sequence Voice Conversion (2018) 12.68
- Assem-VC: Realistic Voice Conversion By Assembling Modern Speech Synthesis Techniques (2021) 11.64
- Voice Reenactment With F0 And Timing Constraints And Adversarial Learning Of Conversions (2021) 2.26
- A Comparative Study Of Voice Conversion Models With Large-scale Speech And Singing Data: The T13 Systems For The Singing Voice Conversion Challenge 2023 (2023) 6.77
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023) 0.00