Unsupervised End-to-end Learning Of Discrete Linguistic Units For Voice Conversion
2019 · Andy T. Liu, Po-Chun Hsu, Hung-Yi Lee
Abstract
We present an unsupervised end-to-end training scheme where we discover discrete subword units from speech without using any labels. The discrete subword units are learned under an ASR-TTS autoencoder reconstruction setting, where an ASR-Encoder is trained to discover a set of common linguistic units given a variety of speakers, and a TTS-Decoder trained to project the discovered units back to the designated speech. We propose a discrete encoding method, Multilabel-Binary Vectors (MBV), to make the ASR-TTS autoencoder differentiable. We found that the proposed encoding method offers automatic extraction of speech content from speaker style, and is sufficient to cover full linguistic content in a given language. Therefore, the TTS-Decoder can synthesize speech with the same content as the input of ASR-Encoder but with different speaker characteristics, which achieves voice conversion (VC). We further improve the quality of VC using adversarial training, where we train a TTS-Patcher that
Related papers
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022) · 10.97
- Discrete Unit Based Masking For Improving Disentanglement In Voice Conversion (2024) · 0.00
- A Comparison Of Discrete And Soft Speech Units For Improved Voice Conversion (2021) · 20.25
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023) · 0.00
- Training Robust Zero-shot Voice Conversion Models With Self-supervised Features (2021) · 7.16
- Unsupervised Acoustic Unit Discovery For Speech Synthesis Using Discrete Latent-variable Neural Networks (2019) · 9.59
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021) · 6.34
- Disentangled Speech Representation Learning For One-shot Cross-lingual Voice Conversion Using β-VAE (2022) · 7.50