Low-latency Real-time Non-parallel Voice Conversion Based On Cyclic Variational Autoencoder And Multiband WaveRNN With Data-driven Linear Prediction
2021 · Patrick Lumban Tobing, Tomoki Toda
Abstract
This paper presents a low-latency real-time (LLRT) non-parallel voice conversion (VC) framework based on a cyclic variational autoencoder (CycleVAE) and multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a robust non-parallel multispeaker spectral model that uses a speaker-independent latent space and a speaker-dependent code to generate reconstructed or converted spectral features from the spectral features of an input speaker. MWDLP, in turn, is an efficient, high-quality neural vocoder that can handle multispeaker data and generate speech waveforms for LLRT applications on a CPU. To meet the LLRT constraint on a CPU, we propose a novel CycleVAE framework that uses mel-spectrograms as spectral features and is built with a sparse network architecture. Further, to improve modeling performance, we also propose a novel fine-tuning procedure that refines the frame-rate CycleVAE network using the waveform loss from the MWDLP network. T
Related papers
- High-fidelity And Low-latency Universal Neural Vocoder Based On Multiband WaveRNN With Data-driven Linear Prediction For Discrete Waveform Modeling (2021) · 6.77
- Non-parallel Voice Conversion With Cyclic Variational Autoencoder (2019) · 12.10
- Baseline System Of Voice Conversion Challenge 2020 With Cyclic Variational Autoencoder And Parallel WaveGAN (2020) · 4.24
- Vocoder-free Non-parallel Conversion Of Whispered Speech With Masked Cycle-consistent Generative Adversarial Networks (2023) · 0.00
- Refined WaveNet Vocoder For Variational Autoencoder Based Voice Conversion (2018) · 7.50
- Parallel-data-free Voice Conversion Using Cycle-consistent Adversarial Networks (2017) · 0.00
- CVC: Contrastive Learning For Non-parallel Voice Conversion (2020) · 7.50
- DualVC 3: Leveraging Language Model Generated Pseudo Context For End-to-end Low Latency Streaming Voice Conversion (2024) · 0.00