ConVoice: Real-time Zero-shot Voice Style Transfer With Convolutional Network
2020 · Yurii Rebryk, Stanislav Beliaev
Abstract
We propose a neural network for zero-shot voice conversion (VC) without any parallel or transcribed data. Our approach uses pre-trained models for automatic speech recognition (ASR) and speaker embedding, obtained from a speaker verification task. Our model is fully convolutional and non-autoregressive except for a small pre-trained recurrent neural network for speaker encoding. ConVoice can convert speech of any length without compromising quality due to its convolutional architecture. Our model has comparable quality to similar state-of-the-art models while being extremely fast.
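The claim that a convolutional, non-autoregressive model can convert speech of any length can be illustrated with a toy example. The following is a minimal sketch, not the authors' architecture: it assumes the ASR model yields one content vector per frame and the speaker encoder yields a single fixed vector, and it uses one 1D convolution with random weights in place of a trained network. All names (`conv1d_same`, `convert`, `make_toy_weights`) and all dimensions are hypothetical.

```python
import random


def conv1d_same(frames, kernels, bias):
    """1D convolution over time with zero padding, so output length == input length.

    frames:  list of T frames, each a list of C_in floats
    kernels: list of C_out filters, each a list of K taps of C_in weights (K odd)
    bias:    list of C_out floats
    """
    k = len(kernels[0])
    pad = k // 2
    zero = [0.0] * len(frames[0])
    padded = [zero] * pad + frames + [zero] * pad
    out = []
    for t in range(len(frames)):
        window = padded[t:t + k]  # only local context, no dependence on earlier outputs
        frame_out = []
        for filt, b in zip(kernels, bias):
            acc = b
            for tap, frame in zip(filt, window):
                acc += sum(w * x for w, x in zip(tap, frame))
            frame_out.append(acc)
        out.append(frame_out)
    return out


def convert(content_frames, speaker_embedding, kernels, bias):
    """Append the same speaker vector to every content frame, then convolve."""
    conditioned = [frame + speaker_embedding for frame in content_frames]
    return conv1d_same(conditioned, kernels, bias)


def make_toy_weights(c_in, c_out, k, seed=0):
    """Random weights standing in for a trained model (assumption for illustration)."""
    rng = random.Random(seed)
    kernels = [[[rng.uniform(-0.1, 0.1) for _ in range(c_in)]
                for _ in range(k)]
               for _ in range(c_out)]
    return kernels, [0.0] * c_out


# Hypothetical sizes: 4-dim content features, 3-dim speaker embedding, 5 output bins.
kernels, bias = make_toy_weights(c_in=4 + 3, c_out=5, k=3)
target_speaker = [0.1, 0.2, 0.3]

short_utterance = [[0.5] * 4 for _ in range(7)]
long_utterance = [[0.5] * 4 for _ in range(50)]

print(len(convert(short_utterance, target_speaker, kernels, bias)))  # 7
print(len(convert(long_utterance, target_speaker, kernels, bias)))   # 50
```

The same weights handle both inputs because each output frame is computed from a fixed-size local window, independently of the other output frames. That independence is also what lets such a model process all frames in parallel, in contrast to an autoregressive decoder that must generate frames one at a time.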
Related papers
- AUTOVC: Zero-shot Voice Style Transfer With Only Autoencoder Loss (2019)
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)
- End-to-end Zero-shot Voice Conversion With Location-variable Convolutions (2022)
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021)
- Training Robust Zero-shot Voice Conversion Models With Self-supervised Features (2021)
- Voicy: Zero-shot Non-parallel Voice Conversion In Noisy Reverberant Environments (2021)
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024)
- Improving Zero-shot Voice Style Transfer Via Disentangled Representation Learning (2021)