Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech
2024 · Anastasia Avdeeva, Aleksei Gusev
Abstract
Zero-shot voice conversion aims to transfer the voice of a source speaker to that of a speaker unseen during training, while preserving the content information. Although various methods have been proposed to reconstruct speaker information in generated speech, there is still room for improvement in achieving high similarity between generated and ground truth recordings. Furthermore, zero-shot voice conversion for speech in specific domains, such as whispered speech, remains an unexplored area. To address this problem, we propose a SpeakerVC model that can effectively perform zero-shot speech conversion in both voiced and whispered domains, while being lightweight and capable of running in streaming mode without significant quality degradation. In addition, we explore methods to improve the quality of speaker identity transfer and demonstrate their effectiveness for a variety of voice conversion systems.
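As an illustration of what "similarity between generated and ground truth recordings" can mean in practice, a common choice in voice conversion work is the cosine similarity between speaker embeddings of the two recordings. The sketch below assumes this measure; the abstract does not say which metric the authors use. The embedding vectors are toy placeholders, since a real pipeline would obtain them from a speaker-verification model, and the names `cosine_similarity`, `target_embedding`, and `converted_embedding` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy placeholder vectors (assumption): real embeddings would come from a
# speaker-verification model applied to the target and converted audio.
target_embedding = [0.2, 0.9, -0.4]
converted_embedding = [0.25, 0.8, -0.5]

# Values closer to 1.0 indicate that the two embeddings point the same way.
score = cosine_similarity(target_embedding, converted_embedding)
print(f"speaker similarity: {score:.3f}")
```

A score near 1.0 suggests the converted recording is close to the target speaker in embedding space; how well that tracks perceived similarity depends on the embedding model chosen.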
Related papers
- SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System For Both Human Beings And Machines (2021)
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)
- Zero-shot Voice Conversion Via Self-supervised Prosody Representation Learning (2021)
- Voicy: Zero-shot Non-parallel Voice Conversion In Noisy Reverberant Environments (2021)
- Training Robust Zero-shot Voice Conversion Models With Self-supervised Features (2021)
- StarGAN-ZSVC: Towards Zero-shot Voice Conversion In Low-resource Contexts (2021)
- SEF-VC: Speaker Embedding Free Zero-shot Voice Conversion With Cross Attention (2023)