LM-VC: Zero-shot Voice Conversion Via Speech Generation Based On Language Models
2023 · Zhichao Wang, Yuanzhe Chen, Lei Xie, et al.
Abstract
Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert source semantic tokens to target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content contained in semantic tokens may get dispersed during multi-layer modeling, while the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may be leaked to the target speech, lowering the target speaker similarity; 3) the generation diversity in the sampling of the LM can lead to unexpected outcomes during inference, le
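The "intuitive approach" described in the abstract can be sketched as a token-level pipeline. The snippet below is a minimal illustration under stated assumptions, not the LM-VC method or the AudioLM implementation: the separator id `SEP`, the helpers `build_context` and `generate_acoustic`, and the stand-in model `toy_probs` are all hypothetical names, and plain integer lists stand in for the tokens that HuBERT and SoundStream would produce. It also shows where sampling (issue 3) enters, since greedy decoding is repeatable while sampled decoding varies with the random state.

```python
import random
from typing import Callable, List, Optional, Sequence

SEP = -1  # hypothetical separator id between semantic and acoustic tokens


def build_context(source_semantic: Sequence[int],
                  speaker_prompt_acoustic: Sequence[int]) -> List[int]:
    """Concatenate source semantic tokens with the target speaker's acoustic prompt."""
    return [*source_semantic, SEP, *speaker_prompt_acoustic]


def generate_acoustic(context: Sequence[int],
                      next_token_probs: Callable[[Sequence[int]], List[float]],
                      num_steps: int,
                      rng: Optional[random.Random] = None) -> List[int]:
    """Autoregressively generate acoustic tokens.

    Greedy (argmax) decoding when rng is None, otherwise sampling from
    the distribution returned by next_token_probs.
    """
    sequence = list(context)
    generated: List[int] = []
    for _ in range(num_steps):
        probs = next_token_probs(sequence)
        if rng is None:
            token = max(range(len(probs)), key=probs.__getitem__)
        else:
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        sequence.append(token)
        generated.append(token)
    return generated


def toy_probs(sequence: Sequence[int], vocab_size: int = 8) -> List[float]:
    """Stand-in for a trained LM (assumption): one peaked token chosen from the context."""
    peak = sum(t for t in sequence if t >= 0) % vocab_size
    probs = [0.05] * vocab_size
    probs[peak] = 1.0 - 0.05 * (vocab_size - 1)
    return probs


context = build_context([1, 2], [3])
greedy_tokens = generate_acoustic(context, toy_probs, num_steps=4)
sampled_tokens = generate_acoustic(context, toy_probs, num_steps=4,
                                   rng=random.Random(0))
```

In a real system the generated acoustic tokens would be passed to a codec decoder to obtain a waveform; that step is omitted here.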
Related papers
- SLMGAN: Exploiting Speech Language Model Representations For Unsupervised Zero-shot Voice Conversion In GANs (2023) 0.00
- StreamVoice: Streamable Context-aware Language Modeling For Real-time Zero-shot Voice Conversion (2024) 7.16
- AudioLM: A Language Modeling Approach To Audio Generation (2022) 18.91
- StarGAN-ZSVC: Towards Zero-shot Voice Conversion In Low-resource Contexts (2021) 3.58
- End-to-end Zero-shot Voice Conversion With Location-variable Convolutions (2022) 7.50
- HierSpeech++: Bridging The Gap Between Semantic And Acoustic Representation Of Speech By Hierarchical Variational Inference For Zero-shot Speech Synthesis (2023) 6.19
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024) 4.52
- Zero-resource Speech Translation And Recognition With LLMs (2024) 3.58