DualVC 3: Leveraging Language Model Generated Pseudo Context For End-to-end Low Latency Streaming Voice Conversion
2024 · Ziqian Ning, Shuai Wang, Pengcheng Zhu, et al.
Abstract
Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180ms. Nonetheless, the recognition-synthesis framework hinders end-to-end optimization, and the instability of the automatic speech recognition (ASR) model on short chunks makes it challenging to further reduce latency. To address these issues, we propose an end-to-end model, DualVC 3. With speaker-independent semantic tokens guiding the training of the content encoder, the dependency on ASR is removed and the model can operate on extremely small chunks, with cascading errors eliminated. A language model is trained on the content encoder output to produce pseudo context by iteratively predicting future frames, providing more contextual information for the decoder to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves comparable perf
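The pseudo-context idea in the abstract (iteratively predicting future frames and handing them to the decoder as extra context) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `predict_next` interface, and the linear-extrapolation predictor are all assumptions chosen for illustration, with the toy predictor standing in for the trained language model.

```python
import numpy as np

def generate_pseudo_context(frames, predict_next, num_future):
    """Extend a chunk of content features with predicted future frames.

    frames: array of shape (T, D), assumed content-encoder output for the chunk.
    predict_next: hypothetical callable mapping a (t, D) history to the next
        (D,) frame; a stand-in for the language model.
    num_future: number of future frames to predict iteratively.
    """
    context = frames.copy()
    for _ in range(num_future):
        next_frame = predict_next(context)          # predict one frame ahead
        context = np.vstack([context, next_frame])  # feed the prediction back in
    return context

def toy_predictor(history):
    # Toy stand-in (assumption): linear extrapolation from the last two frames.
    return 2 * history[-1] - history[-2]

chunk = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])  # T=3 frames, D=2 dims
extended = generate_pseudo_context(chunk, toy_predictor, num_future=2)
print(extended.shape)  # the 3 real frames followed by 2 predicted frames
```

Each predicted frame is appended to the history before the next prediction, so later frames depend on earlier predictions. A decoder would then consume the extended sequence while only the real frames correspond to received audio.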
Related papers
- DualVC 2: Dynamic Masked Convolution For Unified Streaming And Non-streaming Voice Conversion (2023)
- StreamVoice: Streamable Context-aware Language Modeling For Real-time Zero-shot Voice Conversion (2024)
- FastVC: Fast Voice Conversion With Non-parallel Data (2020)
- ATTS2S-VC: Sequence-to-sequence Voice Conversion With Attention And Context Preservation Mechanisms (2018)
- Voice Conversion Using Sequence-to-sequence Learning Of Context Posterior Probabilities (2017)
- Assem-VC: Realistic Voice Conversion By Assembling Modern Speech Synthesis Techniques (2021)
- AC-VC: Non-parallel Low Latency Phonetic Posteriorgrams Based Voice Conversion (2021)
- Voice Conversion Can Improve ASR In Very Low-resource Settings (2021)