Towards Zero-shot Text-based Voice Editing Using Acoustic Context Conditioning, Utterance Embeddings, And Reference Encoders
2022 · Jason Fong, Yun Wang, Prabhav Agrawal, et al.
Abstract
Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is the usage of finetuning to optimise performance: this requires further model training on data from the target speaker, which is a costly process that may incorporate potentially sensitive data into server-side models. In contrast, this work focuses on the zero-shot approach which avoids finetuning altogether, and instead uses pretrained speaker verification embeddings together with a jointly trained reference encoder to encode utterance-level information that helps capture aspects such as speaker identity and prosody. Subjective listening tests find that both utterance embeddings and a reference encoder improve the continuity of speaker identity and prosody between the
Related papers
- Content-dependent Fine-grained Speaker Embedding For Zero-shot Speaker Adaptation In Text-to-speech Synthesis (2022) · 10.07
- Learning Speaker Embedding From Text-to-speech (2020) · 5.84
- Vevo: Controllable Zero-shot Voice Imitation With Self-supervised Disentanglement (2025) · 0.00
- Takin-VC: Expressive Zero-shot Voice Conversion Via Adaptive Hybrid Content Encoding And Enhanced Timbre Modeling (2024) · 0.00
- YourTTS: Towards Zero-shot Multi-speaker TTS And Zero-shot Voice Conversion For Everyone (2021) · 0.00
- Noise-robust Zero-shot Text-to-speech Synthesis Conditioned On Self-supervised Speech-representation Model With Adapters (2024) · 7.50
- VoiceShop: A Unified Speech-to-speech Framework For Identity-preserving Zero-shot Voice Editing (2024) · 0.00
- Incremental Disentanglement For Environment-aware Zero-shot Text-to-speech Synthesis (2024) · 2.26