Few-Shot Learning Towards Voice Cloning

Abstract

A voice cloning model aims to synthesise speech that preserves a speaker’s identity and prosodic characteristics from short-duration audio samples. The proposed pipeline first preprocesses the input audio, then uses XTTS as the main reference-conditioned TTS model and YourTTS as a secondary synthesiser that adds expressive variation, together with a fallback mechanism for obtaining stable output. The generated speech is then enhanced using DSP techniques and adaptive prosody correction. Evaluation on pitch, energy, and duration metrics shows reduced errors. The pipeline also supports high-quality cross-lingual voice cloning. The proposed framework is thus a robust voice-cloning system that supports multiple languages, improves sound quality, corrects rhythm and pronunciation, and adjusts speaking style.
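The fallback step and the prosody error measurement can be illustrated with a minimal sketch. The abstract does not specify how the fallback is triggered or how errors are computed, so everything here is an assumption: the synthesisers are passed in as plain callables (hypothetical stand-ins, not the XTTS or YourTTS APIs), the stability check `is_stable` is a crude placeholder, and `mean_abs_error` stands in for whichever pitch, energy, or duration error the authors actually report.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: a synthesiser maps (text, reference samples) to output samples.
Synth = Callable[[str, Sequence[float]], List[float]]


def is_stable(samples: Sequence[float], max_abs: float = 1.0) -> bool:
    """Crude stability check (an assumption): non-empty, no NaN, no clipping."""
    if len(samples) == 0:
        return False
    # s == s is False only for NaN; infinities fail the amplitude bound.
    return all(s == s and abs(s) <= max_abs for s in samples)


def clone_with_fallback(
    text: str,
    reference: Sequence[float],
    synthesisers: Sequence[Tuple[str, Synth]],
) -> Tuple[str, List[float]]:
    """Try each synthesiser in order and return the first stable output."""
    last_error: Optional[Exception] = None
    for name, synth in synthesisers:
        try:
            output = synth(text, reference)
        except Exception as exc:  # a failing model triggers the fallback
            last_error = exc
            continue
        if is_stable(output):
            return name, output
    raise RuntimeError("no synthesiser produced stable output") from last_error


def mean_abs_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two contours, truncated to the shorter one."""
    n = min(len(predicted), len(target))
    if n == 0:
        raise ValueError("both contours must be non-empty")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / n
```

In use, the primary model would be listed first and the secondary model second, for example `clone_with_fallback(text, ref, [("primary", primary_fn), ("secondary", secondary_fn)])`, where both functions are wrappers the reader supplies. The same `mean_abs_error` helper could be applied to per-frame pitch, per-frame energy, or per-phoneme duration sequences once those have been extracted.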
