Abstract

The zero-shot scenario for speech generation aims to synthesize a novel, unseen voice from only one utterance of the target speaker. Although the challenges of adapting to new voices in the zero-shot scenario exist in both stages (acoustic modeling and the vocoder), previous work usually addresses the problem in only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem in both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model that extracts the latent distribution \(p(z)\) of speech and reconstructs the waveform from it. A flow-based acoustic model then only needs to learn the same \(p(z)\) from text, which naturally avoids the mismatch between the acoustic model and the vocoder and yields high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for

Authors

(none)

Tags

  • Text-to-Speech
  • Voice Cloning

Stats

  • citations: 9
  • S2 citations: n/a
  • GitHub stars: 0
  • HF likes: 0
  • heat score: 7.50
  • arXiv key: lei2022glow

Related papers