PITS: Variational Pitch Inference Without Fundamental Frequency For End-to-end Pitch-controllable TTS
2023 · Junhyeok Lee, Wonbin Jung, Hyunjae Cho, et al.
Abstract
Previous pitch-controllable text-to-speech (TTS) models rely on directly modeling fundamental frequency, leading to low variance in synthesized speech. To address this issue, we propose PITS, an end-to-end pitch-controllable TTS model that utilizes variational inference to model pitch. Based on VITS, PITS incorporates the Yingram encoder, the Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch-controllability. Experiments demonstrate that PITS generates high-quality speech that is indistinguishable from ground truth speech and has high pitch-controllability without quality degradation. Code, audio samples, and demo are available at https://github.com/anonymous-pits/pits.
Related papers
- Period VITS: Variational Inference With Explicit Pitch Modeling For End-to-end Emotional Speech Synthesis (2022)
- FastPitch: Parallel Text-to-speech With Pitch Prediction (2020)
- Enhancement Of Pitch Controllability Using Timbre-preserving Pitch Augmentation In FastPitch (2022)
- Lightweight And High-fidelity End-to-end Text-to-speech With Multi-band Generation And Inverse Short-time Fourier Transform (2022)
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021)
- Using Generative Modelling To Produce Varied Intonation For Speech Synthesis (2019)
- VISinger: Variational Inference With Adversarial Learning For End-to-end Singing Voice Synthesis (2021)
- PAVITS: Exploring Prosody-aware VITS For End-to-end Emotional Voice Conversion (2024)