Mellotron: Multispeaker Expressive Voice Synthesis By Conditioning On Rhythm, Pitch And Global Style Tokens
2019 · Rafael Valle, Jason Li, Ryan Prenger, et al.
Abstract
Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.
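The abstract reports F0 Frame Errors but does not define the metric. The sketch below assumes the commonly used definition: a frame counts as an error if the voicing decision differs between the reference and synthesized contours, or if both frames are voiced and the synthesized pitch deviates from the reference by more than 20%. The function name `f0_frame_error`, the 0.0 Hz encoding for unvoiced frames, and the example contours are illustrative assumptions, not taken from the paper or its code.

```python
def f0_frame_error(ref_f0, syn_f0, tolerance=0.2):
    """Fraction of frames with a voicing error or a gross pitch error.

    Assumption of this sketch: unvoiced frames are encoded as 0.0 Hz and
    both contours are already aligned frame by frame.
    """
    if len(ref_f0) != len(syn_f0):
        raise ValueError("contours must have the same number of frames")
    if not ref_f0:
        raise ValueError("contours must not be empty")

    errors = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced = ref > 0.0
        syn_voiced = syn > 0.0
        if ref_voiced != syn_voiced:
            # Voicing decision error: one contour is voiced, the other is not.
            errors += 1
        elif ref_voiced and abs(syn - ref) > tolerance * ref:
            # Gross pitch error: both voiced, deviation above the tolerance.
            errors += 1
    return errors / len(ref_f0)


# Hypothetical five-frame contours in Hz, made up for illustration.
reference = [0.0, 100.0, 110.0, 120.0, 0.0]
synthesized = [0.0, 105.0, 140.0, 0.0, 0.0]
ffe = f0_frame_error(reference, synthesized)
print(f"FFE = {ffe:.2f}")  # frames 3 and 4 are in error, so FFE = 0.40
```

A lower value means the synthesized pitch contour tracks the reference more closely. In practice the two contours would come from a pitch tracker run on the source and generated audio; that step is omitted here.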
Related papers
- Natural TTS Synthesis By Conditioning WaveNet On Mel Spectrogram Predictions (2017) 24.07
- Msdtron: A High-capability Multi-speaker Speech Synthesis System For Diverse Data Using Characteristic Information (2021) 4.52
- Comelsinger: Discrete Token-based Zero-shot Singing Synthesis With Structured Melody Control And Guidance (2025) 0.00
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021) 6.77
- Mixer-TTS: Non-autoregressive, Fast And Compact Text-to-speech Model Conditioned On Language Model Embeddings (2021) 6.34
- MsEmoTTS: Multi-scale Emotion Transfer, Prediction, And Control For Emotional Speech Synthesis (2022) 13.97
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) 10.97
- Tacotron: Towards End-to-end Speech Synthesis (2017) 0.00