Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems With Zero-shot TTS To Indic Languages
2024 · Akshit Arora, Rohan Badlani, Sungwon Kim, et al.
Abstract
In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by additionally training on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
Related papers
- The THU-HCSI Multi-speaker Multi-lingual Few-shot Voice Cloning System For LIMMITS'24 Challenge (2024)
- Towards Building Text-to-speech Systems For The Next Billion Users (2022)
- ZMM-TTS: Zero-shot Multilingual And Multispeaker Speech Synthesis Conditioned On Self-supervised Discrete Speech Representations (2023)
- IndicVoices-R: Unlocking A Massive Multilingual Multi-speaker Speech Corpus For Scaling Indian TTS (2024)
- MuLanTTS: The Microsoft Speech Synthesis System For Blizzard Challenge 2023 (2023)
- MSV Challenge 2022: NPU-HC Speaker Verification System For Low-resource Indian Languages (2022)
- Fast And Small Footprint Hybrid HMM-HiFiGAN Based System For Speech Synthesis In Indian Languages (2023)
- Deep Voice 2: Multi-speaker Neural Text-to-speech (2017)