GLOBE: A High-quality English Corpus With Global Accents For Zero-shot Speaker Adaptive Text-to-speech
2024 · Wenbin Wang, Yang Song, Sanjay Jha
Abstract
This paper introduces GLOBE, a high-quality English corpus with worldwide accents, specifically designed to address the limitations of current zero-shot speaker adaptive Text-to-Speech (TTS) systems that exhibit poor generalizability in adapting to speakers with accents. Compared to commonly used English corpora, such as LibriTTS and VCTK, GLOBE is unique in its inclusion of utterances from 23,519 speakers and covers 164 accents worldwide, along with detailed metadata for these speakers. Compared to its original corpus, i.e., Common Voice, GLOBE significantly improves the quality of the speech data through rigorous filtering and enhancement processes, while also populating all missing speaker metadata. The final curated GLOBE corpus includes 535 hours of speech data at a 24 kHz sampling rate. Our benchmark results indicate that the speaker adaptive TTS model trained on the GLOBE corpus can synthesize speech with better speaker similarity and comparable naturalness than that trained on
Related papers
- English Accent Accuracy Analysis In A State-of-the-art Automatic Speech Recognition System (2021)
- GigaSpeech 2: An Evolving, Large-scale And Multi-domain ASR Corpus For Low-resource Languages With Automated Crawling, Transcription And Refinement (2024)
- LibriTTS: A Corpus Derived From LibriSpeech For Text-to-speech (2019)
- 1000 African Voices: Advancing Inclusive Multi-speaker Multi-accent Speech Synthesis (2024)
- IndicVoices-R: Unlocking A Massive Multilingual Multi-speaker Speech Corpus For Scaling Indian TTS (2024)
- A Pilot Study Of GSLM-based Simulation Of Foreign Accentuation Only Using Native Speech Corpora (2024)
- GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis (2024)
- Empowering Global Voices: A Data-efficient, Phoneme-tone Adaptive Approach To High-fidelity Speech Synthesis (2025)