Google Crowdsourced Speech Corpora And Related Open-source Resources For Low-resource Languages And Dialects: An Overview
2020 · Alena Butryna, Shan-Hui Cathy Chu, Isin Demirsahin, et al.
Abstract
This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
Related papers
- GigaSpeech 2: An Evolving, Large-scale And Multi-domain ASR Corpus For Low-resource Languages With Automated Crawling, Transcription And Refinement (2024)
- CrowdSpeech And VoxDIY: Benchmark Datasets For Crowdsourced Audio Transcription (2021)
- A Crowdsourced Open-source Kazakh Speech Corpus And Initial Speech Recognition Baseline (2020)
- IndicVoices-R: Unlocking A Massive Multilingual Multi-speaker Speech Corpus For Scaling Indian TTS (2024)
- What Shall We Do With An Hour Of Data? Speech Recognition For The Un- And Under-served Languages Of Common Voice (2021)
- MSR-86K: An Evolving, Multilingual Corpus With 86,300 Hours Of Transcribed Audio For Speech Recognition Research (2024)
- Leveraging Translations For Speech Transcription In Low-resource Settings (2018)
- The People's Speech: A Large-scale Diverse English Speech Recognition Dataset For Commercial Usage (2021)