The People's Speech: A Large-scale Diverse English Speech Recognition Dataset For Commercial Usage
2021 · Daniel Galvez, Greg Diamos, Juan Ciro, et al.
Abstract
The People's Speech is a free-to-download, 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected by searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus and plans for continued maintenance of the project under MLCommons's sponsorship.
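For readers unfamiliar with the metric quoted above, word error rate is conventionally defined as the word-level edit distance between a reference transcript and a model hypothesis (substitutions plus deletions plus insertions), divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the authors' evaluation code. The function name `word_error_rate` is hypothetical, and the whitespace tokenization with no case or punctuation normalization is an assumption, since the abstract does not describe the scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a 9.98% word error rate corresponds to roughly one word-level error per ten reference words. Because insertions are counted, the value can exceed 100% when the hypothesis is much longer than the reference.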
Related papers
- CrowdSpeech And VoxDIY: Benchmark Datasets For Crowdsourced Audio Transcription (2021)
- VoxLingua107: A Dataset For Spoken Language Recognition (2020)
- Speech Commands: A Dataset For Limited-vocabulary Speech Recognition (2018)
- WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus For Speech Recognition (2021)
- DailyTalk: Spoken Dialogue Dataset For Conversational Text-to-speech (2022)
- Google Crowdsourced Speech Corpora And Related Open-source Resources For Low-resource Languages And Dialects: An Overview (2020)
- SPGISpeech: 5,000 Hours Of Transcribed Financial Audio For Fully Formatted End-to-end Speech Recognition (2021)
- GigaSpeech 2: An Evolving, Large-scale And Multi-domain ASR Corpus For Low-resource Languages With Automated Crawling, Transcription And Refinement (2024)