Framework For Curating Speech Datasets And Evaluating ASR Systems: A Case Study For Polish
2024 · Michał Junczyk
Abstract
Speech datasets available in the public domain are often underutilized because of challenges in discoverability and interoperability. A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. A case study focused on the Polish language was conducted; the framework was applied to curate more than 24 datasets and evaluate 25 combinations of ASR systems and models. This research constitutes the most extensive comparison to date of both commercial and free ASR systems for the Polish language. It draws insights from 600 system-model-test set evaluations, marking a significant advancement in both scale and comprehensiveness. The results of surveys and performance comparisons are available as interactive dashboards (https://huggingface.co/spaces/amu-cai/pl-asr-leaderboard) along with curated datasets (https://huggingface.co/datasets/amu-cai/pl-asr-bigos-v2, https://hug
Related papers
- SpeechColab Leaderboard: An Open-source Platform For Automatic Speech Recognition Evaluation (2024)
- SpeechRole: A Large-scale Dataset And Benchmark For Evaluating Speech Role-playing Agents (2025)
- Open ASR Leaderboard: Towards Reproducible And Transparent Multilingual And Long-form Speech Recognition Evaluation (2025)
- A Toolbox For Construction And Analysis Of Speech Datasets (2021)
- SD-Eval: A Benchmark Dataset For Spoken Dialogue Understanding Beyond Words (2024)
- Accented Speech Recognition: Benchmarking, Pre-training, And Diverse Data (2022)
- LibriheavyMix: A 20,000-hour Dataset For Single-channel Reverberant Multi-talker Speech Separation, ASR And Speaker Diarization (2024)
- CrowdSpeech And VoxDIY: Benchmark Datasets For Crowdsourced Audio Transcription (2021)