Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models With Diverse Speech Variabilities
2024 · Aulia Adila, Dessi Lestari, Ayu Purwarianti, et al.
Abstract
An ideal speech recognition model can transcribe speech accurately across varying characteristics of the speech signal, such as speaking style (read and spontaneous), speech context (formal and informal), and background noise conditions (clean and moderate). Building such a model requires a significant amount of training data with diverse speech characteristics. Currently, Indonesian data is dominated by read, formal, and clean speech, so Indonesian data with other speech variabilities is scarce. To develop Indonesian automatic speech recognition (ASR), we present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper, and compile a dataset of Indonesian speech with these variabilities to facilitate our study. We further investigate the models' ability to transcribe Indonesian speech data across different variability groups. The best results were achieved by the Whisper fine-tune
Related papers
- Whisper-LM: Improving ASR Models With Language Models For Low-resource Languages (2025)
- M2R-Whisper: Multi-stage And Multi-scale Retrieval Augmentation For Enhancing Whisper (2024)
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)
- XLS-R Deep Learning Model For Multilingual ASR On Low-Resource Languages: Indonesian, Javanese, And Sundanese (2024)
- PI-Whisper: Designing An Adaptive And Incremental Automatic Speech Recognition System For Edge Devices (2024)
- Accented Speech Recognition: Benchmarking, Pre-training, And Diverse Data (2022)
- Investigating Self-supervised, Weakly Supervised And Fully Supervised Training Approaches For Multi-domain Automatic Speech Recognition: A Study On Bangladeshi Bangla (2022)
- A Highly Adaptive Acoustic Model For Accurate Multi-dialect Speech Recognition (2022)