Fine-tuning Whisper On Low-resource Languages For Real-world Applications
2024 · Vincenzo Timmel, Claudio Paonessa, Reza Kakooee, et al.
Abstract
This paper presents a new approach to fine-tuning OpenAI's Whisper model for low-resource languages by introducing a novel data generation method that converts sentence-level data into a long-form corpus, using Swiss German as a case study. Non-sentence-level data, which could improve the performance of long-form audio, is difficult to obtain and often restricted by copyright laws. Our method bridges this gap by transforming more accessible sentence-level data into a format that preserves the model's ability to handle long-form audio and perform segmentation without requiring non-sentence-level data. Our data generation process improves performance in several real-world applications and leads to the development of a new state-of-the-art speech-to-text (STT) model for Swiss German. We compare our model with a non-fine-tuned Whisper and our previous state-of-the-art Swiss German STT models, where our new model achieves higher BLEU scores. Our results also indicate that the proposed metho
Authors
(none)
Tags
Stats
Related papers
- Whisper-lm: Improving ASR Models With Language Models For Low-resource Languages (2025)3.29
- Contextual Biasing To Improve Domain-specific Custom Vocabulary Audio Transcription Without Explicit Fine-tuning Of Whisper Model (2024)4.52
- Multilingual Distilwhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)8.09
- Generating Data With Text-to-speech And Large-language Models For Conversational Speech Recognition (2024)6.34
- Whispervc: Decoupled Cross-domain Alignment And Speech Generation For Low-resource Whisper-to-normal Conversion (2025)0.00
- Whisper Turns Stronger: Augmenting Wav2vec 2.0 For Superior ASR In Low-resource Languages (2024)0.00
- Weighted Cross-entropy For Low-resource Languages In Multilingual Speech Recognition (2024)6.34
- Generative Models For Improved Naturalness, Intelligibility, And Voicing Of Whispered Speech (2022)6.34