MixSpeech: Data Augmentation For Low-resource Automatic Speech Recognition
2021 · Linghui Meng, Jin Xu, Xu Tan, et al.
Abstract
In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR). MixSpeech trains an ASR model by taking a weighted combination of two different speech features (e.g., mel-spectrograms or MFCC) as the input and recognizing both text sequences, where the two recognition losses are combined with the same weight. We apply MixSpeech to two popular end-to-end speech recognition models, LAS (Listen, Attend and Spell) and Transformer, and conduct experiments on several low-resource datasets, including TIMIT, WSJ, and HKUST. Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation, and outperforms SpecAugment, a strong data augmentation method, on these recognition tasks. Specifically, MixSpeech outperforms SpecAugment with a relative PER improvement of 10.6% on the TIMIT dataset, and achieves a strong WER of 4.7% on the WSJ dataset.
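The mixing step described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `mix_features` and `mix_loss` are hypothetical, zero-padding the shorter utterance is an assumption, and sampling the weight from a Beta distribution follows the usual mixup convention rather than anything stated in the abstract.

```python
import numpy as np


def mix_features(x_i, x_j, lam):
    """Weighted combination of two feature matrices of shape (frames, bins).

    Assumption: the shorter utterance is zero-padded along the time axis
    so that both inputs have the same number of frames.
    """
    n_frames = max(len(x_i), len(x_j))

    def pad(x):
        # Pad only at the end of the time axis; leave feature bins untouched.
        return np.pad(x, ((0, n_frames - len(x)), (0, 0)))

    return lam * pad(x_i) + (1.0 - lam) * pad(x_j)


def mix_loss(loss_i, loss_j, lam):
    """Combine the two recognition losses with the same weight as the input."""
    return lam * loss_i + (1.0 - lam) * loss_j


rng = np.random.default_rng(0)
x_i = rng.standard_normal((120, 80))  # e.g. 120 frames of 80-bin mel features
x_j = rng.standard_normal((95, 80))   # a second, shorter utterance

# Assumption: mixup-style weight in [0, 1] drawn from Beta(0.5, 0.5).
lam = rng.beta(0.5, 0.5)

x_mix = mix_features(x_i, x_j, lam)
# In training, loss_i and loss_j would be the model's losses on x_mix
# against the transcripts of x_i and x_j; placeholders are used here.
total = mix_loss(loss_i=2.0, loss_j=4.0, lam=lam)
```

The point of the sketch is that a single weight drives both the input mix and the loss mix, so a mixed input that is mostly the first utterance is also penalized mostly on the first transcript.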
Related papers
- Data Augmentation For End-to-end Code-switching Speech Recognition (2020)
- You Do Not Need More Data: Improving End-to-end Speech Recognition By Text-to-speech Data Augmentation (2020)
- ASR Data Augmentation In Low-resource Settings Using Cross-lingual Multi-speaker TTS And Cross-lingual Voice Conversion (2022)
- Frustratingly Easy Data Augmentation For Low-resource ASR (2025)
- Improving Low Resource Code-switched ASR Using Augmented Code-switched TTS (2020)
- Data Augmentation Methods For End-to-end Speech Recognition On Distant-talk Scenarios (2021)
- SegAugment: Maximizing The Utility Of Speech Translation Data With Segmentation-based Augmentations (2022)
- Speech Synthesis As Augmentation For Low-resource ASR (2020)