SegAug: CTC-Aligned Segmented Augmentation for Robust RNN-Transducer Based Speech Recognition
2025 · Khanh Le, Tuan Vu Ho, Dung Tran, et al.
Abstract
RNN-Transducer (RNN-T) is a widely adopted architecture in speech recognition, integrating acoustic and language modeling in an end-to-end framework. However, the RNN-T predictor tends to over-rely on consecutive word dependencies in training data, leading to high deletion error rates, particularly with less common or out-of-domain phrases. Existing solutions, such as regularization and data augmentation, often compromise other aspects of performance. We propose SegAug, an alignment-based augmentation technique that generates contextually varied audio-text pairs with low sentence-level semantics. This method encourages the model to focus more on acoustic features while diversifying the learned textual patterns of its internal language model, thereby reducing deletion errors and enhancing overall performance. Evaluations on the LibriSpeech and Tedlium-v3 datasets demonstrate a relative WER reduction of up to 12.5% on small-scale and 6.9% on large-scale settings. Notably, most of the imp
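The abstract describes SegAug only at a high level, so the following is a minimal sketch of one way an alignment-based segment augmentation could work, not the authors' procedure. It assumes word-level alignments (for example, derived from a CTC model) are already available as sample offsets. The function name `shuffle_segments`, the `Alignment` tuple layout, and the `words_per_segment` parameter are hypothetical and chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical alignment entry: (word, start_sample, end_sample).
# In practice these offsets would come from a forced aligner; here they are given.
Alignment = Tuple[str, int, int]


def shuffle_segments(
    samples: Sequence[float],
    alignment: Sequence[Alignment],
    words_per_segment: int = 2,
    seed: int = 0,
) -> Tuple[List[float], str]:
    """Cut an utterance into aligned word groups, shuffle the groups,
    and rebuild a matching audio-text pair.

    Illustrative assumption only: the paper may segment, sample, or
    recombine differently.
    """
    rng = random.Random(seed)

    # Group consecutive aligned words into fixed-size segments.
    segments = [
        list(alignment[i:i + words_per_segment])
        for i in range(0, len(alignment), words_per_segment)
    ]
    rng.shuffle(segments)

    new_samples: List[float] = []
    new_words: List[str] = []
    for segment in segments:
        start = segment[0][1]   # start of the first word in the segment
        end = segment[-1][2]    # end of the last word in the segment
        new_samples.extend(samples[start:end])
        new_words.extend(word for word, _, _ in segment)

    return new_samples, " ".join(new_words)


# Toy usage: 12 "samples" and four aligned words.
audio = list(range(12))
words = [("the", 0, 3), ("quick", 3, 6), ("brown", 6, 9), ("fox", 9, 12)]
aug_audio, aug_text = shuffle_segments(audio, words, words_per_segment=2, seed=1)
```

Each segment keeps its audio and words together, so the output pair stays acoustically consistent while the sentence-level word order is broken up, which is the property the abstract attributes to the augmented data.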
Related papers
- SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-Based Augmentations (2022) 0.00
- Improving RNN Transducer Modeling for End-to-End Speech Recognition (2019) 0.00
- Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition (2020) 8.82
- Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer (2018) 16.21
- S-Transformer: Segment-Transformer for Robust Neural Speech Synthesis (2020) 0.00
- Exploring Pre-training with Alignments for RNN Transducer Based End-to-End Speech Recognition (2020) 9.41
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction (2024) 0.00
- Speech Recognition with Augmented Synthesized Speech (2019) 13.97