PATCorrect: Non-autoregressive Phoneme-augmented Transformer for ASR Error Correction
2023 · Ziji Zhang, Zhehui Wang, Rajesh Kamma, et al.
Abstract
Speech-to-text errors made by automatic speech recognition (ASR) systems negatively impact downstream models. Error correction models have recently been developed as a post-processing text editing method for refining ASR outputs. However, efficient models that meet the low-latency requirements of industrial-grade production systems have not been well studied. We propose PATCorrect, a novel non-autoregressive (NAR) approach based on multi-modal fusion that leverages representations from both text and phoneme modalities to reduce word error rate (WER) and perform robustly with varying input transcription quality. We demonstrate that PATCorrect consistently outperforms the state-of-the-art NAR method on an English corpus across different upstream ASR systems, with an overall 11.62% WER reduction (WERR) compared to the 9.46% WERR achieved by other methods using the text modality only. In addition, its inference latency is on the order of tens of milliseconds, making it ideal for systems with low latency requirements.
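The two metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes WER is the word-level edit distance divided by the reference length, and that WERR is the relative reduction in WER after correction; the abstract does not spell out either formula. The function names and the example sentences are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(wer_before: float, wer_after: float) -> float:
    """Relative WER reduction (assumed definition of WERR)."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return (wer_before - wer_after) / wer_before


# Hypothetical example: a raw ASR output and a partially corrected version.
reference = "play the new song by adele"
asr_output = "play the knew song by a dell"
corrected = "play the new song by a dell"

wer_asr = word_error_rate(reference, asr_output)
wer_corrected = word_error_rate(reference, corrected)
werr = wer_reduction(wer_asr, wer_corrected)
print(f"WER before: {wer_asr:.3f}, after: {wer_corrected:.3f}, WERR: {werr:.1%}")
```

Under this assumed definition, a WERR of 11.62% means the corrected transcripts have a WER 11.62% lower, in relative terms, than the raw ASR output. It does not mean an absolute drop of 11.62 WER points.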