U2++: Unified Two-pass Bidirectional End-to-end Model For Speech Recognition
2021 · Di Wu, Binbin Zhang, Chao Yang, et al.
Abstract
The unified streaming and non-streaming two-pass (U2) end-to-end model for speech recognition has shown strong performance in terms of streaming capability, accuracy, real-time factor (RTF), and latency. In this paper, we present U2++, an enhanced version of U2 that further improves accuracy. The core idea of U2++ is to use both the forward and the backward information of the label sequences during training to learn richer information, and to combine the forward and backward predictions at decoding to give more accurate recognition results. We also propose a new data augmentation method called SpecSub to make the U2++ model more accurate and robust. Our experiments show that, compared with U2, U2++ converges faster in training, is more robust to the choice of decoding method, and achieves a consistent 5% - 8% word error rate reduction over U2. On AISHELL-1, we achieve a 4.63% character error rate (CER) with a non-streaming setup and 5.05% with a stream
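The bidirectional idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no formula, so the linear interpolation, the `reverse_weight` parameter, and the helper names `make_targets`, `combine_scores`, and `rescore` are all assumptions made for illustration. The sketch shows (1) how one label sequence yields a forward (left-to-right) target and a backward (right-to-left) target for training, and (2) how per-hypothesis scores from the two directions might be combined to pick a final result at decoding.

```python
from typing import List, Sequence, Tuple


def make_targets(tokens: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Build forward and backward targets from one label sequence.

    The backward target is the same sequence reversed, so a
    right-to-left decoder is trained on the same labels.
    """
    forward = list(tokens)
    backward = list(reversed(tokens))
    return forward, backward


def combine_scores(l2r_logp: float, r2l_logp: float,
                   reverse_weight: float = 0.3) -> float:
    """Interpolate forward and backward log-probabilities.

    `reverse_weight` is a hypothetical tuning knob (an assumption);
    0.0 ignores the backward decoder entirely.
    """
    return (1.0 - reverse_weight) * l2r_logp + reverse_weight * r2l_logp


def rescore(hyps: Sequence[str],
            l2r_scores: Sequence[float],
            r2l_scores: Sequence[float],
            reverse_weight: float = 0.3) -> str:
    """Return the hypothesis with the best combined score."""
    best = max(
        range(len(hyps)),
        key=lambda i: combine_scores(l2r_scores[i], r2l_scores[i],
                                     reverse_weight),
    )
    return hyps[best]


if __name__ == "__main__":
    # Toy n-best list with made-up scores (not results from the paper).
    hyps = ["hyp_a", "hyp_b"]
    l2r = [-1.0, -1.2]   # forward decoder slightly prefers hyp_a
    r2l = [-3.0, -1.0]   # backward decoder strongly prefers hyp_b
    print(make_targets([1, 2, 3]))
    print(rescore(hyps, l2r, r2l, reverse_weight=0.0))
    print(rescore(hyps, l2r, r2l, reverse_weight=0.3))
```

With `reverse_weight=0.0` only the forward scores matter, so the forward decoder's preferred hypothesis is chosen. A non-zero weight lets the backward decoder overturn that choice when it disagrees strongly, which is the intuition behind combining both directions.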
Related papers
- Fast-U2++: Fast And Accurate End-to-end Speech Recognition In Joint CTC/Attention Frames (2022) 5.84
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) 0.00
- Enhancing The Unified Streaming And Non-streaming Model With Contrastive Learning (2023) 3.58
- Two-stage Augmentation And Adaptive CTC Fusion For Improved Robustness Of Multi-stream End-to-end ASR (2021) 2.26
- UnitY: Two-pass Direct Speech-to-speech Translation With Discrete Units (2022) 9.59
- UFO2: A Unified Pre-training Framework For Online And Offline Speech Recognition (2022) 4.52
- CUSIDE-array: A Streaming Multi-channel End-to-end Speech Recognition System With Realistic Evaluations (2024) 0.00
- Modality Confidence Aware Training For Robust End-to-end Spoken Language Understanding (2023) 2.26