Streaming Decoder-only Automatic Speech Recognition With Discrete Speech Units: A Pilot Study
2024 · Peikun Chen, Sining Sun, Changhao Shan, et al.
Abstract
Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. While achieving streaming speech recognition, experiments on the AISHELL-1 and -2 datasets demonstrate the competitive performance of our streaming approach with non-streaming decode
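The abstract mentions causal attention masking and a right-chunk attention variant but does not define either. The following is a minimal sketch of the general idea only: a causal mask, and the same mask widened by a bounded look-ahead. The function names and the `right` parameter are hypothetical, and treating "right-chunk" as a fixed number of future positions is an assumption, not the authors' stated formulation.

```python
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Boolean mask: entry [i, j] is True when query i may attend to key j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def lookahead_mask(seq_len: int, right: int) -> np.ndarray:
    """Causal mask widened by `right` future positions.

    `right` is a hypothetical parameter: an assumed stand-in for the size of
    the right chunk, which the abstract does not specify.
    """
    idx = np.arange(seq_len)
    # Rows index queries (i), columns index keys (j); allow j <= i + right.
    return idx[None, :] <= idx[:, None] + right


if __name__ == "__main__":
    print(causal_mask(4).astype(int))
    print(lookahead_mask(4, right=1).astype(int))
```

With `right=0` the second mask reduces to the plain causal mask. A larger value gives each position more future context, but the model must then wait for that many extra tokens before it can emit output.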
Related papers
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)
- High Performance Sequence-to-sequence Model For Streaming Speech Recognition (2020)
- Chunked Attention-based Encoder-decoder Model For Streaming Speech Recognition (2023)
- Streaming Joint Speech Recognition And Disfluency Detection (2022)
- Loss Masking Is Not Needed In Decoder-only Transformer For Discrete-token-based ASR (2023)
- Streaming Audio-visual Speech Recognition With Alignment Regularization (2022)
- Fast Streaming Transducer ASR Prototyping Via Knowledge Distillation With Whisper (2024)
- Decoder-only Architecture For Speech Recognition With CTC Prompts And Text Data Augmentation (2023)