UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models
2024 · Ruchao Fan, Natarajan Balaji Shankar, Abeer Alwan
Abstract
Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. Encoder-based NASR, e.g., connectionist temporal classification (CTC), can be initialized from a speech foundation model (SFM) but does not account for any dependencies among intermediate tokens. Encoder-decoder-based NASR, such as the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but cannot efficiently integrate an SFM. Inspired by the success of recent work on speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as the major module, which can be the SFM. The encoder plays the role of both the CASS-NAT encoder and decoder through two forward passes. The first pass of the encoder accepts the speech signal as input, while the concatenation
Related papers
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023)
- CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer For Speech Recognition (2020)
- An Improved Single Step Non-autoregressive Transformer For Automatic Speech Recognition (2021)
- Improved Mask-CTC For Non-autoregressive End-to-end ASR (2020)
- Conformer-based Target-speaker Automatic Speech Recognition For Single-channel Audio (2023)
- Non-autoregressive End-to-end Approaches For Joint Automatic Speech Recognition And Spoken Language Understanding (2023)
- Joint Masked CPC And CTC Training For ASR (2020)
- Non-autoregressive Transformer ASR With CTC-enhanced Decoder Input (2020)