Generalizing RNN-Transducer To Out-domain Audio Via Sparse Self-attention Layers
2021 · Juntae Kim, Jeehye Lee
Abstract
Recurrent neural network transducer (RNN-T) is an end-to-end speech recognition framework converting input acoustic frames into a character sequence. The state-of-the-art encoder network for RNN-T is the Conformer, which can effectively model the local-global context information via its convolution and self-attention layers. Although Conformer RNN-T has shown outstanding performance, most studies have been verified in the setting where the train and test data are drawn from the same domain. The domain mismatch problem for Conformer RNN-T has not been intensively investigated yet, which is an important issue for the product-level speech recognition system. In this study, we identified that fully connected self-attention layers in the Conformer caused high deletion errors, specifically in the long-form out-domain utterances. To address this problem, we introduce sparse self-attention layers for Conformer-based encoder networks, which can exploit local and generalized global information b
Related papers
- Self-attention Transducers For End-to-end Speech Recognition (2019)
- Transformer-transducer: End-to-end Speech Recognition With Self-attention (2019)
- Efficient Conformer With Prob-sparse Attention Mechanism For End-to-end Speech Recognition (2021)
- Self-consistent Context Aware Conformer Transducer For Speech Recognition (2024)
- Improving RNN Transducer Modeling For End-to-end Speech Recognition (2019)
- Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss (2020)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)
- RNN-T Models Fail To Generalize To Out-of-domain Audio: Causes And Solutions (2020)