Efficient And Robust Long-form Speech Recognition With Hybrid H3-conformer
2024 · Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara
Abstract
Conformer has recently achieved state-of-the-art performance in many speech recognition tasks. However, Transformer-based models deteriorate significantly on long-form speech, such as lectures, because the self-attention mechanism becomes unreliable and its computation grows with the square of the input length. To address this problem, we incorporate a state-space model, Hungry Hungry Hippos (H3), to replace or complement the multi-head self-attention (MHSA). H3 models long-form sequences efficiently with linear-order computation. In experiments on two datasets, CSJ and LibriSpeech, our proposed H3-Conformer model performs efficient and robust recognition of long-form speech. Moreover, we propose a hybrid of H3 and MHSA and show that using H3 in higher layers and MHSA in lower layers significantly improves online recognition. We also investigate using H3 and MHSA in parallel in all layers, which yields the best performance.
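The scaling argument in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and it is not H3 itself: the operation counts are rough, the recurrence is a generic scalar diagonal state-space update, and every function and parameter name here (attention_score_ops, ssm_scan_ops, diagonal_ssm_scan, decay, in_gain, out_gain) is a hypothetical name chosen for illustration.

```python
def attention_score_ops(seq_len: int, dim: int) -> int:
    """Rough multiply count for the QK^T score matrix of one attention head.

    Every frame is compared with every other frame, so the cost grows
    with the square of the sequence length.
    """
    return seq_len * seq_len * dim


def ssm_scan_ops(seq_len: int, dim: int, state_size: int) -> int:
    """Rough multiply count for a diagonal state-space recurrence.

    Each frame updates a fixed-size state once, so the cost grows
    linearly with the sequence length. (Assumed simplification.)
    """
    return seq_len * dim * state_size


def diagonal_ssm_scan(inputs, decay, in_gain, out_gain):
    """Generic scalar recurrence, used only to show the single linear pass:

    state_t = decay * state_{t-1} + in_gain * u_t
    y_t     = out_gain * state_t
    """
    state = 0.0
    outputs = []
    for u in inputs:
        state = decay * state + in_gain * u
        outputs.append(out_gain * state)
    return outputs


if __name__ == "__main__":
    # Doubling the input length quadruples the attention cost
    # but only doubles the recurrence cost.
    for seq_len in (1000, 2000, 4000):
        print(seq_len,
              attention_score_ops(seq_len, 256),
              ssm_scan_ops(seq_len, 256, 64))
```

Under these assumptions, going from 1000 to 2000 frames multiplies the attention count by four and the recurrence count by two, which is the gap that becomes significant for lecture-length inputs.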
Related papers
- Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition (2023)
- Efficient Conformer: Progressive Downsampling And Grouped Attention For Automatic Speech Recognition (2021)
- Conformer-based Hybrid ASR System For Switchboard Dataset (2021)
- Universal Paralinguistic Speech Representations Using Self-supervised Conformers (2021)
- Practice Of The Conformer Enhanced Audio-Visual HuBERT On Mandarin And English (2023)
- Nextformer: A ConvNeXt Augmented Conformer For End-to-end Speech Recognition (2022)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021)
- Efficient Conformer With Prob-sparse Attention Mechanism For End-to-end Speech Recognition (2021)