Contextualized End-to-end Automatic Speech Recognition With Intermediate Biasing Loss
2024 Β· Muhammad Shakeel, Yui Sudo, Yifan Peng, et al.
Abstract
Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives. Our proposed intermediate biasing loss brings more regularization and contextualization to the network. Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100. Moreover, employing RNN-transducer-driven joint decoding further reduces the unbiased word error rate (U-WER), resulting in a more robust network.
Authors
(none)
Tags
Stats
Related papers
- Contextualized End-to-end Speech Recognition With Contextual Phrase Prediction Network (2023)10.48
- Improving Neural Biasing For Contextual Speech Recognition By Early Context Injection And Text Perturbation (2024)8.09
- Adaptive Contextual Biasing For Transducer Based Streaming Speech Recognition (2023)7.16
- Contextualized Streaming End-to-end Speech Recognition With Trie-based Deep Biasing And Shallow Fusion (2021)13.44
- Towards Contextual Spelling Correction For Customization Of End-to-end Speech Recognition Systems (2022)9.92
- Contextualized Automatic Speech Recognition With Attention-based Bias Phrase Boosted Beam Search (2024)8.60
- Robust Acoustic And Semantic Contextual Biasing In Neural Transducers For Speech Recognition (2023)8.60
- Locality Enhanced Dynamic Biasing And Sampling Strategies For Contextual ASR (2024)0.00