Efficient Long-form Speech Recognition For General Speech In-context Learning
2024 · Hao Yen, Shaoshi Ling, Guoli Ye
Abstract
We propose a novel approach to end-to-end automatic speech recognition (ASR) to achieve efficient speech in-context learning (SICL) for (i) long-form speech decoding, (ii) test-time speaker adaptation, and (iii) test-time contextual biasing. Specifically, we introduce an attention-based encoder-decoder (AED) model with SICL capability (referred to as SICL-AED), where the decoder utilizes an utterance-level cross-attention to integrate information from the encoder's output efficiently, and a document-level self-attention to learn contextual information. Evaluated on the benchmark TEDLIUM3 dataset, SICL-AED achieves an 8.64% relative word error rate (WER) reduction compared to a baseline utterance-level AED model by leveraging previously decoded outputs as in-context examples. It also demonstrates comparable performance to conventional long-form AED systems with significantly reduced runtime and memory complexity. Additionally, we introduce an in-context fine-tuning (ICFT) technique that
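The two attention scopes named in the abstract can be illustrated with attention masks. The sketch below is not the authors' implementation: the function name `build_masks`, the per-token and per-frame utterance IDs, and the choice of a causal mask over the whole document for decoder self-attention are assumptions made for illustration only.

```python
def build_masks(token_utt_ids, frame_utt_ids):
    """Build boolean attention masks (True = attention allowed).

    token_utt_ids: utterance index of each decoder token in the document.
    frame_utt_ids: utterance index of each encoder output frame.
    Both inputs are hypothetical; the source does not specify a data layout.
    """
    n = len(token_utt_ids)
    # Document-level self-attention (assumed causal): token i may attend to
    # every earlier token in the document, including previous utterances.
    self_mask = [[j <= i for j in range(n)] for i in range(n)]
    # Utterance-level cross-attention: a token may attend only to encoder
    # frames that belong to its own utterance.
    cross_mask = [[t == f for f in frame_utt_ids] for t in token_utt_ids]
    return self_mask, cross_mask


# Toy document: utterance 0 has 3 tokens and 4 frames,
# utterance 1 has 2 tokens and 3 frames.
token_utt_ids = [0, 0, 0, 1, 1]
frame_utt_ids = [0, 0, 0, 0, 1, 1, 1]
self_mask, cross_mask = build_masks(token_utt_ids, frame_utt_ids)

# Count the token-frame pairs that cross-attention actually scores,
# compared with attending to every frame in the document.
allowed_pairs = sum(sum(row) for row in cross_mask)
full_pairs = len(token_utt_ids) * len(frame_utt_ids)
```

In this toy case the utterance-level cross-attention scores 18 token-frame pairs instead of the 35 that document-wide cross-attention would require, while the self-attention mask still lets tokens of the second utterance see the text decoded for the first.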
Related papers
- CIF-based Collaborative Decoding For End-to-end Contextual Speech Recognition (2020)
- Deep Context: End-to-end Contextual Speech Recognition (2018)
- End-to-end Contextual ASR Based On Posterior Distribution Adaptation For Hybrid CTC/Attention System (2022)
- End-to-end Contextual Speech Recognition Using Class Language Models And A Token Passing Decoder (2018)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021)
- SMILE: Speech Meta In-context Learning For Low-resource Language Automatic Speech Recognition (2024)
- Boundary And Context Aware Training For CIF-based Non-autoregressive End-to-end ASR (2021)
- Deep Contextualized Acoustic Representations For Semi-supervised Speech Recognition (2019)