Integration Of Frame- And Label-synchronous Beam Search For Streaming Encoder-decoder Speech Recognition
2023 · Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, et al.
Abstract
Although frame-based models, such as CTC and transducers, are well suited to streaming automatic speech recognition, their decoding uses no future knowledge, which can lead to incorrect pruning. Conversely, label-based attention encoder-decoder models mitigate this issue through soft attention over the input, but, unlike CTC, they tend to overestimate labels biased towards their training domain. We exploit these complementary attributes and propose integrating frame- and label-synchronous (F-/L-Sync) decoding, performed alternately within a single beam-search scheme. F-Sync decoding leads the decoding for block-wise processing, while L-Sync decoding provides prioritized hypotheses using look-ahead future frames within a block. We maintain the hypotheses from both decoding methods to perform effective pruning. Experiments demonstrate that the proposed search algorithm achieves lower error rates than the other search methods while remaining robust in out-of-domain situations.
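The idea of maintaining hypotheses from both decoding methods before pruning can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `merge_and_prune`, the dictionary representation of hypotheses, and the rule that a hypothesis proposed by both decoders keeps its higher score are all assumptions made for illustration, and the scores are treated as directly comparable log-probabilities.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, ...]  # a partial transcript as a token sequence


def merge_and_prune(
    f_sync: Dict[Hypothesis, float],
    l_sync: Dict[Hypothesis, float],
    beam_size: int,
) -> List[Tuple[Hypothesis, float]]:
    """Pool hypotheses from both decoders and keep the best `beam_size`.

    Assumptions (not taken from the paper): scores are comparable
    log-probabilities, and a hypothesis found by both decoders keeps
    the higher of its two scores.
    """
    pooled: Dict[Hypothesis, float] = dict(f_sync)
    for hyp, score in l_sync.items():
        if hyp not in pooled or score > pooled[hyp]:
            pooled[hyp] = score
    # Rank the pooled set by score, best first, then prune to the beam width.
    ranked = sorted(pooled.items(), key=lambda item: item[1], reverse=True)
    return ranked[:beam_size]


# Hypothetical scores for one block; the values are made up.
f_sync_hyps = {("the", "cat"): -1.2, ("the", "cap"): -1.5}
l_sync_hyps = {("the", "cat"): -0.9, ("the", "cat", "sat"): -1.4}
beam = merge_and_prune(f_sync_hyps, l_sync_hyps, beam_size=2)
```

In this toy case the L-Sync set contributes a longer hypothesis that F-Sync decoding alone has not yet reached, and it survives pruning because both sets are ranked together.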
Related papers
- Streaming Parallel Transducer Beam Search With Fast-slow Cascaded Encoders (2022)
- Combining Frame-synchronous And Label-synchronous Systems For Speech Recognition (2021)
- Run-and-back Stitch Search: Novel Block Synchronous Decoding For Streaming Encoder-decoder ASR (2022)
- Joint Beam Search Integrating CTC, Attention, And Transducer Decoders (2024)
- Vectorization Of Hypotheses And Speech For Faster Beam Search In Encoder Decoder-based Speech Recognition (2018)
- Robust Beam Search For Encoder-decoder Attention Based Speech Recognition Without Length Bias (2020)
- Segment-level Vectorized Beam Search Based On Partially Autoregressive Inference (2023)
- High Performance Sequence-to-sequence Model For Streaming Speech Recognition (2020)