Token-level Serialized Output Training For Joint Streaming ASR And ST Leveraging Textual Alignments
2023 · Sara Papi, Peidong Wang, Junkun Chen, et al.
Abstract
In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation or even improves output quality compared to separate ASR and ST models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.
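The interleaving idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `interleave`, the `"asr"`/`"st"` tags, the toy alignment, and the emission rule are all assumptions made for illustration. The rule used here emits each target word after the last source word it is aligned to, while keeping the target words in their original order.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (tag, word); the "asr"/"st" tags are hypothetical


def interleave(src: List[str], tgt: List[str],
               align: List[Tuple[int, int]]) -> List[Token]:
    """Interleave source and target words into one serialized sequence.

    `align` holds (source_index, target_index) pairs, as a textual
    aligner might produce them. Assumed rule: a target word is emitted
    once the last source word aligned to it has been emitted, and never
    before an earlier target word.
    """
    # Latest aligned source index for each aligned target word.
    last_src: Dict[int, int] = {}
    for s, t in align:
        last_src[t] = max(s, last_src.get(t, -1))

    # emit_after[t]: source index after which target word t may appear.
    # The running maximum keeps target order; unaligned target words
    # inherit the position of the previous target word.
    emit_after: List[int] = []
    prev = -1
    for t in range(len(tgt)):
        prev = max(prev, last_src.get(t, prev))
        emit_after.append(prev)

    out: List[Token] = []
    t = 0
    for s, word in enumerate(src):
        out.append(("asr", word))
        while t < len(tgt) and emit_after[t] <= s:
            out.append(("st", tgt[t]))
            t += 1
    # Any target words left over (e.g. empty source) go at the end.
    out.extend(("st", w) for w in tgt[t:])
    return out


# Toy it-en example with a hand-written alignment (not from the paper).
src = ["il", "gatto", "nero"]
tgt = ["the", "black", "cat"]
align = [(0, 0), (1, 2), (2, 1)]
serialized = interleave(src, tgt, align)
```

In this toy case "the" follows "il" immediately, while "black" and "cat" wait until "nero" has been emitted because of the reordering between the two languages. Filtering the serialized sequence by tag recovers the original transcript and translation.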
Related papers
- Leveraging Timestamp Information For Serialized Joint Streaming Recognition And Translation (2023)
- Direct Simultaneous Speech-to-text Translation Assisted By Synchronized Streaming ASR (2021)
- Large-scale Streaming End-to-end Speech Translation With Neural Transducers (2022)
- Blockwise Streaming Transformer For Spoken Language Understanding And Simultaneous Speech Translation (2022)
- Dual-decoder Transformer For Joint Automatic Speech Recognition And Multilingual Speech Translation (2020)
- Transcribing And Translating, Fast And Slow: Joint Speech Translation And Recognition (2024)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)
- Streaming Simultaneous Speech Translation With Augmented Memory Transformer (2020)