Leveraging Timestamp Information For Serialized Joint Streaming Recognition And Translation
2023 · Sara Papi, Peidong Wang, Junkun Chen, et al.
Abstract
The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions. This has made offering translations in multiple languages essential for user applications. Traditional approaches to automatic speech recognition (ASR) and speech translation (ST) have often relied on separate systems, leading to inefficient use of computational resources and increased synchronization complexity in real time. In this paper, we propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder. We introduce a novel method for joint token-level serialized output training based on timestamp information to effectively produce ASR and ST outputs in the streaming setting. Experiments on {it,es,de}→en demonstrate the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
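To make the idea of timestamp-based token-level serialization concrete, the following is a minimal sketch, not the authors' implementation. It assumes each ASR and ST token already carries an emission timestamp (the abstract does not say how these are obtained), and it merges the two streams into one target sequence ordered by time. The function name `interleave_by_timestamp`, the tag tokens `<asr>` and `<st>`, and the tie-breaking rule (ASR first) are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical token representation: (token text, emission time in seconds).
Token = Tuple[str, float]


def interleave_by_timestamp(
    asr: List[Token],
    st: List[Token],
    asr_tag: str = "<asr>",  # assumed tag token, not taken from the paper
    st_tag: str = "<st>",    # assumed tag token, not taken from the paper
) -> List[str]:
    """Merge ASR and ST tokens into a single sequence ordered by timestamp.

    A tag token is emitted whenever the output switches between the two
    streams. Timestamps within each stream are assumed to be non-decreasing.
    """
    # Label each token with its source: 0 = ASR, 1 = ST.
    labeled = [(time, 0, idx, text) for idx, (text, time) in enumerate(asr)]
    labeled += [(time, 1, idx, text) for idx, (text, time) in enumerate(st)]

    # Order by time; on ties put ASR before ST, then keep the original order.
    labeled.sort(key=lambda item: (item[0], item[1], item[2]))

    serialized: List[str] = []
    current_source = None
    for _, source, _, text in labeled:
        if source != current_source:
            serialized.append(asr_tag if source == 0 else st_tag)
            current_source = source
        serialized.append(text)
    return serialized


# Made-up Italian-to-English example with invented timestamps.
asr_tokens = [("ciao", 0.4), ("mondo", 0.9)]
st_tokens = [("hello", 0.6), ("world", 1.1)]
print(interleave_by_timestamp(asr_tokens, st_tokens))
```

Under these assumptions, a single decoder trained on such a merged sequence would see transcription and translation tokens in roughly the order they become available in the audio, which is the property a streaming setting needs.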
Related papers
- Token-level Serialized Output Training For Joint Streaming ASR And ST Leveraging Textual Alignments (2023)
- Large-scale Streaming End-to-end Speech Translation With Neural Transducers (2022)
- Direct Simultaneous Speech-to-text Translation Assisted By Synchronized Streaming ASR (2021)
- Developing Real-time Streaming Transformer Transducer For Speech Recognition On Large-scale Dataset (2020)
- Transcribing And Translating, Fast And Slow: Joint Speech Translation And Recognition (2024)
- Streaming Simultaneous Speech Translation With Augmented Memory Transformer (2020)
- Blockwise Streaming Transformer For Spoken Language Understanding And Simultaneous Speech Translation (2022)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)