Direct Simultaneous Speech-to-text Translation Assisted By Synchronized Streaming ASR
2021 · Junkun Chen, Mingbo Ma, Renjie Zheng, et al.
Abstract
Simultaneous speech-to-text translation is widely useful in many scenarios. The conventional cascaded approach uses a pipeline of streaming ASR followed by simultaneous MT, but suffers from error propagation and extra latency. To alleviate these issues, recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder due to the combination of two separate tasks. We instead propose a new paradigm with the advantages of both cascaded and end-to-end approaches. The key idea is to use two separate, but synchronized, decoders on streaming ASR and direct speech-to-text translation (ST), respectively, and the intermediate results of ASR guide the decoding policy of (but are not fed as input to) ST. During training, we use multitask learning to jointly learn these two tasks with a shared encoder. En-to-De and En-to-Es experiments on the MuST-C dataset demonstrate that our proposed technique achieves substantially better translation qual
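The abstract does not specify how the ASR output steers the ST decoder, so the following is only a minimal sketch of one plausible reading: a wait-k style policy in which the "source length" is the number of tokens the streaming ASR decoder has recognized so far, rather than a fixed number of speech frames. The function name, the `asr_token_counts` input, and the READ/WRITE action labels are all hypothetical and not taken from the paper. The ST decoder itself is abstracted away; as in the abstract, the ASR hypothesis is used only to decide when to emit, never as ST input.

```python
from typing import List, Sequence


def asr_guided_wait_k(
    asr_token_counts: Sequence[int], k: int, target_len: int
) -> List[str]:
    """Return a READ/WRITE action sequence for a wait-k style policy.

    asr_token_counts[i] is the (assumed) number of source tokens the
    streaming ASR decoder has recognized after the i-th speech chunk.
    The ST decoder may emit its next target token only while the ASR
    token count is at least k ahead of the tokens already written.
    """
    actions: List[str] = []
    written = 0
    for count in asr_token_counts:
        # Consume one more speech chunk; both decoders see it via the
        # shared encoder, which is not modeled here.
        actions.append("READ")
        # Emit target tokens while ASR is at least k tokens ahead.
        while written < target_len and count - written >= k:
            actions.append("WRITE")
            written += 1
    # Source speech has ended: emit whatever remains of the translation.
    while written < target_len:
        actions.append("WRITE")
        written += 1
    return actions


if __name__ == "__main__":
    # Five speech chunks; ASR has recognized 0, 1, 1, 2, 3 tokens so far.
    print(asr_guided_wait_k([0, 1, 1, 2, 3], k=2, target_len=3))
```

The design point this illustrates is that latency is tied to recognized words rather than elapsed audio, so silence or slow speech does not trigger premature output.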
Related papers
- Synchronous Speech Recognition And Speech-to-text Translation With Interactive Decoding (2019) · 10.48
- StreamSpeech: Simultaneous Speech-to-speech Translation With Multi-task Learning (2024) · 7.81
- Efficient And Adaptive Simultaneous Speech Translation With Fully Unidirectional Architecture (2025) · 2.26
- Joint Training And Decoding For Multilingual End-to-end Simultaneous Speech Translation (2025) · 0.95
- Leveraging Timestamp Information For Serialized Joint Streaming Recognition And Translation (2023) · 0.00
- Streaming Simultaneous Speech Translation With Augmented Memory Transformer (2020) · 6.77
- SimulS2S-LLM: Unlocking Simultaneous Inference Of Speech LLMs For Speech-to-speech Translation (2025) · 3.58
- Token-level Serialized Output Training For Joint Streaming ASR And ST Leveraging Textual Alignments (2023) · 0.00