Streaming Simultaneous Speech Translation With Augmented Memory Transformer
2020 · Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, et al.
Abstract
Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios since self-attention is computed over an entire input sequence and the computational cost grows quadratically with the length of the input sequence. Nevertheless, most of the previous work on simultaneous speech translation, the task of generating translations from partial audio input, ignores the time spent in generating the translation when analyzing the latency. With this assumption, a system may have good latency-quality trade-offs but be inapplicable in real-time scenarios. In this paper, we focus on the task of streaming simultaneous speech translation, where the systems are not only capable of translating with partial input but are also able to handle very long or continuous input. We propose an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder,
Related papers
- Implicit Memory Transformer For Computationally Efficient Simultaneous Speech Translation (2023)
- Streaming Transformer-based Acoustic Models Using Self-attention With Augmented Memory (2020)
- Streaming Attention-based Models With Augmented Memory For End-to-end Speech Recognition (2020)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)
- Direct Simultaneous Speech-to-text Translation Assisted By Synchronized Streaming ASR (2021)
- Leveraging Timestamp Information For Serialized Joint Streaming Recognition And Translation (2023)
- Large-scale Streaming End-to-end Speech Translation With Neural Transducers (2022)
- Blockwise Streaming Transformer For Spoken Language Understanding And Simultaneous Speech Translation (2022)