Large-scale Streaming End-to-end Speech Translation With Neural Transducers
2022 · Jian Xue, Peidong Wang, Jinyu Li, et al.
Abstract
Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce them to streaming end-to-end speech translation (ST), which aims to convert audio signals directly into text in other languages. Compared with cascaded ST, which performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency, exploits speech information, and avoids error propagation from ASR to MT. To improve the modeling capacity, we propose attention pooling for the joint network in TT. In addition, we extend TT-based ST to multilingual ST, which generates text in multiple languages at the same time. Experimental results on a large-scale pseudo-labeled training set of 50 thousand (K) hours show that TT-based ST not only significantly reduces inference time but also outperforms non-streaming cascaded ST for English-German translation.
Related papers
- LAMASSU: Streaming Language-agnostic Multilingual Speech Recognition And Translation Using Neural Transducers (2022)
- Leveraging Timestamp Information For Serialized Joint Streaming Recognition And Translation (2023)
- Label-synchronous Neural Transducer For E2E Simultaneous Speech Translation (2024)
- Transformer-transducer: End-to-end Speech Recognition With Self-attention (2019)
- Developing Real-time Streaming Transformer Transducer For Speech Recognition On Large-scale Dataset (2020)
- Streaming Simultaneous Speech Translation With Augmented Memory Transformer (2020)
- Multilingual End-to-end Speech Translation (2019)
- Token-level Serialized Output Training For Joint Streaming ASR And ST Leveraging Textual Alignments (2023)