Transcribing And Translating, Fast And Slow: Joint Speech Translation And Recognition
2024 Β· Niko Moritz, Ruiming Xie, Yashesh Gaur, et al.
Abstract
We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart-glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including training of a transducer-based streaming machine translation (MT) model for the first time and applying it for parameter initialization of JSTAR. We demonstrate superior performances of JSTAR compared to a strong cascaded ST model in both BLEU scores and latency.
Authors
(none)
Tags
Stats
Related papers
- Synchronous Speech Recognition And Speech-to-text Translation With Interactive Decoding (2019)10.48
- Leveraging Timestamp Information For Serialized Joint Streaming Recognition And Translation (2023)0.00
- Direct Simultaneous Speech-to-text Translation Assisted By Synchronized Streaming ASR (2021)6.77
- Dual-decoder Transformer For Joint Automatic Speech Recognition And Multilingual Speech Translation (2020)13.73
- Joint Training And Decoding For Multilingual End-to-end Simultaneous Speech Translation (2025)0.95
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022)7.81
- Streamspeech: Simultaneous Speech-to-speech Translation With Multi-task Learning (2024)7.81
- Token-level Serialized Output Training For Joint Streaming ASR And ST Leveraging Textual Alignments (2023)0.00