Learning When To Translate For Streaming Speech
2021 · Qianqian Dong, Yaoming Zhu, Mingxuan Wang, et al.
Abstract
How can we find the proper moments to generate partial sentence translations given a streaming speech input? Existing approaches that wait and translate for a fixed duration often break acoustic units in speech, since the boundaries between acoustic units are not evenly spaced. In this paper, we propose MoSST, a simple yet effective method for translating streaming speech content. Given a typically long speech sequence, we develop an efficient monotonic segmentation module inside an encoder-decoder model that accumulates acoustic information incrementally and detects proper speech unit boundaries in the input for the speech translation task. Experiments on multiple translation directions of the MuST-C dataset show that MoSST outperforms existing methods and achieves the best trade-off between translation quality (BLEU) and latency. Our code is available at https://github.com/dqqcasia/mosst.
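The idea of accumulating acoustic information incrementally and emitting a boundary once enough has been gathered can be illustrated with a small sketch. This is not taken from the MoSST code: the per-frame weights, the threshold of 1.0, and the function name `detect_boundaries` are all assumptions made for illustration, and in a real model the weights would be predicted by the encoder rather than supplied by hand.

```python
from typing import List


def detect_boundaries(weights: List[float], threshold: float = 1.0) -> List[int]:
    """Return the frame indices at which a boundary is emitted.

    Hypothetical illustration: each frame carries a weight in [0, 1]
    (assumed to come from a learned predictor). Weights are summed
    left to right, so the process is monotonic and needs no lookahead.
    """
    boundaries: List[int] = []
    accumulated = 0.0
    for index, weight in enumerate(weights):
        accumulated += weight
        if accumulated >= threshold:
            # Enough acoustic evidence has been gathered: mark a boundary
            # here and carry the remainder over to the next unit.
            boundaries.append(index)
            accumulated -= threshold
    return boundaries
```

With the weights `[0.25, 0.5, 0.25, 0.5, 0.5, 0.125]`, this sketch marks boundaries at frames 2 and 4. A streaming translator could then wait for each boundary before deciding whether to emit more target tokens, instead of cutting the audio at fixed time intervals.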
Related papers
- Direct Simultaneous Speech-to-text Translation Assisted By Synchronized Streaming ASR (2021)
- StreamSpeech: Simultaneous Speech-to-speech Translation With Multi-task Learning (2024)
- Textless Streaming Speech-to-speech Translation Using Semantic Speech Tokens (2024)
- End-to-end Simultaneous Speech Translation With Differentiable Segmentation (2023)
- StreamAtt: Direct Streaming Speech-to-text Translation With Attention-based Audio History Selection (2024)
- SimulS2S-LLM: Unlocking Simultaneous Inference Of Speech LLMs For Speech-to-speech Translation (2025)
- Leveraging Timestamp Information For Serialized Joint Streaming Recognition And Translation (2023)
- Long-form End-to-end Speech Translation Via Latent Alignment Segmentation (2023)