Learning When To Speak: Latency And Quality Trade-offs For Simultaneous Speech-to-speech Translation With Offline Models
2023 · Liam Dugan, Anshul Wadhawan, Kyle Spence, et al.
Abstract
Recent work in speech-to-speech translation (S2ST) has focused primarily on offline settings, where the full input utterance is available before any output is given. This, however, is not reasonable in many real-world scenarios. In latency-sensitive applications, rather than waiting for the full utterance, translations should be spoken as soon as the information in the input is present. In this work, we introduce a system for simultaneous S2ST targeting real-world use cases. Our system supports translation from 57 languages to English with tunable parameters for dynamically adjusting the latency of the output -- including four policies for determining when to speak an output sequence. We show that these policies achieve offline-level accuracy with minimal increases in latency over a Greedy (wait-\(k\)) baseline. We open-source our evaluation code and interactive test script to aid future SimulS2ST research and application development.
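For readers unfamiliar with the wait-\(k\) baseline named above, the following is a minimal sketch of the standard text-level wait-\(k\) schedule: read \(k\) source tokens, then alternate between writing one target token and reading one source token until the source is exhausted. It is an illustration only, not the authors' implementation, which operates on speech segments; the function name `wait_k_schedule` and its parameters are hypothetical.

```python
from typing import List


def wait_k_schedule(source_len: int, target_len: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of a text-level wait-k policy.

    Hypothetical helper for illustration; token counts stand in for the
    speech segments a real simultaneous S2ST system would consume.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    actions: List[str] = []
    read = 0      # source tokens consumed so far
    written = 0   # target tokens emitted so far
    while written < target_len:
        # Stay k source tokens ahead of the output until the source runs out.
        if read < min(written + k, source_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


if __name__ == "__main__":
    print(wait_k_schedule(source_len=4, target_len=4, k=2))
```

A small `k` starts speaking early at the risk of translating with too little context; once `k` reaches the source length, the schedule reads everything before writing, which matches the offline setting.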
Related papers
- From Start To Finish: Latency Reduction Strategies For Incremental Speech Synthesis In Simultaneous Speech-to-speech Translation (2021)
- Does Simultaneous Speech Translation Need Simultaneous Models? (2022)
- SimulS2S-LLM: Unlocking Simultaneous Inference Of Speech LLMs For Speech-to-speech Translation (2025)
- StreamSpeech: Simultaneous Speech-to-speech Translation With Multi-task Learning (2024)
- Fluent And Low-latency Simultaneous Speech-to-speech Translation With Self-adaptive Training (2020)
- CA*: Addressing Evaluation Pitfalls In Computation-aware Latency For Simultaneous Speech Translation (2024)
- Efficient And Adaptive Simultaneous Speech Translation With Fully Unidirectional Architecture (2025)
- Better Late Than Never: Meta-evaluation Of Latency Metrics For Simultaneous Speech-to-text Translation (2025)