Speech Is More Than Words: Do Speech-to-text Translation Systems Leverage Prosody?
2024 · Ioannis Tsiamas, Matthias Sperber, Andrew Finch, et al.
Abstract
The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but the understanding of whether this is successful in practice is still limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (named ContraProST) aimed at capturing a wide range of prosodic phenomena. Our methodology uses large language models and controllable text-to-speech (TTS) to generate contrastive examples. Through experiments in translating English speech into