Don't Discard Fixed-window Audio Segmentation In Speech-to-text Translation
2022 · Chantal Amrhein, Barry Haddow
Abstract
For real-life applications, it is crucial that end-to-end spoken language translation models perform well on continuous audio, without relying on human-supplied segmentation. For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem. In this paper, we compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings and report results on translation quality, flicker and delay. Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
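As a rough illustration of what "fixed-window audio segmentation" means, the sketch below cuts a continuous recording into consecutive windows of equal duration, ignoring pauses and sentence boundaries. It is a minimal sketch and not the authors' implementation: the function name `fixed_window_segments`, the 16 kHz sample rate, and the 20-second window in the usage lines are illustrative assumptions, none of which are stated in the abstract.

```python
from typing import List, Tuple


def fixed_window_segments(
    num_samples: int, sample_rate: int, window_seconds: float
) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of consecutive fixed-length windows.

    Hypothetical helper for illustration only. The last window is shorter
    when the audio length is not a multiple of the window length.
    """
    if num_samples < 0 or sample_rate <= 0 or window_seconds <= 0:
        raise ValueError("expected num_samples >= 0 and positive rate and window")
    # Window length in samples; at least one sample so the loop advances.
    window = max(1, int(round(window_seconds * sample_rate)))
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# Usage: a 50-second recording at an assumed 16 kHz, cut into assumed 20 s windows.
segments = fixed_window_segments(50 * 16000, 16000, 20.0)
for start, end in segments:
    print(f"{start / 16000:.1f}s to {end / 16000:.1f}s")
```

Each `(start, end)` pair would then be passed to the translation model as if it were one utterance. Because the boundaries fall at arbitrary points in the speech, a model used this way needs to tolerate segments that begin or end mid-sentence, which is the kind of robustness the abstract refers to.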
Related papers
- Beyond Voice Activity Detection: Hybrid Audio Segmentation For Direct Speech Translation (2021)
- Speech Segmentation Optimization Using Segmented Bilingual Speech Corpus For End-to-end Speech Translation (2022)
- Simultaneous Translation For Unsegmented Input: A Sliding Window Approach (2022)
- Impact Of Encoding And Segmentation Strategies On End-to-end Simultaneous Speech Translation (2021)
- Smart Speech Segmentation Using Acousto-linguistic Features With Look-ahead (2022)
- Long-form End-to-end Speech Translation Via Latent Alignment Segmentation (2023)
- Subtitles To Segmentation: Improving Low-resource Speech-to-text Translation Pipelines (2020)
- Long-form Speech Translation Through Segmentation With Finite-state Decoding Constraints On Large Language Models (2023)