Dodging The Data Bottleneck: Automatic Subtitling With Automatically Segmented ST Corpora
2022 · Sara Papi, Alina Karakanta, Matteo Negri, et al.
Abstract
Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks that comply with specific display guidelines. As in speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text must also be annotated with subtitle breaks. So far, this requirement has been a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems respectively trained on manual and automatic segmentations result in similar performance,
Related papers
- Direct Speech Translation For Automatic Subtitling (2022)
- Subtitles To Segmentation: Improving Low-resource Speech-to-text Translation Pipelines (2020)
- Between Flexibility And Consistency: Joint Generation Of Captions And Subtitles (2021)
- Leveraging Broadcast Media Subtitle Transcripts For Automatic Speech Recognition And Subtitling (2025)
- Leveraging Unsupervised And Weakly-supervised Data To Improve Direct Speech-to-speech Translation (2022)
- Towards Unsupervised Speech-to-text Translation (2018)
- Tackling Data Scarcity In Speech Translation Using Zero-shot Multilingual Machine Translation Techniques (2022)
- Speech Segmentation Optimization Using Segmented Bilingual Speech Corpus For End-to-end Speech Translation (2022)