Leveraging Broadcast Media Subtitle Transcripts For Automatic Speech Recognition And Subtitling
2025 · Jakob Poncelet, Hugo van Hamme
Abstract
The recent advancement of speech recognition technology has been driven by large-scale datasets and attention-based architectures, but many challenges remain, especially for low-resource languages and dialects. This paper explores the integration of weakly supervised transcripts from TV subtitles into automatic speech recognition (ASR) systems, aiming to improve both verbatim transcriptions and automatically generated subtitles. To this end, verbatim data and subtitles are regarded as different domains or languages, due to their distinct characteristics. We propose and compare several end-to-end architectures that are designed to jointly model both modalities with separate or shared encoders and decoders. The proposed methods are able to jointly generate a verbatim transcription and a subtitle. Evaluation on Flemish (Belgian Dutch) demonstrates that a model with cascaded encoders and separate decoders represents the differences between the two data types most efficiently.
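The "cascaded encoders and separate decoders" layout named in the abstract can be illustrated with a toy data-flow sketch. Everything below is an assumption made for illustration, not the authors' implementation: the class name `CascadedJointModel`, the dimensions, the tiny vocabularies, and the use of random linear maps with a per-frame argmax as stand-ins for trained encoders and decoders. It only shows one plausible wiring, in which a first encoder feeds the verbatim decoder and a second encoder, stacked on the first, feeds the subtitle decoder.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (list of rows); a toy stand-in for a trained layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, frames):
    """Apply the same linear map to every frame (each frame is a list of floats)."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]


def greedy_tokens(score_frames, vocab):
    """Pick the highest-scoring vocabulary entry for each frame."""
    return [vocab[max(range(len(scores)), key=scores.__getitem__)]
            for scores in score_frames]


class CascadedJointModel:
    """Hypothetical wiring: encoder 1 -> verbatim head, encoder 1 -> encoder 2 -> subtitle head."""

    def __init__(self, feat_dim=4, hidden=6, seed=0):
        rng = random.Random(seed)
        # Illustrative vocabularies only; real systems use subword units.
        self.verbatim_vocab = ["uh", "ja", "dat", "is", "goed"]
        self.subtitle_vocab = ["Ja", "Dat", "is", "goed", "."]
        self.enc_verbatim = make_linear(feat_dim, hidden, rng)
        # Cascaded encoder: consumes the first encoder's output, not the raw features.
        self.enc_subtitle = make_linear(hidden, hidden, rng)
        # Separate output heads, one per data type.
        self.dec_verbatim = make_linear(hidden, len(self.verbatim_vocab), rng)
        self.dec_subtitle = make_linear(hidden, len(self.subtitle_vocab), rng)

    def forward(self, frames):
        h_verbatim = apply_linear(self.enc_verbatim, frames)
        h_subtitle = apply_linear(self.enc_subtitle, h_verbatim)
        verbatim = greedy_tokens(apply_linear(self.dec_verbatim, h_verbatim),
                                 self.verbatim_vocab)
        subtitle = greedy_tokens(apply_linear(self.dec_subtitle, h_subtitle),
                                 self.subtitle_vocab)
        return verbatim, subtitle


if __name__ == "__main__":
    model = CascadedJointModel()
    frames = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [0.9, 0.8, -0.7, 0.6]]
    verbatim, subtitle = model.forward(frames)
    print("verbatim:", verbatim)
    print("subtitle:", subtitle)
```

In this sketch a single forward pass yields both outputs, which mirrors the abstract's claim that the models jointly generate a verbatim transcription and a subtitle. The weights are random, so the printed tokens carry no meaning.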
Related papers
- Learning To Jointly Transcribe And Subtitle For End-to-end Spontaneous Speech Recognition (2022)
- Subtitles To Segmentation: Improving Low-resource Speech-to-text Translation Pipelines (2020)
- Direct Speech Translation For Automatic Subtitling (2022)
- Dodging The Data Bottleneck: Automatic Subtitling With Automatically Segmented ST Corpora (2022)
- End-to-end Multimodal Speech Recognition (2018)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)
- Between Flexibility And Consistency: Joint Generation Of Captions And Subtitles (2021)
- Speech Recognition On TV Series With Video-guided Post-ASR Correction (2025)