V-SAT: Video Subtitle Annotation Tool
2025 · Arpita Kundu, Joyita Chakraborty, Anindita Desarkar, et al.
Abstract
The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods, primarily speech-based transcription or OCR-based extraction, suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), image processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved, with the SUBER score reduced from 9.6 to
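To make one of the issues named in the abstract concrete, the following is a minimal sketch of a reading-speed check measured in characters per second. It is not V-SAT's implementation: the `Cue` type, the function names, and the 17 characters-per-second limit are all illustrative assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """A single subtitle cue (hypothetical structure for this sketch)."""
    start: float  # display start time, in seconds
    end: float    # display end time, in seconds
    text: str


def chars_per_second(cue: Cue) -> float:
    """Return the reading speed of a cue in characters per second."""
    duration = cue.end - cue.start
    if duration <= 0:
        # A cue with no display time cannot be read at all.
        return float("inf")
    # Line breaks are layout, not characters the viewer has to read.
    visible_chars = len(cue.text.replace("\n", ""))
    return visible_chars / duration


def flag_fast_cues(cues: list[Cue], max_cps: float = 17.0) -> list[int]:
    """Return indices of cues whose reading speed exceeds max_cps.

    The default limit is an assumed value for illustration only.
    """
    return [i for i, cue in enumerate(cues) if chars_per_second(cue) > max_cps]


cues = [
    Cue(0.0, 2.0, "Hello there."),
    Cue(2.0, 3.0, "This line is far too long to read."),
]
print(flag_fast_cues(cues))
```

A check like this covers only one isolated issue. The abstract's point is that synchronization, text correctness, and formatting need to be handled together, using context from both audio and video.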
Related papers
- VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model And Dataset (2023)
- Leveraging Broadcast Media Subtitle Transcripts For Automatic Speech Recognition And Subtitling (2025)
- Direct Speech Translation For Automatic Subtitling (2022)
- Learning To Jointly Transcribe And Subtitle For End-to-End Spontaneous Speech Recognition (2022)
- SubER: A Metric For Automatic Evaluation Of Subtitle Quality (2022)
- Speech Recognition On TV Series With Video-Guided Post-ASR Correction (2025)
- Evaluating Subtitle Segmentation For End-to-End Generation Systems (2022)
- VT-SSum: A Benchmark Dataset For Video Transcript Segmentation And Summarization (2021)