SlideAVSR: A Dataset Of Paper Explanation Videos For Audio-visual Speech Recognition
2024 · Hao Wang, Shuhei Kurita, Shuichiro Shimizu, et al.
Abstract
Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR) that uses video as a complement to audio. Considerable effort in AVSR has been directed at datasets for facial features such as lip reading, but these datasets often fall short in evaluating image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset built from scientific paper explanation videos. SlideAVSR provides a new benchmark in which models transcribe speech utterances with the help of the text on the slides shown in the presentation recordings. Because the technical terms that are frequent in paper explanations are notoriously challenging to transcribe without reference texts, SlideAVSR spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.
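To make the idea of "referring to textual information from slides" concrete, here is a minimal sketch of one possible way to turn OCR output from a slide into a short text hint for a speech recognizer. This is an illustration under stated assumptions, not the DocWhisper implementation: the abstract does not say how slide text is fed to the model, and the function name `build_slide_prompt`, the `max_words` cap, and the de-duplication rule are all hypothetical.

```python
def build_slide_prompt(ocr_words, max_words=50):
    """Join de-duplicated OCR words into one prompt string.

    Hypothetical helper: assumes OCR has already produced a list of
    word strings for the slide shown during the utterance.
    """
    seen = set()   # lower-cased words already kept
    kept = []      # words in first-seen order
    for raw in ocr_words:
        if len(kept) >= max_words:
            break  # respect the word budget
        word = raw.strip()
        key = word.lower()
        if not word or key in seen:
            continue  # skip blanks and case-insensitive repeats
        seen.add(key)
        kept.append(word)
    return " ".join(kept)


if __name__ == "__main__":
    words = ["Transformer", "BLEU", "transformer", "", "WER"]
    print(build_slide_prompt(words, max_words=3))
```

A string built this way could, for example, be passed as a textual hint to a recognizer that accepts one, such as the `initial_prompt` argument of the open-source Whisper `transcribe` function. Whether DocWhisper works this way is an assumption, not something the abstract states.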
Related papers
- Chinese-LiPS: A Chinese Audio-visual Speech Recognition Dataset With Lip-reading And Presentation Slides (2025)
- XLAVS-R: Cross-lingual Audio-visual Speech Representation Learning For Noise-robust Speech Perception (2024)
- AKVSR: Audio Knowledge Empowered Visual Speech Recognition By Compressing Audio Knowledge Of A Pretrained Model (2023)
- MaViLS, A Benchmark Dataset For Video-to-slide Alignment, Assessing Baseline Accuracy With A Multimodal Alignment Algorithm Leveraging Speech, OCR, And Visual Features (2024)
- VT-SSum: A Benchmark Dataset For Video Transcript Segmentation And Summarization (2021)
- Audio Visual Segmentation Through Text Embeddings (2025)
- SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision (2023)
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)