Chinese-LiPS: A Chinese Audio-visual Speech Recognition Dataset With Lip-reading And Presentation Slides
2025 · Jinghua Zhao, Yuhang Jia, Shiyao Wang, et al.
Abstract
Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or speaking contextual video, neglecting the potential of combining these different valuable visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined performance im
Related papers
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025)
- SlideAVSR: A Dataset Of Paper Explanation Videos For Audio-visual Speech Recognition (2024)
- VILAS: Exploring The Effects Of Vision And Language Context In Automatic Speech Recognition (2023)
- LiRA: Learning Visual Speech Representations From Audio Through Self-supervision (2021)
- LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021)
- LipGER: Visually-conditioned Generative Error Correction For Robust Automatic Speech Recognition (2024)
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023)
- Improved Lite Audio-visual Speech Enhancement (2020)