Listening And Seeing Again: Generative Error Correction For Audio-visual Speech Recognition
2025 · Rui Liu, Hongyu Yuan, Haizhou Li
Abstract
Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be used effectively for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs cannot understand audio and visual signals simultaneously, which makes the GER approach hard to apply to AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of "listening and seeing again". Specifically, we first use a powerful AVSR system to read the audio and visual signals and produce the N-best hypotheses, and then use a Q-former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert it into compressed audio and video representations, respectively, that the LLM can understand. Afterward, the audi
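As a rough illustration of the text-only part of GER described in the abstract, the sketch below formats a list of N-best hypotheses into a single prompt that an LLM could be asked to correct. The function name `build_ger_prompt`, the prompt wording, and the example hypotheses are all hypothetical and are not taken from the paper. The multimodal side of AVGER (the Q-former-based encoder and its compressed audio and video representations) is not modeled here.

```python
from typing import List


def build_ger_prompt(hypotheses: List[str]) -> str:
    """Format N-best hypotheses as a ranked list inside a text prompt.

    The template is an assumption made for illustration only; it is not
    the prompt used by AVGER.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    # Number the hypotheses from 1 (best) to N, one per line.
    ranked = [f"{rank}. {hyp.strip()}" for rank, hyp in enumerate(hypotheses, start=1)]
    return (
        "Below are the N-best hypotheses from a speech recognition system.\n"
        + "\n".join(ranked)
        + "\nReport the most likely true transcription:"
    )


# Hypothetical 3-best list, ordered from most to least likely.
nbest = ["turn on the light", "turn on the lights", "turn of the light"]
prompt = build_ger_prompt(nbest)
print(prompt)
```

In a setup like the one the abstract describes, a prompt of this kind would be paired with the audio and video representations before being passed to the LLM; how that pairing is done is not specified in the text available here.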
Related papers
- LipGER: Visually-conditioned Generative Error Correction For Robust Automatic Speech Recognition (2024)
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)
- Cross-modal Global Interaction And Local Alignment For Audio-visual Speech Recognition (2023)
- It's Never Too Late: Fusing Acoustic Information Into Large Language Models For Automatic Speech Recognition (2024)
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024)
- Watch Or Listen: Robust Audio-visual Speech Recognition With Visual Corruption Modeling And Reliability Scoring (2023)
- Large Language Model Based Generative Error Correction: A Challenge And Baselines For Speech Recognition, Speaker Tagging, And Emotion Recognition (2024)
- Benchmarking Japanese Speech Recognition On ASR-LLM Setups With Multi-pass Augmented Generative Error Correction (2024)