From Hype To Insight: Rethinking Large Language Model Integration In Visual Speech Recognition
2025 · Rishabh Jain, Naomi Harte
Abstract
Advances in self-supervised encoders have improved Visual Speech Recognition (VSR). Recent approaches integrating these encoders with LLM decoders improve transcription accuracy; however, it remains unclear whether these gains stem from visual understanding or stronger language modeling. In this work, we systematically evaluate LLM decoders by freezing or selectively updating the visual encoder, scaling decoder size, comparing adaptation strategies and architectures, and varying training data across LRS2, LRS3, and their combination. Evaluation on LRS2, LRS3, and WildVSR shows that scaling and adaptation yield limited improvements, while combining datasets enhances generalization. Semantic analysis reveals that gains arise primarily from lexical rather than semantic processing. Our Llama-2-13B model trained on the combined set achieves 24.7% WER on LRS3 and 47.0% on WildVSR, establishing SOTA among models trained without additional supervision. Our findings indicate LLM decoders refin
Related papers
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)
- Adapting Speech Foundation Models For Unified Multimodal Speech Recognition With Large Language Models (2025)
- Large Language Model Guided Decoding For Self-supervised Speech Recognition (2025)
- Boosting Large Language Model For Speech Synthesis: An Empirical Study (2023)
- Omni-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models (2025)
- On Decoder-only Architecture For Speech-to-text And Large Language Model Integration (2023)
- Enhancing The Stability Of LLM-based Speech Generation Systems Through Self-supervised Representations (2024)