Large Language Models Are Strong Audio-visual Speech Recognition Learners
2024 · Umberto Cappellazzo, Minsu Kim, Honglie Chen, et al.
Abstract
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with (automatic) speech recognition (ASR) abilities by just concatenating the audio tokens, computed with an audio encoder, and the text tokens to achieve state-of-the-art results. On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. It leverages pre-trained audio and video encoders to produce modality-specific tokens which, together with the text tokens, are processed by a pre-trained LLM (e.g., Llama3.1-8B) to yield the resulting response in an auto-regressive fashion. Llama-AVSR requires a small number of trainable parameters as o
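The token flow the abstract describes (modality-specific tokens from the audio and video encoders, joined with the text tokens and passed to the LLM) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the function names, the linear projection into the LLM width, the token order, and the toy dimensions are hypothetical and not taken from the paper, and random matrices stand in for real encoder outputs.

```python
import random

def random_matrix(rows, cols, rng):
    """Stand-in for encoder outputs or learned weights (random values)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def project(features, weight):
    """Map features of shape (T, d_enc) to tokens of shape (T, d_llm)."""
    return [
        [sum(f * w for f, w in zip(row, col)) for col in zip(*weight)]
        for row in features
    ]

def build_llm_input(audio_feats, video_feats, text_embeds, w_audio, w_video):
    """Concatenate audio, video, and text tokens along the sequence axis.

    The audio -> video -> text order is an assumption of this sketch.
    """
    audio_tokens = project(audio_feats, w_audio)
    video_tokens = project(video_feats, w_video)
    return audio_tokens + video_tokens + text_embeds

rng = random.Random(0)
d_audio, d_video, d_llm = 6, 5, 8          # toy widths, not the real model sizes

audio_feats = random_matrix(10, d_audio, rng)   # 10 audio frames
video_feats = random_matrix(4, d_video, rng)    # 4 video frames
text_embeds = random_matrix(3, d_llm, rng)      # 3 text-token embeddings
w_audio = random_matrix(d_audio, d_llm, rng)    # hypothetical audio projector
w_video = random_matrix(d_video, d_llm, rng)    # hypothetical video projector

llm_input = build_llm_input(audio_feats, video_feats, text_embeds, w_audio, w_video)
print((len(llm_input), len(llm_input[0])))  # (17, 8)
```

In this sketch the LLM would see a single sequence of 17 tokens, all of width 8, and would then generate the transcription auto-regressively from it.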
Related papers
- Omni-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models (2025)
- From Hype To Insight: Rethinking Large Language Model Integration In Visual Speech Recognition (2025)
- Adapting Speech Foundation Models For Unified Multimodal Speech Recognition With Large Language Models (2025)
- A Review Of Multi-modal Large Language And Vision Models (2024)
- VideoLLaMA 2: Advancing Spatial-temporal Modeling And Audio Understanding In Video-LLMs (2024)
- Fine-grained Audio-visual Joint Representations For Multimodal Large Language Models (2023)
- AudioChatLlama: Towards General-purpose Speech Abilities For LLMs (2023)
- UniAudio 1.5: Large Language Model-driven Audio Codec Is A Few-shot Audio Task Learner (2024)