Omni-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models
2025 · Umberto Cappellazzo, Xubo Liu, Pingchuan Ma, et al.
Abstract
Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that raise computational and deployment resource use while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to efficiently train across multiple audio and visual granularities, re
Related papers
- Adapting Speech Foundation Models For Unified Multimodal Speech Recognition With Large Language Models (2025)
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)
- From Hype To Insight: Rethinking Large Language Model Integration In Visual Speech Recognition (2025)
- Au-m-ol: A Unified Model For Medical Audio And Language Understanding (2026)
- Multilingual And Fully Non-autoregressive ASR With Large Language Model Fusion: A Comprehensive Study (2024)
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024)
- Multimodal Integration For Large-vocabulary Audio-visual Speech Recognition (2020)
- Speakerlm: End-to-end Versatile Speaker Diarization And Recognition With Multimodal Large Language Models (2025)