Au-m-ol: A Unified Model For Medical Audio And Language Understanding
2026 Β· Meizhu Liu, Nistha Mitra, Paul Li, et al.
Abstract
arXiv:2604.23284v1 Announce Type: cross Abstract: In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a s
Authors
(none)
Tags
Stats
Related papers
- Omni-avsr: Towards Unified Multimodal Speech Recognition With Large Language Models (2025)2.26
- Audiotoolagent: An Agentic Framework For Audio-language Models (2025)2.60
- Towards Holistic Evaluation Of Large Audio-language Models: A Comprehensive Survey (2026)9.75
- Audiolm: A Language Modeling Approach To Audio Generation (2022)18.91
- Audio-omni: Extending Multi-modal Understanding To Versatile Audio Generation And Editing (2026)0.00
- Uniaudio 1.5: Large Language Model-driven Audio Codec Is A Few-shot Audio Task Learner (2024)0.00
- Uniaudio: An Audio Foundation Model Toward Universal Audio Generation (2023)5.56
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)9.59