All That Glitters Is Not Audio: Rethinking Text Priors And Audio Reliance In Audio-language Evaluation
2026 · Leonardo Haw-Yang Foo, Chih-Kai Yang, Chen-An Li, et al.
Abstract
arXiv:2604.24401v1
Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reli
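The "retained score" figure in the abstract can be read as a ratio of text-only accuracy to full-audio accuracy. The abstract does not give the paper's exact metric definitions, so the following is only a minimal sketch under that assumption; the `ItemResult` schema, the function names, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    """Correctness of one benchmark item under two conditions (hypothetical schema)."""
    correct_with_audio: bool  # model received the question and the audio
    correct_text_only: bool   # model received the question only, no audio


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def retained_fraction(results):
    """Text-only accuracy divided by full-audio accuracy (assumed definition)."""
    full = accuracy(r.correct_with_audio for r in results)
    text_only = accuracy(r.correct_text_only for r in results)
    if full == 0.0:
        raise ValueError("full-audio accuracy is zero; the ratio is undefined")
    return text_only / full


# Toy data invented for illustration only; these are not results from the paper.
toy = [
    ItemResult(True, True),
    ItemResult(True, True),
    ItemResult(True, False),
    ItemResult(True, False),
    ItemResult(False, False),
]
print(f"retained: {retained_fraction(toy):.2f}")
```

On this toy set the full-audio accuracy is 4/5 and the text-only accuracy is 2/5, so half of the score survives removing the audio. A high ratio on a real benchmark would suggest that many items are answerable from the question text and general knowledge alone.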
Related papers
- Towards Holistic Evaluation Of Large Audio-language Models: A Comprehensive Survey (2026) · 9.75
- Measuring Audio's Impact On Correctness: Audio-contribution-aware Post-training Of Large Audio Language Models (2025) · 0.00
- Aqua-bench: Beyond Finding Answers To Knowing When There Are None In Audio Question Answering (2026) · 0.00
- Audiobench: A Universal Benchmark For Audio Large Language Models (2024) · 10.21
- Step-audio-r1.5 Technical Report (2026) · 0.00
- Putting HUMANS First: Efficient LAM Evaluation With Human Preference Alignment (2026) · 0.00
- Walking Through Uncertainty: An Empirical Study Of Uncertainty Estimation For Audio-aware Large Language Models (2026) · 0.00
- Audiotoolagent: An Agentic Framework For Audio-language Models (2025) · 2.60