Measuring Audio's Impact On Correctness: Audio-contribution-aware Post-training Of Large Audio Language Models
2025 · Haolin He, Xingjian Du, Renhe Sun, et al.
Abstract
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio
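The zero audio-contribution phenomenon described in the abstract can be illustrated with a small sketch. It assumes a hypothetical text-only predictor that sees the question and options but never the audio; items it answers correctly are flagged as solvable from text alone. The names `MCQItem`, `split_by_audio_contribution`, and `longest_option_baseline` are illustrative assumptions, not part of AudioMCQ or the authors' method, and the excerpt does not state the paper's actual criterion.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class MCQItem:
    question: str
    options: Sequence[str]
    answer_index: int  # index of the correct option


# Hypothetical interface: the predictor receives text only, never the audio clip.
TextOnlyPredictor = Callable[[str, Sequence[str]], int]


def split_by_audio_contribution(
    items: Sequence[MCQItem], predict: TextOnlyPredictor
) -> Tuple[List[MCQItem], List[MCQItem]]:
    """Split items into (answered correctly from text alone, not answered from text)."""
    text_solvable: List[MCQItem] = []
    audio_needed: List[MCQItem] = []
    for item in items:
        guess = predict(item.question, item.options)
        if guess == item.answer_index:
            text_solvable.append(item)
        else:
            audio_needed.append(item)
    return text_solvable, audio_needed


def longest_option_baseline(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a text-only model: always pick the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))


# Made-up items for illustration only.
ITEMS = [
    MCQItem("Which instrument is playing?",
            ["piano", "violin", "a large church organ"], 2),
    MCQItem("What is the speaker's emotion?",
            ["happy", "sad", "surprised"], 1),
]

text_solvable, audio_needed = split_by_audio_contribution(ITEMS, longest_option_baseline)
```

A single correct text-only guess can be due to chance, so a practical check would likely aggregate over several predictors or repeated runs before labeling an item; that refinement is left out of this sketch.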
Related papers
- Exploring Fine-tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data (2025) 0.00
- Towards Holistic Evaluation Of Large Audio-language Models: A Comprehensive Survey (2026) 9.75
- MATS: An Audio Language Model Under Text-only Supervision (2025) 0.00
- AudioToolAgent: An Agentic Framework For Audio-language Models (2025) 2.60
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024) 5.24
- All That Glitters Is Not Audio: Rethinking Text Priors And Audio Reliance In Audio-language Evaluation (2026) 0.00
- AudioBench: A Universal Benchmark For Audio Large Language Models (2024) 10.21
- From Alignment To Advancement: Bootstrapping Audio-language Alignment With Synthetic Data (2025) 2.26