AudioToolAgent: An Agentic Framework for Audio-Language Models
2025 · Gijs Wijngaard, Elia Formisano, Michel Dumontier, et al.
Abstract
Large Audio-Language Models (LALMs) perform well on audio understanding tasks but lack the multistep reasoning and tool calling found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio-language models as tools via a central LLM agent that accesses tool adapters for audio question answering and speech-to-text. The agent reasons about which tools to invoke, how to formulate follow-up queries, and how to arbitrate conflicting tool outputs, without accessing the audio itself. Experiments on MMAU, MMAR, and MMAU-Pro show state-of-the-art accuracy: up to 77.50% on MMAU, 77.00% on MMAR, and 61.90% on MMAU-Pro. A Shapley-based analysis identifies effective agent-tool combinations. The code and reproduction materials are available at https://github.com/GLJS/AudioToolAgent.
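As a rough illustration of what a Shapley-based analysis of tool combinations can look like, the sketch below computes exact Shapley values for a two-tool setup. It is not taken from the paper: the tool names (`audio_qa`, `asr`), the accuracy figures in `TOY_ACCURACY`, and the choice of benchmark accuracy as the coalition value are all assumptions made for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a coalition value function.

    players: list of hashable player names (here: tools).
    value:   callable mapping a frozenset of players to a number.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical accuracies for each tool subset (made-up numbers,
# not results from the paper).
TOY_ACCURACY = {
    frozenset(): 0.25,
    frozenset({"audio_qa"}): 0.60,
    frozenset({"asr"}): 0.45,
    frozenset({"audio_qa", "asr"}): 0.70,
}

phi = shapley_values(["audio_qa", "asr"], TOY_ACCURACY.__getitem__)
for tool, contribution in phi.items():
    print(f"{tool}: {contribution:.3f}")
```

In this toy setting the values sum to the gain of the full tool set over the empty set (0.70 - 0.25), which is the efficiency property that makes Shapley values a convenient way to attribute accuracy to individual tools. Exact enumeration grows exponentially with the number of tools, so larger tool pools would typically need sampling-based approximations.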