Step-Audio-R1.5 Technical Report
2026 · Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu, et al.
Abstract
arXiv:2604.25719v1
Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the success of text-based reasoning models -- overwhelmingly relies on Reinforcement Learning with Verified Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium into a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR
Related papers
- Thinking In Cocktail Party: Chain-of-thought And Reinforcement Learning For Target Speaker Automatic Speech Recognition (2025) 0.00
- Deepsound-v1: Start To Think Step-by-step In The Audio Generation From Videos (2025) 0.00
- All That Glitters Is Not Audio: Rethinking Text Priors And Audio Reliance In Audio-language Evaluation (2026) 0.00
- Measuring Audio's Impact On Correctness: Audio-contribution-aware Post-training Of Large Audio Language Models (2025) 0.00
- Towards Holistic Evaluation Of Large Audio-language Models: A Comprehensive Survey (2026) 9.75
- RALL-E: Robust Codec Language Modeling With Chain-of-thought Prompting For Text-to-speech Synthesis (2024) 0.00
- Audiotoolagent: An Agentic Framework For Audio-language Models (2025) 2.60
- Internalizing ASR With Implicit Chain Of Thought For Efficient Speech-to-speech Conversational LLM (2024) 0.00