Mobile-R1: Towards Interactive Capability For VLM-based Mobile Agent Via Systematic Training
2026 · Jihao Gu, Qihang Ai, Yingyao Wang, et al.
Abstract
arXiv:2506.20332v4. Vision-language model-based mobile agents have gained the ability to understand complex instructions and mobile screenshots, benefiting from reinforcement learning paradigms like Group Relative Policy Optimization (GRPO). However, existing approaches that center on offline training or local action-level rewards often trap agents in local optima, hindering effective exploration and error correction in the environment. Crucially, we find that directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions. To address these challenges, we present Mobile-R1, a systematic training recipe that bridges atomic action execution and strategic task completion. We propose a hierarchical curriculum consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-le
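To make the sparse-reward remark concrete, here is a minimal sketch of the group-relative advantage normalization that GRPO is generally described as using: each rollout's reward is standardized against the other rollouts sampled for the same task. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; none of them come from the Mobile-R1 paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    Hypothetical helper: the name, `eps`, and the choice of population
    standard deviation are illustrative assumptions, not the paper's code.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where no rollout completes the task: every advantage is zero,
# so the group contributes no learning signal.
all_fail = group_relative_advantages([0, 0, 0, 0])

# A group with one success: that rollout gets a positive advantage,
# the failures get negative ones.
one_success = group_relative_advantages([1, 0, 0, 0])
```

Under these assumptions, a group whose task-level rewards are all identical yields zero advantages, which is one plausible reading of why purely sparse task-level rewards can make convergence difficult.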
Related papers
- InquireMobile: Teaching VLM-based Mobile Agent To Request Human Assistance Via Reinforcement Fine-tuning (2026)
- Efficient Multi-turn RL For GUI Agents Via Decoupled Training And Adaptive Data Curation (2025)
- Digi-Q: Learning Q-value Functions For Training Device-control Agents (2025)
- Enhancing Vision-language Model Training With Reinforcement Learning In Synthetic Worlds For Real-world Success (2025)
- What You Think Is What You See: Driving Exploration In VLM Agents Via Visual-linguistic Curiosity (2026)
- DistRL: An Asynchronous Distributed Reinforcement Learning Framework For On-device Control Agents (2024)
- K^2-Agent: Co-evolving Know-what And Know-how For Hierarchical Mobile Device Control (2026)
- Closed-loop Vision-language Planning For Multi-agent Coordination (2026)