RL Token: Bootstrapping Online RL With Vision-language-action Models
2026 · Charles Xu, Jost Tobias Springenberg, Michael Equi, et al.
Abstract
arXiv:2604.23073v2. Vision-language-action (VLA) models can learn to perform diverse manipulation skills "out of the box," but achieving the precision and speed that real-world tasks demand requires further fine-tuning -- for example, via reinforcement learning (RL). We introduce a lightweight method that enables sample-efficient online RL fine-tuning of pretrained VLAs using just a few hours of real-world practice. We (1) adapt the VLA to expose an "RL token," a compact readout representation that preserves task-relevant pretrained knowledge while serving as an efficient interface for online RL, and (2) train a small actor-critic head on this RL token to refine the actions, while anchoring the learned policy to the VLA. Online RL with the RL token (RLT) makes it possible to fine-tune even large VLAs with RL quickly and efficiently. Across four real-robot tasks (screw installation, zip tie fastening, charger insertion, and Ethernet insertion), RLT impro
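The two components named in the abstract (a compact RL token read out from the VLA, and a small actor-critic head that refines actions while staying anchored to the VLA) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the dimensions, the MLP sizes, the residual form of the refinement, the squared-distance anchor term, and the weight `beta`. The RL token and the VLA action are random placeholders standing in for outputs of a frozen pretrained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not state any dimensions.
TOKEN_DIM = 64    # size of the RL token read out from the VLA
ACTION_DIM = 7    # size of one robot action
HIDDEN = 32       # width of the small actor and critic heads


def init_mlp(in_dim, hidden, out_dim):
    """Create parameters for a one-hidden-layer MLP."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    """Forward pass: tanh hidden layer, linear output."""
    h = np.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]


actor = init_mlp(TOKEN_DIM + ACTION_DIM, HIDDEN, ACTION_DIM)
critic = init_mlp(TOKEN_DIM + ACTION_DIM, HIDDEN, 1)


def refine_action(rl_token, vla_action, scale=0.1):
    """Assumed residual refinement: VLA action plus a bounded correction."""
    inputs = np.concatenate([rl_token, vla_action], axis=-1)
    delta = np.tanh(mlp(actor, inputs))  # each entry lies in (-1, 1)
    return vla_action + scale * delta


def actor_loss(rl_token, vla_action, beta=1.0):
    """Assumed objective: maximize Q while penalizing drift from the VLA action."""
    action = refine_action(rl_token, vla_action)
    q = mlp(critic, np.concatenate([rl_token, action], axis=-1))[..., 0]
    anchor = np.sum((action - vla_action) ** 2, axis=-1)
    return np.mean(-q + beta * anchor)


# Placeholder batch standing in for outputs of a frozen VLA.
rl_token = rng.normal(size=(4, TOKEN_DIM))
vla_action = rng.normal(size=(4, ACTION_DIM))

refined = refine_action(rl_token, vla_action)
loss = actor_loss(rl_token, vla_action)
```

In this sketch only the small `actor` and `critic` heads would be trained online; the large model that produces `rl_token` and `vla_action` stays fixed, which is one plausible reading of why such a setup could be sample-efficient. The sketch computes the loss only and leaves out gradient updates, replay buffers, and critic training.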
Related papers
- Enhancing Vision-language Model Training With Reinforcement Learning In Synthetic Worlds For Real-world Success (2025)
- Simple Recipe Works: Vision-language-action Models Are Natural Continual Learners With Reinforcement Learning (2026)
- Discovering Failure Modes In Vision-language Models Using RL (2026)
- Viva: Video-trained Value Functions For Guiding Online RL From Diverse Data (2025)
- AWAC: Accelerating Online Reinforcement Learning With Offline Datasets (2020)
- SAC-GLAM: Improving Online RL For LLM Agents With Soft Actor-critic And Hindsight Relabeling (2024)
- Mobile-R1: Towards Interactive Capability For VLM-based Mobile Agent Via Systematic Training (2026)
- Efficient Multi-turn RL For GUI Agents Via Decoupled Training And Adaptive Data Curation (2025)