Enhancing Vision-language Model Training With Reinforcement Learning In Synthetic Worlds For Real-world Success
2025 · George Bredis, Stanislav Dereka, Viacheslav Sinii, et al.
Abstract
Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards,
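The decoupling described in the abstract (PPO updates on action tokens, value learning only per environment step) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the dictionary layout of a step, the squared-error value loss, and the clipping constant `clip_eps=0.2` are all assumptions made for illustration, and plain Python floats stand in for model outputs.

```python
import math

def ppo_token_loss(new_logps, old_logps, advantage, clip_eps=0.2):
    """Clipped PPO surrogate averaged over the action tokens of one step.

    Assumption: one step-level advantage is shared by every action token.
    """
    losses = []
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)  # per-token probability ratio
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        losses.append(-min(ratio * advantage, clipped * advantage))
    return sum(losses) / len(losses)

def step_value_loss(value_pred, return_target):
    """Squared error on a single value prediction per environment step."""
    return 0.5 * (value_pred - return_target) ** 2

def decoupled_losses(steps, clip_eps=0.2):
    """Return (policy_loss, value_loss) as two separate quantities.

    Each step is a dict with hypothetical keys: new_logps, old_logps,
    advantage, value, ret. The losses are not combined with a weight.
    """
    policy = sum(
        ppo_token_loss(s["new_logps"], s["old_logps"], s["advantage"], clip_eps)
        for s in steps
    ) / len(steps)
    value = sum(step_value_loss(s["value"], s["ret"]) for s in steps) / len(steps)
    return policy, value

if __name__ == "__main__":
    demo = [{"new_logps": [-1.0, -2.0], "old_logps": [-1.0, -2.0],
             "advantage": 0.5, "value": 1.0, "ret": 2.0}]
    print(decoupled_losses(demo))
```

In this sketch the policy loss is computed per token while the value loss sees only one scalar per environment step, and the two are returned separately rather than summed with a tunable coefficient. How the two losses are actually applied to the model is not stated in the abstract and is left open here.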
Related papers
- RL Token: Bootstrapping Online RL With Vision-language-action Models (2026)
- Simple Recipe Works: Vision-language-action Models Are Natural Continual Learners With Reinforcement Learning (2026)
- Efficient Multi-turn RL For GUI Agents Via Decoupled Training And Adaptive Data Curation (2025)
- Discovering Failure Modes In Vision-language Models Using RL (2026)
- Closed-loop Vision-language Planning For Multi-agent Coordination (2026)
- Value Augmented Sampling For Language Model Alignment And Personalization (2024)
- What You Think Is What You See: Driving Exploration In VLM Agents Via Visual-linguistic Curiosity (2026)
- Odysseus: Scaling VLMs To 100+ Turn Decision-making In Games Via Reinforcement Learning (2026)