What You Think Is What You See: Driving Exploration In VLM Agents Via Visual-linguistic Curiosity
2026 · Haoxi Li, Qinglin Hou, Jianfei Ma, et al.
Abstract
arXiv:2605.03782v1
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the "known unknown" required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, s
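The abstract does not say how the discrepancy is measured, how the target network evolves, or how the curiosity signal enters the reward. The following is therefore only a minimal sketch under stated assumptions, not the authors' implementation: the discrepancy is taken to be one minus cosine similarity between a predicted embedding and the target network's embedding of the actual observation, the target network is assumed to follow an exponential moving average of an online network, and the bonus is added to the task reward with a weight `beta`. All function and parameter names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def curiosity_bonus(predicted_embedding, observed_embedding):
    """Assumed intrinsic signal: 1 - cosine similarity, so it lies in [0, 2].

    predicted_embedding: what the agent's linguistic prediction maps to.
    observed_embedding: target-network embedding of the real next observation.
    """
    return 1.0 - cosine_similarity(predicted_embedding, observed_embedding)


def ema_update(target_params, online_params, tau=0.99):
    """Assumed target-network rule: exponential moving average of online weights."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


def shaped_reward(extrinsic, predicted_embedding, observed_embedding, beta=0.1):
    """Assumed combination: task reward plus a weighted curiosity bonus."""
    bonus = curiosity_bonus(predicted_embedding, observed_embedding)
    return extrinsic + beta * bonus
```

In this sketch a prediction that matches the observation yields no bonus, while a prediction orthogonal to it yields a bonus of 1, so in a sparse-reward task the agent is still rewarded for reaching states its world model mispredicts.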
Related papers
- See Further, Think Deeper: Advancing VLM's Reasoning Ability With Low-level Visual Cues And Reflection (2026)
- Curiosity-driven Exploration Via Latent Bayesian Surprise (2021)
- From Curiosity To Competence: How World Models Interact With The Dynamics Of Exploration (2025)
- Wonder Wins Ways: Curiosity-driven Exploration Through Multi-agent Contextual Calibration (2025)
- Enhancing Vision-language Model Training With Reinforcement Learning In Synthetic Worlds For Real-world Success (2025)
- Closed-loop Vision-language Planning For Multi-agent Coordination (2026)
- Discovering Failure Modes In Vision-language Models Using RL (2026)
- Curiosity-driven Multi-agent Exploration With Mixed Objectives (2022)