See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

2026


Abstract

arXiv:2604.24339v1

Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework, ForeSight, which enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is designed to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invo
