Pixel-grounded Retrieval For Knowledgeable Large Multimodal Models
2026 · Jeonghwan Kim, Renjie Tao, Sanat Sharma, et al.
Abstract
Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries, eliminating the reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization, yi
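The search-interleaved generation described in the abstract can be illustrated with a toy control loop. This is only a sketch under stated assumptions, not the authors' implementation: the abstract names the `<search>` token and the three query modalities (text, image, region), but the convention that a modality tag follows `<search>`, the retriever interface, and every function name below (`apply_mask`, `build_query`, `run_generation`) are hypothetical, and the "model" is replaced by a pre-scripted token list.

```python
from typing import Callable, Dict, List, Optional, Tuple

SEARCH_TOKEN = "<search>"  # token name taken from the abstract

Image = List[List[int]]    # toy grayscale image: rows of pixel values
Mask = List[List[int]]     # toy binary mask with the same shape as the image


def apply_mask(image: Image, mask: Mask) -> Image:
    """Zero out every pixel outside the mask so the region can act as a visual query."""
    return [[px if keep else 0 for px, keep in zip(row, mask_row)]
            for row, mask_row in zip(image, mask)]


def build_query(modality: Optional[str], question: str,
                image: Image, mask: Optional[Mask]) -> object:
    """Build the query payload for the chosen modality (hypothetical interface)."""
    if modality == "text":
        return question
    if modality == "image":
        return image
    if modality == "region":
        if mask is None:
            raise ValueError("a region query needs a mask")
        return apply_mask(image, mask)
    raise ValueError(f"unknown query modality: {modality!r}")


def run_generation(tokens: List[str], question: str, image: Image,
                   mask: Optional[Mask],
                   retrievers: Dict[str, Callable[[object], List[str]]]
                   ) -> Tuple[str, List[str]]:
    """Walk a scripted token stream and call a retriever at each <search> token.

    Assumption: the token right after <search> names the query modality.
    Returns the plain answer text and the list of retrieved passages.
    """
    answer: List[str] = []
    retrieved: List[str] = []
    stream = iter(tokens)
    for tok in stream:
        if tok == SEARCH_TOKEN:
            modality = next(stream, None)  # hypothetical modality tag
            query = build_query(modality, question, image, mask)
            retrieved.extend(retrievers[modality](query))
        else:
            answer.append(tok)
    return " ".join(answer), retrieved


# Toy usage with stub retrievers that just echo the query they received.
image = [[1, 2], [3, 4]]
mask = [[1, 0], [0, 1]]
retrievers = {
    "text": lambda q: [f"text hit for {q}"],
    "image": lambda q: [f"image hit for {q}"],
    "region": lambda q: [f"region hit for {q}"],
}
answer, retrieved = run_generation(
    ["The", SEARCH_TOKEN, "region", "answer"],
    "What is this?", image, mask, retrievers,
)
```

In a real system the retrieved passages would be fed back into the model's context before generation continues; the sketch only collects them to keep the control flow visible.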
Related papers
- Cross-modal Retrieval For Knowledge-based Visual Question Answering (2024) · 7.81
- End-to-end Knowledge Retrieval With Multi-modal Queries (2023) · 8.35
- OMGM: Orchestrate Multiple Granularities And Modalities For Efficient Multimodal Retrieval (2025) · 0.00
- Object Retrieval For Visual Question Answering With Outside Knowledge (2024) · 0.00
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023) · 5.24
- Murag: Multimodal Retrieval-augmented Generator For Open Question Answering Over Images And Text (2022) · 14.66
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026) · 0.00
- REVEAL: Retrieval-augmented Visual-language Pre-training With Multi-source Multimodal Knowledge Memory (2022) · 13.65