R4: Retrieval-augmented Reasoning For Vision-language Models In 4D Spatio-temporal Space
2025 · Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, et al.
Abstract
Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in
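The retrieval step described in the abstract (observations anchored in metric space and time, queried by semantic, spatial, and temporal keys) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and is not taken from the paper: the names Observation, token_overlap, and retrieve are hypothetical, the word-overlap score is a stand-in for whatever semantic matching R4 actually uses, and the radius and time-window filters are just one simple way to apply spatial and temporal keys.

```python
from dataclasses import dataclass
import math


@dataclass
class Observation:
    """Hypothetical memory entry: a description anchored in space and time."""
    description: str   # object-level semantic description
    position: tuple    # (x, y, z) in metres, assumed shared metric frame
    timestamp: float   # seconds since some assumed reference time


def token_overlap(query: str, text: str) -> float:
    """Toy semantic score: fraction of query words found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def retrieve(memory, semantic_key, center, radius, t_start, t_end, top_k=3):
    """Filter by the spatial and temporal keys, then rank by the semantic key."""
    candidates = [
        o for o in memory
        if math.dist(o.position, center) <= radius
        and t_start <= o.timestamp <= t_end
    ]
    # Stable sort: ties keep their original insertion order.
    candidates.sort(
        key=lambda o: token_overlap(semantic_key, o.description),
        reverse=True,
    )
    return candidates[:top_k]


memory = [
    Observation("red car parked near entrance", (1.0, 2.0, 0.0), 10.0),
    Observation("red car leaving the lot", (30.0, 2.0, 0.0), 50.0),
    Observation("pedestrian with umbrella", (2.0, 3.0, 0.0), 12.0),
]

hits = retrieve(
    memory,
    semantic_key="red car",
    center=(0.0, 0.0, 0.0),
    radius=5.0,
    t_start=0.0,
    t_end=20.0,
)
for obs in hits:
    print(obs.timestamp, obs.position, obs.description)
```

In this sketch the second observation is dropped because it lies outside both the 5 m radius and the 0 to 20 s window, and the remaining two are ordered by the toy semantic score. The retrieved descriptions would then be passed to the VLM as context.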
Related papers
- Open-world 3D Scene Graph Generation For Retrieval-augmented Reasoning (2025)
- V-retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026)
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024)
- Leveraging Retrieval-augmented Tags For Large Vision-language Understanding In Complex Scenes (2024)
- Reasoning-augmented Representations For Multimodal Retrieval (2026)
- Remote Sensing Retrieval-augmented Generation: Bridging Remote Sensing Imagery And Comprehensive Knowledge With A Multi-modal Dataset And Retrieval-augmented Generation Model (2025)
- RAVU: Retrieval Augmented Video Understanding With Compositional Reasoning Over Graph (2025)
- VISOR: Agentic Visual Retrieval-augmented Generation Via Iterative Search And Over-horizon Reasoning (2026)