RAVU: Retrieval Augmented Video Understanding With Compositional Reasoning Over Graph
2025 · Sameer Malik, Moyuru Yamada, Ayush Singh, et al.
Abstract
Comprehending long videos remains a significant challenge for Large Multi-modal Models (LMMs). Current LMMs struggle to process videos that run from even a few minutes to hours because they lack explicit memory and retrieval mechanisms. To address this limitation, we propose RAVU (Retrieval Augmented Video Understanding), a novel framework for video understanding enhanced by retrieval with compositional reasoning over a spatio-temporal graph. We construct a graph representation of the video that captures both spatial and temporal relationships between entities. This graph serves as a long-term memory, allowing us to track objects and their actions across time. To answer complex queries, we decompose each query into a sequence of reasoning steps and execute these steps on the graph, retrieving the relevant key information. Our approach enables more accurate understanding of long videos, particularly for queries that require multi-hop reasoning and tracking objects across frames. Our approach demonstrates super
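To make the idea in the abstract concrete, here is a minimal Python sketch of a spatio-temporal graph used as memory, with a query decomposed into reasoning steps that run on it. It is an illustration under stated assumptions, not the authors' implementation: the `VideoGraph` class, its methods, the triple format, and the example query are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class VideoGraph:
    """Toy spatio-temporal graph (hypothetical structure, not from the paper).

    spatial[frame] holds (subject, relation, object) triples seen in that frame.
    Temporal links are implicit: the same entity name across frames is one track.
    """
    spatial: dict = field(default_factory=dict)

    def add(self, frame, subj, rel, obj):
        self.spatial.setdefault(frame, []).append((subj, rel, obj))

    def track(self, entity):
        """Temporal step: sorted frames in which the entity appears."""
        return sorted(
            f for f, triples in self.spatial.items()
            if any(entity in (s, o) for s, _, o in triples)
        )

    def relations(self, frame, subj):
        """Spatial step: (relation, object) pairs for subj in one frame."""
        return [(r, o) for s, r, o in self.spatial.get(frame, []) if s == subj]


def first_frame_with(graph, subj, rel, obj):
    """Step 1: locate the first frame where subj does (rel, obj)."""
    for f in graph.track(subj):
        if (rel, obj) in graph.relations(f, subj):
            return f
    return None


def what_happens_after(graph, subj, rel, obj):
    """Steps 2 and 3: follow subj's track past that frame and collect its actions."""
    start = first_frame_with(graph, subj, rel, obj)
    if start is None:
        return []
    found = []
    for f in graph.track(subj):
        if f > start:
            found.extend(graph.relations(f, subj))
    return found


# Hypothetical example video: a person in a kitchen, a cat in the background.
g = VideoGraph()
g.add(0, "person", "holds", "cup")
g.add(5, "person", "opens", "fridge")
g.add(9, "person", "pours", "milk")
g.add(9, "cat", "sits_on", "sofa")

# "What does the person do after opening the fridge?"
print(what_happens_after(g, "person", "opens", "fridge"))
```

The query is answered by chaining a lookup, a temporal hop along the entity's track, and a spatial read in each later frame, so only the retrieved triples (rather than every frame) would need to be passed on to a model.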
Related papers
- Fine-grained Video-text Retrieval With Hierarchical Graph Reasoning (2020) 18.27
- TV-RAG: A Temporal-aware And Semantic Entropy-weighted Framework For Long Video Retrieval And Understanding (2025) 2.86
- SALOVA: Segment-augmented Long Video Assistant For Targeted Retrieval And Routing In Long-form Video Analysis (2024) 0.00
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025) 2.26
- ICSVR: Investigating Compositional And Syntactic Understanding In Video Retrieval Models (2023) 8.92
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024) 0.00
- Covr-r: Reason-aware Composed Video Retrieval (2026) 2.02
- MAGNET: A Multi-agent Framework For Finding Audio-visual Needles By Reasoning Over Multi-video Haystacks (2025) 0.00