Going Down Memory Lane: Scaling Tokens For Video Stream Understanding With Dynamic KV-Cache Memory
2026 · Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, et al.
Abstract
Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStr
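The query-frame similarity retrieval that the abstract refers to can be illustrated with a minimal sketch. This is not the authors' method: it assumes each cached frame is a list of token vectors, mean-pools them into one vector per frame, and ranks frames by cosine similarity to a query vector. The names `mean_pool`, `retrieve_frames`, and the toy `cache` are hypothetical, and the sketch does not attempt to correct the bias toward later frames that the abstract reports.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pool(tokens):
    """Average a frame's token vectors into a single frame-level vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def retrieve_frames(query, frame_tokens, k=2):
    """Return (indices of the k most similar frames in temporal order, all scores)."""
    scores = [cosine(query, mean_pool(tokens)) for tokens in frame_tokens]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top), scores


# Toy cache (assumed data): three frames, two 2-d token vectors per frame.
cache = [
    [[1.0, 0.0], [0.9, 0.1]],  # frame 0
    [[0.0, 1.0], [0.1, 0.9]],  # frame 1
    [[0.7, 0.7], [0.6, 0.8]],  # frame 2
]
selected, scores = retrieve_frames([1.0, 0.0], cache, k=2)
print(selected)
```

With this toy cache the query aligns best with frame 0 and next best with frame 2, so those two frames are selected. In a dense stream, a scheme like this inherits whatever drift the frame features carry, which is the failure mode the abstract describes.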
Related papers
- VQToken: Neural Discrete Token Representation Learning For Extreme Token Reduction In Video Large Language Models (2025) · 0.00
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025) · 2.26
- TempMe: Video Temporal Token Merging For Efficient Text-video Retrieval (2024) · 2.86
- Encode The Unseen: Predictive Video Hashing For Scalable Mid-stream Retrieval (2020) · 3.58
- TV-RAG: A Temporal-aware And Semantic Entropy-weighted Framework For Long Video Retrieval And Understanding (2025) · 2.86
- CREAM: Continual Retrieval On Dynamic Streaming Corpora With Adaptive Soft Memory (2026) · 0.00
- RAVU: Retrieval Augmented Video Understanding With Compositional Reasoning Over Graph (2025) · 0.00
- Generative Recall, Dense Reranking: Learning Multi-view Semantic IDs For Efficient Text-to-video Retrieval (2026) · 0.00