SALOVA: Segment-augmented Long Video Assistant For Targeted Retrieval And Routing In Long-form Video Analysis
2024 · Junho Kim, Hyunjun Kim, Hosu Lee, et al.
Abstract
Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through a targeted retrieval process. We address two main challenges to achieve this: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating a dynamic routing mechanism and spatio-tempo
Related papers
- Lovr: A Benchmark For Long Video Retrieval In Multimodal Contexts (2025) 0.00
- RAVU: Retrieval Augmented Video Understanding With Compositional Reasoning Over Graph (2025) 0.00
- TV-RAG: A Temporal-aware And Semantic Entropy-weighted Framework For Long Video Retrieval And Understanding (2025) 2.86
- LOVO: Efficient Complex Object Query In Large-scale Video Datasets (2025) 2.26
- Context-enhanced Video Moment Retrieval With Large Language Models (2024) 5.84
- Multimodal Lengthy Videos Retrieval Framework And Evaluation Metric (2025) 0.00
- Momentseeker: A Task-oriented Benchmark For Long-video Moment Retrieval (2025) 0.00
- Semantic Video Moments Retrieval At Scale: A New Task And A Baseline (2022) 0.00