TV-RAG: A Temporal-aware And Semantic Entropy-weighted Framework For Long Video Retrieval And Understanding
2025 · Zongsheng Cao, Yangfan He, Anran Liu, et al.
Abstract
Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: *(i)* a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and *(ii)* an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing
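The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the exponential decay form, the `decay` parameter, the `anchor_time` argument, the pixel-histogram entropy (a stand-in for the paper's semantic entropy), and the one-frame-per-window rule are all assumptions made here for illustration, as are the function names.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def time_decay_scores(
    query_vec: Sequence[float],
    segment_vecs: Sequence[Sequence[float]],
    segment_times: Sequence[float],
    anchor_time: float,
    decay: float = 0.01,
) -> List[float]:
    """Score each segment by similarity damped by its temporal offset.

    Assumed form (not from the paper):
        score = cos(query, segment) * exp(-decay * |t - anchor_time|)
    """
    return [
        cosine(query_vec, v) * math.exp(-decay * abs(t - anchor_time))
        for v, t in zip(segment_vecs, segment_times)
    ]


def histogram_entropy(pixels: Sequence[int], bins: int = 16) -> float:
    """Shannon entropy in bits of an intensity histogram (values 0..255).

    Assumption: pixel-level entropy stands in for semantic entropy.
    """
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def select_keyframes(frames: Sequence[Sequence[int]], k: int) -> List[int]:
    """Split the video into k contiguous windows and return the index of
    the highest-entropy frame in each, so picks stay roughly evenly spaced."""
    n = len(frames)
    picked = []
    for w in range(k):
        start, end = w * n // k, (w + 1) * n // k
        if start < end:  # skip empty windows when k > n
            picked.append(
                max(range(start, end), key=lambda i: histogram_entropy(frames[i]))
            )
    return picked
```

In this sketch a segment far from `anchor_time` is ranked lower even when its embedding matches the query exactly, and each window contributes at most one frame, which keeps the selection spread across the video instead of clustering on one busy scene.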
Related papers
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025)
- RAVU: Retrieval Augmented Video Understanding With Compositional Reasoning Over Graph (2025)
- Lazyvlm: Neuro-symbolic Approach To Video Analytics (2025)
- SALOVA: Segment-augmented Long Video Assistant For Targeted Retrieval And Routing In Long-form Video Analysis (2024)
- A Clip-hitchhiker's Guide To Long Video Retrieval (2022)
- Lovr: A Benchmark For Long Video Retrieval In Multimodal Contexts (2025)
- Tempme: Video Temporal Token Merging For Efficient Text-video Retrieval (2024)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)