TV-RAG: A Temporal-aware And Semantic Entropy-weighted Framework For Long Video Retrieval And Understanding

Abstract

Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: *(i)* a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and *(ii)* an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing
