Abstract

Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: *(i)* a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and *(ii)* an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing

Authors

(none)

Tags

  • Image Retrieval

Stats

  • citations: 1
  • S2 citations: n/a
  • github stars: 1
  • HF likes: 0
  • heat score: 2.86
  • arxiv key: cao2025tv

Related papers