PREGEN: Uncovering Latent Thoughts In Composed Video Retrieval
2026 · Gabriele Serussi, David Vainshtein, Jonathan Kouchly, et al.
Abstract
Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates
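The pipeline described in the abstract (take the final token's hidden state from every layer of a frozen VLM, then feed it to a small trained encoder) can be sketched as follows. This is a minimal illustration, not the authors' implementation: random arrays stand in for the VLM's hidden states, and the names `last_token_per_layer`, `TinyEncoder`, and `rank_gallery` are hypothetical. The abstract does not specify the encoder's architecture, so the softmax layer weighting and single linear projection are assumptions, and the weights here are untrained.

```python
import numpy as np


def last_token_per_layer(hidden_states):
    """Stack the final token's hidden state from each layer.

    hidden_states: list of arrays, one per layer, each (seq_len, hidden_dim).
    Returns an array of shape (num_layers, hidden_dim).
    """
    return np.stack([h[-1] for h in hidden_states], axis=0)


class TinyEncoder:
    """Hypothetical lightweight encoder (assumed design, not from the paper):
    a softmax-weighted sum over layers followed by a linear projection."""

    def __init__(self, num_layers, hidden_dim, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        # In a real setup these parameters would be trained; the VLM stays frozen.
        self.layer_logits = np.zeros(num_layers)
        self.proj = rng.normal(0.0, hidden_dim ** -0.5, size=(hidden_dim, embed_dim))

    def __call__(self, layer_states):
        weights = np.exp(self.layer_logits - self.layer_logits.max())
        weights /= weights.sum()
        pooled = weights @ layer_states          # (hidden_dim,)
        z = pooled @ self.proj                   # (embed_dim,)
        return z / np.linalg.norm(z)             # unit norm for cosine retrieval


def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted by descending cosine similarity."""
    scores = gallery_embs @ query_emb
    return np.argsort(-scores)


# Stand-in for the frozen VLM's per-layer outputs on (query video + modifying text).
rng = np.random.default_rng(0)
num_layers, seq_len, hidden_dim, embed_dim = 4, 6, 16, 8
hidden_states = [rng.normal(size=(seq_len, hidden_dim)) for _ in range(num_layers)]

layer_states = last_token_per_layer(hidden_states)
encoder = TinyEncoder(num_layers, hidden_dim, embed_dim)
query_emb = encoder(layer_states)

# Toy gallery of unit-norm target embeddings; entry 2 is made the exact match.
gallery = rng.normal(size=(5, embed_dim))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
gallery[2] = query_emb
ranking = rank_gallery(query_emb, gallery)
```

Because only the final token of each layer is read, the sketch never needs the VLM to generate text, which matches the abstract's emphasis on avoiding slow caption generation.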
Related papers
- Composed Video Retrieval Via Enriched Context And Discriminative Embeddings (2024) 12.19
- CoVR-R: Reason-Aware Composed Video Retrieval (2026) 2.02
- X-Aligner: Composed Visual Retrieval Without The Bells And Whistles (2026) 0.00
- From Play To Replay: Composed Video Retrieval For Temporally Fine-Grained Videos (2025) 0.00
- Beyond Simple Edits: Composed Video Retrieval With Dense Modifications (2025) 2.16
- EgoCVR: An Egocentric Benchmark For Fine-Grained Composed Video Retrieval (2024) 10.00
- Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval (2026) 3.80
- ICSVR: Investigating Compositional And Syntactic Understanding In Video Retrieval Models (2023) 8.92