Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
2025 · Arun Reddy, Alexander Martin, Eugene Yang, et al.
Abstract
In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon three main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations improve performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
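The token-wise late interaction mentioned in the abstract can be illustrated with a generic ColBERT-style MaxSim score. The sketch below is not the authors' implementation: the function names, the cosine normalization, and the random toy embeddings are assumptions made for illustration, and it omits the query and visual expansions and the dual sigmoid loss entirely.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def maxsim_score(query_tokens, video_tokens):
    """Generic late-interaction (MaxSim) score.

    query_tokens: (num_query_tokens, dim) array of query token embeddings.
    video_tokens: (num_video_tokens, dim) array of visual token embeddings
                  (hypothetical; could be frame-level or patch-level features).
    For each query token, take its best-matching video token, then sum.
    """
    q = l2_normalize(query_tokens)
    v = l2_normalize(video_tokens)
    sim = q @ v.T  # (num_query_tokens, num_video_tokens) cosine similarities
    return float(sim.max(axis=1).sum())


# Toy data (assumed): random embeddings stand in for encoder outputs.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))
video_a = rng.normal(size=(6, 8))
# video_b contains exact copies of the query tokens, so every query token
# finds a perfect match (cosine similarity 1).
video_b = np.concatenate([video_a, query], axis=0)

score_a = maxsim_score(query, video_a)
score_b = maxsim_score(query, video_b)
```

Because each query token is matched independently, a video scores highly only when it contains a good match for every part of the query, which is what makes this kind of interaction fine-grained.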
Related papers
- ColBERTv2: Effective And Efficient Retrieval Via Lightweight Late Interaction (2021) 17.46
- ColBERT-att: Late-interaction Meets Attention For Enhanced Retrieval (2026) 0.00
- Disentangled Representation Learning For Text-video Retrieval (2022) 0.00
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023) 15.46
- Introducing Neural Bag Of Whole-words With ColBERTer: Contextualized Late Interactions Using Enhanced Reduction (2022) 0.00
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) 0.00
- ColBERT: Efficient And Effective Passage Search Via Contextualized Late Interaction Over BERT (2020) 0.00
- ProTA: Probabilistic Token Aggregation For Text-video Retrieval (2024) 4.52