ConViS-Bench: Estimating Video Similarity Through Semantic Concepts
2025 · Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, et al.
Abstract
What does it mean for two videos to be similar? Videos may appear similar when judged by the actions they depict, yet entirely different if evaluated based on the locations where they were filmed. While humans naturally compare videos by taking different aspects into account, this ability has not been thoroughly studied and presents a challenge for models that often depend on broad global similarity scores. Large Multimodal Models (LMMs) with video understanding capabilities open new opportunities for leveraging natural language in comparative video tasks. We introduce Concept-based Video Similarity estimation (ConViS), a novel task that compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. ConViS allows for human-like reasoning about video similarity and enables new applications such as concept-conditioned video retrieval. To support this task, we also introduce ConViS-Bench, a new benchmark comprising carefully annotat
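As a rough illustration of the concept-conditioned retrieval mentioned in the abstract, the sketch below ranks candidate videos against a query using one concept at a time. It is a minimal sketch, not the authors' method: the class `PairScores`, the function `rank_by_concept`, the video ids, the 0-to-1 score scale, and every score value are hypothetical. Only the two concept names, "action" and "location", are taken from the abstract's opening example.

```python
from dataclasses import dataclass


@dataclass
class PairScores:
    """Per-concept similarity between a query video and one candidate (hypothetical structure)."""
    candidate_id: str
    scores: dict  # concept name -> similarity; the 0..1 scale is an assumption


def rank_by_concept(pairs, concept):
    """Return candidate ids ordered from most to least similar on a single concept."""
    ordered = sorted(pairs, key=lambda p: p.scores[concept], reverse=True)
    return [p.candidate_id for p in ordered]


# Made-up scores for illustration only; they are not results from the benchmark.
pairs = [
    PairScores("video_a", {"action": 0.9, "location": 0.2}),
    PairScores("video_b", {"action": 0.3, "location": 0.8}),
    PairScores("video_c", {"action": 0.6, "location": 0.5}),
]

by_action = rank_by_concept(pairs, "action")
by_location = rank_by_concept(pairs, "location")
print(by_action)
print(by_location)
```

The same candidates come out in a different order depending on the chosen concept, which is the behaviour a single global similarity score cannot express.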
Related papers
- On Semantic Similarity In Video Retrieval (2021) · 12.81
- ViSiL: Fine-grained Spatio-temporal Video Similarity Learning (2019) · 13.70
- Sovabench: A Vehicle Surveillance Action Retrieval Benchmark For Multimodal Large Language Models (2026) · 0.00
- Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence (2020) · 2.26
- Relevance-based Margin For Contrastively-trained Video Retrieval Models (2022) · 7.74
- Semantic Video Moments Retrieval At Scale: A New Task And A Baseline (2022) · 0.00
- Self-supervised Video Similarity Learning (2023) · 13.04
- MomentSeeker: A Task-oriented Benchmark For Long-video Moment Retrieval (2025) · 0.00