ViSiL: Fine-grained Spatio-temporal Video Similarity Learning
2019 · Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, et al.
Abstract
In this paper, we introduce ViSiL, a Video Similarity Learning architecture that considers fine-grained Spatio-Temporal relations between pairs of videos. Such relations are typically lost in previous video retrieval approaches that embed the whole frame or even the whole video into a vector descriptor before the similarity estimation. By contrast, our Convolutional Neural Network (CNN)-based approach is trained to calculate video-to-video similarity from refined frame-to-frame similarity matrices, so as to consider both intra- and inter-frame relations. In the proposed method, pairwise frame similarity is estimated by applying Tensor Dot (TD) followed by Chamfer Similarity (CS) on regional CNN frame features -- this avoids feature aggregation before the similarity calculation between frames. Subsequently, the similarity matrix between all video frames is fed to a four-layer CNN, and then summarized using CS into a video-to-video similarity score -- this avoids fea
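The two similarity steps named in the abstract, Tensor Dot followed by Chamfer Similarity, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor layout (frames, regions, feature dimension), the L2 normalisation, and the choice to take the maximum over the second video and the mean over the first are all assumptions made for illustration. The four-layer CNN that refines the frame-to-frame matrix is omitted, so the video-level score below is computed on the unrefined matrix.

```python
import numpy as np


def l2_normalise(features):
    """Scale each regional feature vector to unit length (assumed preprocessing)."""
    return features / np.linalg.norm(features, axis=-1, keepdims=True)


def frame_similarity_matrix(x, y):
    """Frame-to-frame similarity from regional features.

    x: array of shape (N, R, D), N frames with R regions of dimension D.
    y: array of shape (M, R, D).
    Returns an (N, M) matrix.
    """
    # Tensor Dot over the feature dimension gives every region-to-region
    # similarity between every pair of frames: shape (N, R, M, R).
    region_sim = np.tensordot(x, y, axes=([2], [2]))
    # Chamfer Similarity over regions: for each region of a frame in x,
    # keep its best match among the regions of a frame in y, then average.
    best_match = region_sim.max(axis=3)  # (N, R, M)
    return best_match.mean(axis=1)       # (N, M)


def video_similarity(frame_sim):
    """Chamfer Similarity over frames: best match per row, then the mean."""
    return float(frame_sim.max(axis=1).mean())


# Toy usage with random features (hypothetical sizes).
rng = np.random.default_rng(0)
video_a = l2_normalise(rng.normal(size=(4, 9, 16)))
video_b = l2_normalise(rng.normal(size=(6, 9, 16)))
score = video_similarity(frame_similarity_matrix(video_a, video_b))
```

Because no descriptor is pooled before either step, a single well-matched region or frame can still contribute to the final score, which is the property the abstract emphasises.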
Related papers
- Self-supervised Video Similarity Learning (2023) 13.04
- 3D-CSL: Self-supervised 3D Context Similarity Learning For Near-duplicate Video Retrieval (2022) 6.34
- ConViS-Bench: Estimating Video Similarity Through Semantic Concepts (2025) 0.00
- Audio-based Near-duplicate Video Retrieval With Audio Similarity Learning (2020) 7.16
- Fine-tuned CLIP Models Are Efficient Video Learners (2022) 21.57
- Differentiable Resolution Compression And Alignment For Efficient Video Classification And Retrieval (2023) 5.27
- TCLR: Temporal Contrastive Learning For Video Representation (2021) 15.78
- Learning Segment Similarity And Alignment In Large-scale Content Based Video Retrieval (2023) 11.08