Self-supervised Video Retrieval Transformer Network
2021 · Xiangteng He, Yulin Pan, Mingqian Tang, et al.
Abstract
Content-based video retrieval aims to find videos in a large video database that are similar to, or even near-duplicates of, a given query video. Video representation and similarity search algorithms are crucial to any video retrieval system. To derive an effective video representation, most video retrieval systems require a large amount of manually annotated training data, which makes them costly and inefficient. In addition, most retrieval systems search for similar videos using frame-level features, which is expensive in both storage and search. We propose a novel video retrieval system, termed SVRTN, that addresses these shortcomings. It first applies self-supervised training to learn video representations from unlabeled data, avoiding the expensive cost of manual annotation. Then, it exploits a transformer structure to aggregate frame-level features into clip-level features, reducing both storage space and search complexity. It can learn the complementary a
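The abstract's second step, aggregating frame-level features into clip-level features before searching, can be illustrated with a small sketch. This is not the SVRTN method: plain mean pooling stands in for the transformer aggregation, and the function names (`aggregate_clips`, `cosine`, `search`) and the toy vectors are hypothetical. It only shows why one vector per clip means fewer vectors to store and compare.

```python
import math


def aggregate_clips(frame_features, frames_per_clip):
    """Average consecutive frame vectors into one vector per clip.

    Assumption: mean pooling is used here purely for illustration;
    the abstract describes a transformer performing this aggregation.
    """
    clips = []
    for start in range(0, len(frame_features), frames_per_clip):
        chunk = frame_features[start:start + frames_per_clip]
        dim = len(chunk[0])
        clips.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return clips


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_clip, database_clips):
    """Return (index, score) of the most similar clip in the database."""
    scores = [cosine(query_clip, clip) for clip in database_clips]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Six toy frame vectors collapse into three clip vectors (two frames per clip).
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
clips = aggregate_clips(frames, frames_per_clip=2)
```

With two frames per clip, the database holds half as many vectors, and a query is compared against clips rather than against every frame.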
Related papers
- Multi-modal Transformer For Video Retrieval (2020) 19.47
- TS2-Net: Token Shift And Selection Transformer For Text-video Retrieval (2022) 15.51
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) 15.16
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) 0.00
- TransVCL: Attention-enhanced Video Copy Localization Network With Flexible Supervision (2022) 13.47
- MDMMT-2: Multidomain Multimodal Transformer For Video Retrieval, One More Step Towards Generalization (2022) 0.00
- Training Vision Transformers For Image Retrieval (2021) 0.00
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023) 15.46