VideoCLIP: Contrastive Pre-training For Zero-shot Video-text Understanding
2021 · Hu Xu, Gargi Ghosh, Po-Yao Huang, et al.
Abstract
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
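The contrastive objective described in the abstract can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss over a batch in which `video_embs[i]` and `text_embs[i]` form a positive pair and every other item in the batch acts as a negative. The function names, the temperature value, and the use of plain in-batch negatives are illustrative assumptions; the sketch does not model the temporally overlapping pair sampling or the retrieval-based hard negatives, and it is not the code from the linked repository.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so the dot product is a cosine similarity.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.1):
    """Sum of video-to-text and text-to-video InfoNCE-style losses.

    Assumption: pair i is the positive; all other items in the batch are
    negatives. The temperature of 0.1 is a hypothetical default.
    """
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    n = len(v)
    # sim[i][j] is the scaled similarity between video i and text j.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]
    video_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_video = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return video_to_text + text_to_video


videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(videos, aligned_texts)
shuffled_loss = symmetric_contrastive_loss(videos, shuffled_texts)
print(aligned_loss < shuffled_loss)
```

In this toy batch the correctly paired embeddings give a much smaller loss than the shuffled pairing, which is the behavior the objective rewards. Harder negatives would correspond to off-diagonal entries of `sim` that are close to the diagonal ones.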
Related papers
- FitCLIP: Refining Large-scale Pretrained Image-text Models For Zero-shot Video Understanding Tasks (2022) · 1.91
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) · 0.00
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) · 18.12
- SuperCLIP: CLIP With Simple Classification Supervision (2025) · 0.00
- MedCLIP: Contrastive Learning From Unpaired Medical Images And Text (2022) · 26.02
- MobileCLIP: Fast Image-text Models Through Multi-modal Reinforced Training (2023) · 18.12
- CLIP4Clip: An Empirical Study Of CLIP For End To End Video Clip Retrieval (2021) · 6.02
- VideoCLIP-XL: Advancing Long Description Understanding For Video CLIP Models (2024) · 8.35