VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
2024 · Jiapeng Wang, Chengyu Wang, Kunzhe Huang, et al.
Abstract
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, which often contain abundant detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. First, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset of VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding the long-description capability. We also introduce two new tasks, namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), for further understanding improvement. Finally, we construct
Related papers
- Long-CLIP: Unlocking the Long-Text Capability of CLIP (2024) · 14.90
- CLIP Is Shortsighted: Paying Attention Beyond the First Sentence (2026) · 0.00
- X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval (2022) · 18.12
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment (2022) · 5.42
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval (2021) · 6.02
- FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs (2025) · 6.34
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP (2021) · 0.00
- TULIP: Token-length Upgraded CLIP (2024) · 3.04