DiscoVLA: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-efficient Video-text Retrieval
2025 · Leqi Shen, Guoqiang Gong, Tianxiang Hao, et al.
Abstract
The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-lev
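The two ideas named in the abstract, fusing image-level with video-level features and distilling image-level alignment into video-level alignment, can be illustrated with a minimal sketch. This is not the authors' implementation. The mean pooling, the mixing weight `alpha`, the cosine similarity, the softmax `temperature`, and the KL-divergence loss are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math


def mean_pool(frame_feats):
    """Average per-frame (image-level) features into one video-level feature."""
    n = len(frame_feats)
    return [sum(col) / n for col in zip(*frame_feats)]


def fuse_features(frame_feats, alpha=0.5):
    """Blend each frame feature with the pooled video-level feature.

    alpha is a hypothetical mixing weight, not a value from the paper:
    alpha=1.0 keeps the pure image-level features, alpha=0.0 keeps only
    the video-level feature.
    """
    video_feat = mean_pool(frame_feats)
    return [
        [alpha * f + (1.0 - alpha) * v for f, v in zip(frame, video_feat)]
        for frame in frame_feats
    ]


def cosine(a, b):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=0.1):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(frame_feats, text_feats, temperature=0.1):
    """KL(teacher || student) over candidate texts for one video.

    Teacher (assumed): image-level alignment, the mean per-frame similarity.
    Student (assumed): video-level alignment, similarity of the pooled feature.
    """
    teacher_scores = [
        sum(cosine(f, t) for f in frame_feats) / len(frame_feats)
        for t in text_feats
    ]
    video_feat = mean_pool(frame_feats)
    student_scores = [cosine(video_feat, t) for t in text_feats]
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]   # two toy frame features
    texts = [[1.0, 0.0], [1.0, 1.0]]    # two toy caption features
    print(fuse_features(frames, alpha=0.5))
    print(distillation_loss(frames, texts))
```

In this toy setup the loss is zero when a video has a single frame, because the image-level and video-level similarities coincide; it becomes positive once the per-frame view and the pooled view rank the candidate texts differently.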
Related papers
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022) 5.42
- Frame-difference Guided Dynamic Region Perception For CLIP Adaptation In Text-video Retrieval (2025) 0.00
- HVD: Human Vision-driven Video Representation Learning For Text-video Retrieval (2026) 0.00
- CLIP2Video: Mastering Video-text Retrieval Via Image CLIP (2021) 0.00
- Linear Alignment Of Vision-language Models For Image Captioning (2023) 0.00
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) 0.00
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) 9.76
- T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025) 0.95