Frame-difference Guided Dynamic Region Perception For CLIP Adaptation In Text-video Retrieval
2025 · Jiaao Yu, Mingjie Han, Tao Gong, et al.
Abstract
With the rapid growth of video data, text-video retrieval has become increasingly important in application scenarios such as recommendation and search. Early text-video retrieval methods suffer from two critical drawbacks: first, they rely heavily on large-scale annotated video-text pairs, leading to high data acquisition costs; second, there is a significant modality gap between video and text features, which limits cross-modal alignment accuracy. With the development of vision-language models, adapting CLIP to video tasks has attracted great attention. However, existing adaptation methods generally do not enhance dynamic video features and fail to effectively suppress redundant static features. To address these issues, this paper proposes FDA-CLIP (Frame Difference Alpha-CLIP), a concise CLIP-based training framework for text-video alignment. Specifically, the method uses frame differences to generate dynamic region masks, which are input into Alpha-CLIP
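The frame-difference idea in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the grayscale frame representation, the fixed `threshold`, the hard binary mask, and the function names `frame_difference_mask` and `dynamic_region_masks` are all assumptions made for illustration, and the step that feeds the mask into Alpha-CLIP is not shown.

```python
from typing import List

# Assumption: a frame is a grayscale image stored as rows of intensities in [0, 1].
Frame = List[List[float]]
Mask = List[List[int]]


def frame_difference_mask(prev: Frame, curr: Frame, threshold: float = 0.1) -> Mask:
    """Return a binary mask: 1 where the absolute intensity change exceeds threshold."""
    if len(prev) != len(curr) or any(len(p) != len(c) for p, c in zip(prev, curr)):
        raise ValueError("frames must have the same shape")
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def dynamic_region_masks(frames: List[Frame], threshold: float = 0.1) -> List[Mask]:
    """Compute one mask per pair of consecutive frames (len(frames) - 1 masks)."""
    return [frame_difference_mask(a, b, threshold) for a, b in zip(frames, frames[1:])]


# Toy 2x3 clip: only the top-right pixel changes between the two frames.
frames = [
    [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]],
    [[0.0, 0.0, 0.9], [0.5, 0.5, 0.5]],
]
masks = dynamic_region_masks(frames)
```

In this toy clip the static pixels map to 0 and the single changed pixel maps to 1, which is the sense in which a difference mask highlights dynamic regions and suppresses static ones. A real pipeline would more likely operate on image tensors and might smooth or soften the mask; those choices are outside what the abstract states.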
Related papers
- Prompt-aware Of Frame Sampling For Efficient Text-video Retrieval (2025) · 0.95
- DiscoVLA: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-efficient Video-text Retrieval (2025) · 6.30
- HVD: Human Vision-driven Video Representation Learning For Text-video Retrieval (2026) · 0.00
- Prompt Switch: Efficient CLIP Adaptation For Text-video Retrieval (2023) · 11.93
- CLIP4Clip: An Empirical Study Of CLIP For End To End Video Clip Retrieval (2021) · 6.02
- TeachCLIP: Multi-grained Teaching For Efficient Text-to-video Retrieval (2023) · 0.00
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) · 18.12
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025) · 5.75