T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval
2025 · Yili Li, Gang Xiong, Gaopeng Gou, et al.
Abstract
Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract corresponding representations from different modalities, we introduce Adaptive Decomposition Tokens, whic
Related papers
- ProTA: Probabilistic Token Aggregation For Text-video Retrieval (2024)
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021)
- DiscoVLA: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-efficient Video-text Retrieval (2025)
- T2VLAD: Global-local Sequence Alignment For Text-video Retrieval (2021)
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023)
- TokenBinder: Text-video Retrieval With One-to-many Alignment Paradigm (2024)
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021)
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023)