TVPR: Text-to-video Person Retrieval And A New Benchmark
2023 · Xu Zhang, Fan Ni, Guan-Nan Dong, et al.
Abstract
Most existing methods for text-based person retrieval focus on text-to-image person retrieval. However, isolated frames lack dynamic information, so performance suffers when the person is occluded or when variable motion details are missing from the frames. To overcome this, we propose a novel Text-to-Video Person Retrieval (TVPR) task. Since no dataset or benchmark describes person videos with natural language, we construct a large-scale cross-modal person video dataset with detailed natural language annotations, termed the Text-to-Video Person Re-identification (TVPReid) dataset. In this paper, we introduce a Multielement Feature Guided Fragments Learning (MFGF) strategy, which leverages cross-modal text-video representations to provide strong text-visual and text-motion matching information, addressing uncertain occlusion conflicts and variable motion details. Specifically, we establish two potential cross-modal spaces for text