Enhancing Visual Representation For Text-based Person Searching
2024 · Wei Shen, Ming Fang, Yuxia Wang, et al.
Abstract
Text-based person search aims to retrieve matching pedestrians from a large-scale image database given a text description. The core difficulty is extracting effective details from pedestrian images and texts and aligning them across modalities in a common latent space. Prior works adopt image and text encoders pre-trained on unimodal data to extract global and local features from images and texts, and then perform explicit global-local alignment. However, these approaches still lack the ability to understand visual details, and retrieval accuracy remains limited by identity confusion. To alleviate these problems, we rethink the importance of visual features for text-based person search and propose VFE-TPS, a Visual Feature Enhanced Text-based Person Search model. It introduces the pre-trained multimodal backbone CLIP to learn basic multimodal features and constructs a Text Guided Masked Image Modeling task to enhance the mo
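The retrieval setting described in the abstract can be illustrated with a minimal sketch: once text and images are embedded in a common latent space, gallery images are ranked by their similarity to the query text. The code below is not the VFE-TPS model. The embeddings are hand-written toy vectors standing in for encoder outputs, cosine similarity is an assumed choice of score, and the names `cosine_similarity` and `rank_gallery` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(text_embedding, gallery):
    """Return gallery image ids sorted from best to worst match."""
    scored = [
        (cosine_similarity(text_embedding, image_embedding), image_id)
        for image_id, image_embedding in gallery.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]


# Toy embeddings (assumed for illustration, not taken from the paper).
query = [0.9, 0.1, 0.0]
gallery = {
    "img_a": [0.8, 0.2, 0.1],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}

ranking = rank_gallery(query, gallery)
print(ranking)
```

In a real system the vectors would come from trained image and text encoders, and the ranking would typically be evaluated with retrieval metrics over the whole gallery.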
Related papers
- TIPCB: A Simple But Effective Part-based Convolutional Baseline For Text-based Person Search (2021)
- Text-based Person Search With Limited Data (2021)
- Multi-path Exploration And Feedback Adjustment For Text-to-image Person Retrieval (2024)
- Improving Text-based Person Search Via Part-level Cross-modal Correspondence (2024)
- Boosting Weak Positives For Text Based Person Search (2025)
- Beat: Bi-directional One-to-many Embedding Alignment For Text-based Person Retrieval (2024)
- Person Text-image Matching Via Text-feature Interpretability Embedding And External Attack Node Implantation (2022)
- Text-guided Image Restoration And Semantic Enhancement For Text-to-image Person Retrieval (2023)