See Finer, See More: Implicit Modality Alignment For Text-based Person Retrieval
2022 · Xiujun Shu, Wei Wen, Haoqian Wu, et al.
Abstract
Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between the visual and textual modalities. To achieve this goal, existing works employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To relieve these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Unlike previous models, IVT uses a single network to learn representations for both modalities, which contributes to the visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores fin
Related papers
- Cross-modal Implicit Relation Reasoning And Aligning For Text-to-image Person Retrieval (2023) 18.15
- Cross-modal Full-mode Fine-grained Alignment For Text-to-image Person Retrieval (2025) 2.23
- Multi-path Exploration And Feedback Adjustment For Text-to-image Person Retrieval (2024) 0.00
- Decoupled Cross-modal Alignment Network For Text-RGBT Person Retrieval And A High-quality Benchmark (2025) 0.00
- Improving Text-based Person Search Via Part-level Cross-modal Correspondence (2024) 0.00
- BEAT: Bi-directional One-to-many Embedding Alignment For Text-based Person Retrieval (2024) 10.85
- Enhancing Visual Representation For Text-based Person Searching (2024) 1.69
- Multilingual Text-to-image Person Retrieval Via Bidirectional Relation Reasoning And Aligning (2025) 2.35