Cross-modal Implicit Relation Reasoning And Aligning For Text-to-image Person Retrieval
2023 Β· Ding Jiang, Mang Ye
Abstract
Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the necessary underlying alignment capabilities required to match multimodal data effectively. Besides, these works use prior information to explore explicit part alignments, which may lead to the distortion of intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module in a masked language modeling paradigm. This
Authors
(none)
Tags
Stats
Related papers
- Multilingual Text-to-image Person Retrieval Via Bidirectional Relation Reasoning And Aligning (2025)2.35
- See Finer, See More: Implicit Modality Alignment For Text-based Person Retrieval (2022)18.39
- Cross-modal Full-mode Fine-grained Alignment For Text-to-image Person Retrieval (2025)2.23
- Cross-modal Adaptive Dual Association For Text-to-image Person Retrieval (2023)12.02
- Text-guided Image Restoration And Semantic Enhancement For Text-to-image Person Retrieval (2023)9.00
- Decoupled Cross-modal Alignment Network For Text-rgbt Person Retrieval And A High-quality Benchmark (2025)0.00
- Learning Relation Alignment For Calibrated Cross-modal Retrieval (2021)8.82
- IMRAM: Iterative Matching With Recurrent Attention Memory For Cross-modal Image-text Retrieval (2020)19.22