DiCo: Disentangled Concept Representation for Text-to-Image Person Re-identification
2026 · Giyeol Kim, Chanho Eom
Abstract
Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (e.g., color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demons
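The slot-and-block idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how slots are computed or compared, so the sketch assumes plain softmax cross-attention pooling, an even split of each slot vector into blocks, and an averaged block-wise cosine similarity. All function names (`pool_with_slots`, `split_into_blocks`, `blockwise_similarity`) and the toy values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pool_with_slots(slots, tokens):
    """Assumed pooling: each shared slot attends over one modality's tokens
    (softmax over tokens) and returns the attention-weighted token average."""
    scale = math.sqrt(len(slots[0]))
    pooled = []
    for slot in slots:
        weights = softmax([dot(slot, tok) / scale for tok in tokens])
        pooled.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                       for d in range(len(slot))])
    return pooled

def split_into_blocks(vec, num_blocks):
    """Assumed decomposition: cut a slot vector into equal-sized concept blocks."""
    if len(vec) % num_blocks != 0:
        raise ValueError("vector length must be divisible by num_blocks")
    size = len(vec) // num_blocks
    return [vec[i * size:(i + 1) * size] for i in range(num_blocks)]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def blockwise_similarity(img_slots, txt_slots, num_blocks):
    """Average cosine similarity over matching (slot, block) pairs, so block k
    of slot j in the image is only compared with block k of slot j in the text."""
    sims = []
    for s_img, s_txt in zip(img_slots, txt_slots):
        for b_img, b_txt in zip(split_into_blocks(s_img, num_blocks),
                                split_into_blocks(s_txt, num_blocks)):
            sims.append(cosine(b_img, b_txt))
    return sum(sims) / len(sims)

# Toy example: 2 shared slots, 4-dimensional features, 2 blocks per slot.
slots = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
image_tokens = [[0.9, 0.4, 0.1, 0.2], [0.1, 0.2, 0.8, 0.5], [0.3, 0.3, 0.3, 0.3]]
text_tokens = [[0.8, 0.5, 0.2, 0.1], [0.2, 0.1, 0.9, 0.4]]

score = blockwise_similarity(pool_with_slots(slots, image_tokens),
                             pool_with_slots(slots, text_tokens),
                             num_blocks=2)
print(f"image-text score: {score:.4f}")
```

Because the same slots pool both modalities, slot j refers to the same anchor on the image and text sides, and restricting comparisons to matching blocks is one simple way to keep each block's contribution to the score separate. A trained model would learn the slots and the block contents; nothing here is learned.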
Related papers
- Cross-modal Adaptive Dual Association For Text-to-image Person Retrieval (2023) 12.02
- TF-CLIP: Learning Text-free CLIP For Video-based Person Re-identification (2023) 15.81
- Deep Co-attention Based Comparators For Relative Representation Learning In Person Re-identification (2018) 13.34
- Text-video Retrieval With Disentangled Conceptualization And Set-to-set Alignment (2023) 11.49
- Text-guided Image Restoration And Semantic Enhancement For Text-to-image Person Retrieval (2023) 9.00
- Instruct-ReID++: Towards Universal Purpose Instruction-guided Person Re-identification (2024) 9.13
- Advancing Person Re-identification: Tensor-based Feature Fusion And Multilinear Subspace Learning (2023) 2.26
- Prototype-guided Cross-modal Completion And Alignment For Incomplete Text-based Person Re-identification (2023) 6.77