Abstract
Recent advances in deep learning have led to significant progress in hashing for image retrieval. However, existing methods mostly assume that training and test data share the same distribution, that is, that the categories in the training and test sets are identical. This assumption may not hold in real-world scenarios, which can limit the effectiveness of these methods. In this work, we evaluate existing deep hashing methods on retrieval over unseen categories and observe a considerable performance decline. To address this issue, we propose a Hierarchical Text-guided Hashing (HTH) framework that mitigates performance degradation in open-world image retrieval. Specifically, our method is trained in a self-supervised learning (SSL) framework using automatically synthesized coarse-to-fine textual descriptions. By combining the strength of SSL in learning discriminative low- and mid-level features with the semantic richness of hierarchically structured text, our approach aims to enhance the model's ability to generalize to unseen categories and complex open-world settings. Technically, we design a local attention pooling module to fuse local patch information. Furthermore, we propose hierarchical and fine-grained alignment modules, applied respectively to the global and local vision-language representations at different semantic levels, which guide the hash encoder to capture visual primitives and extract discriminative, generalizable semantic information from images. Under the newly established large-scale ImageNet-CoG open evaluation protocol, our method generalizes significantly better than state-of-the-art methods, and it also improves performance on various other open-world retrieval datasets and scenarios.
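
To make the retrieval pipeline concrete, the following is a minimal illustrative sketch, not the HTH implementation. It assumes a single query vector for attention pooling over patch features, sign-based binarization, and Hamming-distance ranking. The function names (`attention_pool`, `to_hash`, `hamming`) and the scaled dot-product scoring are assumptions introduced here for illustration only.

```python
import math

def attention_pool(patches, query):
    """Fuse patch feature vectors into one vector using softmax attention.

    Assumption: scores are scaled dot products between each patch and a
    single query vector; the actual module in the paper may differ.
    """
    d = len(query)
    scores = [sum(p[i] * query[i] for i in range(d)) / math.sqrt(d)
              for p in patches]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(d)]

def to_hash(vec):
    """Binarize a real-valued vector by sign (1 if non-negative, else 0)."""
    return [1 if v >= 0 else 0 for v in vec]

def hamming(a, b):
    """Count the positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

# Toy usage: two identical patches pool to the same vector.
pooled = attention_pool([[1.0, -1.0], [1.0, -1.0]], [0.5, 0.5])
code = to_hash(pooled)
distance = hamming(code, [1, 1])
```

At retrieval time, database images would be ranked by the Hamming distance between their codes and the query code, which is what makes hashing efficient at scale.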