Abstract
Hashing techniques are widely adopted in large-scale retrieval due to their low time and space complexity. Existing deep cross-modal hashing methods mostly rely on mini-batch training, where only a limited number of samples are processed in each iteration. This often leads to incomplete neighborhood exploration and sub-optimal embedding learning, especially on complex multi-label datasets. To address this limitation, recent studies have optimized the SoftMax loss, which is essentially equivalent to a smoothed triplet constraint with a single center assigned to each class. However, in practical retrieval scenarios, the distribution of a class in the embedding space often contains multiple semantic clusters. Modeling each class with only one center fails to capture these intra-class local structures, thereby widening the semantic gap between heterogeneous samples. To alleviate this issue, we propose a novel deep cross-modal hashing framework, Deep SoftTriple Hashing (DSTH), which learns compact hash codes that better preserve semantic similarities in the embedding space. The framework introduces multiple centers for each class to model the implicit distribution of heterogeneous samples and reduce intra-class semantic variance. To determine the number of centers, a class-center strategy is developed in which similar centers are encouraged to merge through L2,1-norm regularization, yielding a compact set of centers. In addition, a semantic position quantization loss is introduced to minimize quantization error and enhance the discriminability of the binary codes. Extensive experiments on three multi-label datasets demonstrate the effectiveness of DSTH in cross-modal retrieval: it achieves absolute mAP improvements of 1.2% to 9.3% over strong baselines while consistently maintaining superior performance on PR curves. The source code is available at: https://github.com/QinLab-WFU/DSTH-SoftTriple.
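To make the multi-center idea concrete, the following is a minimal sketch of a SoftTriple-style loss for a single sample, together with a penalty on the distances between centers of the same class. It follows the commonly cited single-label SoftTriple formulation and is not taken from the DSTH code; the multi-label and hashing-specific parts of DSTH are not reproduced. All function names and the hyperparameter values (`lam`, `delta`, `gamma`) are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit L2 norm.
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def relaxed_similarity(x, centers, gamma=0.1):
    # Softmax-weighted average of the similarities between x and the
    # K centers of one class (a smooth stand-in for the max over centers).
    sims = [dot(x, w) for w in centers]
    m = max(sims)
    weights = [math.exp((s - m) / gamma) for s in sims]
    z = sum(weights)
    return sum(wt / z * s for wt, s in zip(weights, sims))

def soft_triple_loss(x, class_centers, label, lam=20.0, delta=0.01, gamma=0.1):
    # Cross-entropy over per-class relaxed similarities, with a margin
    # delta subtracted from the ground-truth class (assumed values).
    logits = []
    for c, centers in enumerate(class_centers):
        s = relaxed_similarity(x, centers, gamma)
        logits.append(lam * (s - delta) if c == label else lam * s)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def center_regularizer(class_centers):
    # Mean L2 distance between pairs of centers within each class.
    # Summing L2 norms of pairwise differences is an L2,1-style penalty:
    # it pushes whole difference vectors to zero, so similar centers merge.
    total, pairs = 0.0, 0
    for centers in class_centers:
        for s in range(len(centers)):
            for t in range(s + 1, len(centers)):
                diff = [a - b for a, b in zip(centers[s], centers[t])]
                total += math.sqrt(dot(diff, diff))
                pairs += 1
    return total / pairs if pairs else 0.0

# Toy example: 2 classes, 2 centers each, 2-D unit-norm embeddings.
class_centers = [
    [normalize([1.0, 0.0]), normalize([0.9, 0.2])],
    [normalize([0.0, 1.0]), normalize([-0.2, 1.0])],
]
x = normalize([1.0, 0.1])
loss = soft_triple_loss(x, class_centers, label=0)
reg = center_regularizer(class_centers)
```

In a training loop the two terms would typically be combined as a weighted sum, with the weight on the regularizer controlling how aggressively redundant centers are merged; that weight is a further assumption and is not specified here.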