Abstract
Cross-modal retrieval faces new challenges as multimedia content and internet technologies grow rapidly. Current deep cross-modal hashing techniques struggle with complex multi-label data, local semantic representation, and inter-modal feature imbalance. We propose Feature Fusion Mamba Hashing via Decoupling (FFMHD) to address these issues. A CLIP-based module accurately aligns image and text features. A decoupling enhancement module separates style from content so that retrieval focuses on semantics. A hybrid state-space module combines convolutional local feature extraction with SS2D global perception to capture deep local semantics. Furthermore, a dynamic multi-semantic aligned hash loss refines sample relationships using quadratic spherical mutual information, modality-agnostic, and multi-label proxy losses. Experiments on MIRFLICKR-25K, NUS-WIDE, and MS COCO show that FFMHD outperforms state-of-the-art methods, achieving mAP gains of up to 6.1% on image-to-text (I2T) and 5.9% on text-to-image (T2I) retrieval. Ablation studies validate the effectiveness of each module.
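For readers unfamiliar with how mAP is typically measured for hashing-based retrieval, the following is a minimal sketch and not the evaluation code used in this work. It assumes binary codes in {-1, +1}, ranking by Hamming distance, and the common multi-label convention that a database item is relevant when it shares at least one label with the query. The function names and the `top_k` parameter are hypothetical.

```python
import numpy as np


def hamming_distance(query_codes, db_codes):
    """Pairwise Hamming distance between {-1, +1} codes of equal length."""
    code_len = query_codes.shape[1]
    # For +/-1 codes: distance = (code_len - inner_product) / 2.
    return 0.5 * (code_len - query_codes @ db_codes.T)


def mean_average_precision(query_codes, db_codes, query_labels, db_labels, top_k=None):
    """mAP over Hamming-ranked results (assumed protocol, see lead-in)."""
    dist = hamming_distance(query_codes, db_codes)
    # Assumption: relevant means the multi-hot label vectors overlap.
    relevant = (query_labels @ db_labels.T) > 0
    average_precisions = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i], kind="stable")
        if top_k is not None:
            order = order[:top_k]
        rel = relevant[i, order].astype(np.float64)
        num_relevant = rel.sum()
        if num_relevant == 0:
            continue  # queries with no relevant item in the ranking are skipped
        precision_at_rank = np.cumsum(rel) / np.arange(1, rel.size + 1)
        average_precisions.append((precision_at_rank * rel).sum() / num_relevant)
    return float(np.mean(average_precisions)) if average_precisions else 0.0
```

Under this sketch, I2T would pass image codes as queries and text codes as the database, and T2I would swap the two roles.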