Cross-modal Fusion Distillation For Fine-grained Sketch-based Image Retrieval
2022 · Abhra Chaudhuri, Massimiliano Mancini, Yanbei Chen, et al.
Abstract
Representation learning for sketch-based image retrieval has mostly been tackled by learning embeddings that discard modality-specific information. As instances from different modalities can often provide complementary information describing the underlying concept, we propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it. Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities. We then decouple the input space of this modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation. Such encoders can then be applied to downstream tasks like cross-modal retrieval. We demonstrate the expressive capacity of the learned representations by performing a wide range of experiments and achieving state-of-the-art res
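The abstract names two distillation signals, contrastive and relational, without giving their form. The following is a minimal sketch of what such objectives commonly look like, not the paper's actual losses: it assumes an InfoNCE-style contrastive term and a pairwise-distance matching term, with "teacher" standing for the fused representations and "student" for the output of a single-modality encoder. All function names and the temperature value are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_distillation(student, teacher, temperature=0.1):
    # InfoNCE-style term (an assumption): student[i] should be closer to
    # teacher[i] than to any other teacher[j] in the batch.
    total = 0.0
    for i, s in enumerate(student):
        logits = [cosine(s, t) / temperature for t in teacher]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(student)

def pairwise_distances(embeddings):
    # Euclidean distance between every pair of embeddings in the batch.
    return [[math.dist(a, b) for b in embeddings] for a in embeddings]

def relational_distillation(student, teacher):
    # Relational term (an assumption): mean squared gap between the
    # student's and the teacher's pairwise-distance matrices.
    ds, dt = pairwise_distances(student), pairwise_distances(teacher)
    n = len(student)
    return sum((ds[i][j] - dt[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Toy batch: two fused (teacher) embeddings and two candidate student batches.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each student matches its own teacher
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each student matches the wrong teacher
```

On this toy batch the contrastive term is lower for `aligned` than for `swapped`, while the relational term is zero for both, because swapping the two instances leaves the set of pairwise distances unchanged. That difference suggests why the two terms could be complementary: one ties each instance to its own fused representation, the other only preserves the geometry of the batch.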
Related papers
- Cross-modal Hierarchical Modelling For Fine-grained Sketch Based Image Retrieval (2020) · 6.77
- Learning Cross-modal Deep Embeddings For Multi-object Image Retrieval Using Text And Sketch (2018) · 9.59
- Sketch And Text Synergy: Fusing Structural Contours And Descriptive Attributes For Fine-grained Image Retrieval (2026) · 0.00
- You'll Never Walk Alone: A Sketch And Text Duet For Fine-grained Image Retrieval (2024) · 9.41
- Retrieval-guided Cross-view Image Synthesis (2024) · 0.00
- Cross-modal Subspace Learning For Fine-grained Sketch-based Image Retrieval (2017) · 13.34
- Modality-aware Representation Learning For Zero-shot Sketch-based Image Retrieval (2024) · 8.60
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025) · 0.00