Transformers And CNNs Both Beat Humans On SBIR
2022 · Omar Seddati, Stéphane Dupont, Saïd Mahmoudi, et al.
Abstract
Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics and the spatial configuration of hand-drawn sketch queries. The universality of sketches extends the scope of possible applications and increases the demand for efficient SBIR solutions. In this paper, we study classic triplet-based SBIR solutions and show that a persistent invariance to horizontal flip (even after model finetuning) harms performance. To overcome this limitation, we propose several approaches and evaluate each of them in depth to check their effectiveness. Our main contributions are twofold: (1) we propose and evaluate several intuitive modifications to build SBIR solutions with better flip equivariance; (2) we show that vision transformers are better suited to the SBIR task and that they outperform CNNs by a large margin. We carried out numerous experiments and introduce the first models to outperform human performance on a large-scale SBIR benchmark (Sketchy).
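The two ideas the abstract relies on, a triplet objective and a test for horizontal-flip invariance, can be illustrated with a small sketch. This is not the authors' code. The function names, the margin value, and the two toy encoders are assumptions made for illustration, and the Euclidean triplet margin loss shown is the standard textbook form rather than the specific variant used in the paper.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on Euclidean distances, averaged over the batch.

    anchor, positive, negative: arrays of shape (N, D), e.g. sketch embeddings,
    matching-photo embeddings, and non-matching-photo embeddings.
    The margin of 0.2 is an arbitrary illustrative value.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def flip_invariance_gap(embed, images):
    """Mean distance between embeddings of images and of their horizontal mirrors.

    images: array of shape (N, H, W). A gap near 0 means the encoder cannot
    tell an image from its mirror, i.e. it is flip invariant.
    """
    flipped = images[:, :, ::-1]  # reverse the width axis
    return float(np.mean(np.linalg.norm(embed(images) - embed(flipped), axis=1)))


# Two hypothetical toy encoders standing in for a trained network.
def mean_pool_embed(images):
    # Averaging over the width axis discards left/right order: flip invariant.
    return images.mean(axis=2)


def flatten_embed(images):
    # Keeping every pixel position preserves left/right order: flip sensitive.
    return images.reshape(len(images), -1)


rng = np.random.default_rng(0)
toy_images = rng.random((4, 8, 8))
gap_invariant = flip_invariance_gap(mean_pool_embed, toy_images)
gap_sensitive = flip_invariance_gap(flatten_embed, toy_images)
```

With these toy encoders, `gap_invariant` is zero up to floating-point rounding while `gap_sensitive` is clearly positive. Applying the same measurement to a real encoder would be one rough way to check whether it still treats a sketch and its mirror image as the same query.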
Related papers
- Generalisation And Sharing In Triplet Convnets For Sketch Based Visual Search (2016)
- A Recipe For Efficient SBIR Models: Combining Relative Triplet Loss With Batch Normalization And Knowledge Distillation (2023)
- An Efficient Framework For Zero-shot Sketch-based Image Retrieval (2021)
- Zero-shot Sketch Based Image Retrieval Using Graph Transformer (2022)
- Exploiting Unlabelled Photos For Stronger Fine-grained SBIR (2023)
- Back To The Drawing Board: Rethinking Scene-level Sketch-based Image Retrieval (2025)
- A Zero-shot Framework For Sketch-based Image Retrieval (2018)
- Boosting Vision Transformers For Image Retrieval (2022)