Visual Representation Learning With Self-supervised Attention For Low-label High-data Regime
2022 · Prarthana Bhattacharyya, Chenge Li, Xiaonan Zhao, et al.
Abstract
Self-supervision has shown outstanding results for natural language processing, and more recently, for image recognition. Simultaneously, vision transformers and their variants have emerged as a promising and scalable alternative to convolutions on various computer vision tasks. In this paper, we are the first to question if self-supervised vision transformers (SSL-ViTs) can be adapted to two important computer vision tasks in the low-label, high-data regime: few-shot image classification and zero-shot image retrieval. The motivation is to reduce the number of manual annotations required to train a visual embedder, and to produce generalizable and semantically meaningful embeddings. For few-shot image classification, we train SSL-ViTs without any supervision on external data, and use this trained embedder to adapt quickly to novel classes with a limited number of labels. For zero-shot image retrieval, we use SSL-ViTs pre-trained on a large dataset without any labels and fine-tune them with
Related papers
- Analyzing Local Representations Of Self-supervised Vision Transformers (2023) · 0.00
- VISER: Visual Self-regularization (2018) · 0.00
- Self-supervised Vision Transformers For Writer Retrieval (2024) · 5.24
- Improving Spatiotemporal Self-supervision By Deep Reinforcement Learning (2018) · 13.50
- Boosting Vision Transformers For Image Retrieval (2022) · 15.28
- Improve Supervised Representation Learning With Masked Image Modeling (2023) · 0.00
- Training Vision Transformers For Image Retrieval (2021) · 0.00
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023) · 0.00