Improve Supervised Representation Learning With Masked Image Modeling
2023 · Kaifeng Chen, Daniel Salz, Huiwen Chang, et al.
Abstract
Training visual embeddings with labeled data supervision has been the de facto setup for representation learning in computer vision. Inspired by the recent success of masked image modeling (MIM) in self-supervised representation learning, we propose a simple yet effective setup that integrates MIM into existing supervised training paradigms. In our design, in addition to the original classification task applied to a vision transformer image encoder, we add a shallow transformer-based decoder on top of the encoder and introduce an MIM task that reconstructs image tokens from masked image inputs. We show that, with minimal change in architecture and no overhead at inference, this setup improves the quality of the learned representations for downstream tasks such as classification, image retrieval, and semantic segmentation. We conduct a comprehensive study and evaluation of our setup on public benchmarks. On ImageNet-1k, our ViT-B/14 model achieves 81.
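The training objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the encoder and decoder networks are omitted and plain lists of logits stand in for their outputs. The abstract does not state the mask ratio, the form of the reconstruction loss, or the loss weighting, so `mask_ratio`, `mim_weight`, and the use of cross-entropy over discrete token ids are all assumptions, and every function name here is hypothetical.

```python
import math
import random


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def sample_mask(num_patches, mask_ratio, rng):
    """Choose which patch positions are masked (ratio is an assumption)."""
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def joint_loss(cls_logits, label, token_logits, token_targets, masked, mim_weight):
    """Classification loss plus a weighted MIM loss over masked positions only.

    cls_logits:    stand-in for the classification head output on the encoder
    token_logits:  stand-in for the shallow decoder output, one row per patch
    token_targets: assumed discrete image-token id per patch
    """
    loss_cls = cross_entropy(cls_logits, label)
    if not masked:
        return loss_cls
    loss_mim = sum(
        cross_entropy(token_logits[i], token_targets[i]) for i in masked
    ) / len(masked)
    return loss_cls + mim_weight * loss_mim


# Toy usage: 16 patches, 4 classes, a vocabulary of 8 image tokens.
rng = random.Random(0)
masked = sample_mask(num_patches=16, mask_ratio=0.5, rng=rng)
cls_logits = [0.0] * 4
token_logits = [[0.0] * 8 for _ in range(16)]
token_targets = [i % 8 for i in range(16)]
loss = joint_loss(cls_logits, 1, token_logits, token_targets, masked, mim_weight=0.5)
```

In this sketch the decoder output only enters the loss, so at inference it could be dropped and only the encoder and classification head kept, which is consistent with the abstract's claim of no inference overhead.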
Related papers
- Visual Representation Learning With Self-supervised Attention For Low-label High-data Regime (2022)
- Analyzing Local Representations Of Self-supervised Vision Transformers (2023)
- E-ViLM: Efficient Video-language Model Via Masked Video Modeling With Semantic Vector-quantized Tokenizer (2023)
- MILES: Visual BERT Pre-training With Injected Language Semantics For Video-text Retrieval (2022)
- Mask To Reconstruct: Cooperative Semantics Completion For Video-text Retrieval (2023)
- Masked Vision-language Transformer In Fashion (2022)
- Steerable Visual Representations (2026)
- Masked Contrastive Pre-training For Efficient Video-text Retrieval (2022)