Billion-scale Pretraining With Vision Transformers For Multi-task Visual Representations
2021 · Josh Beal, Hao-Yu Wu, Dong Huk Park, et al.
Abstract
Large-scale pretraining of visual representations has led to state-of-the-art performance on a range of benchmark computer vision tasks, yet the benefits of these techniques at extreme scale in complex production systems have been relatively unexplored. We consider the case of a popular visual discovery product, where these representations are trained with multi-task learning, from use-case-specific visual understanding (e.g. skin tone classification) to general representation learning for all visual content (e.g. embeddings for retrieval). In this work, we describe how we (1) generate a dataset with over a billion images via large weakly-supervised pretraining to improve the performance of these visual representations, and (2) leverage Transformers to replace the traditional convolutional backbone, with insights into both system and performance improvements, especially at 1B+ image scale. To support this backbone model, we detail a systematic approach to deriving weakly-supervised imag
Related papers
- Boosting Vision Transformers For Image Retrieval (2022)
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021)
- Training Vision Transformers For Image Retrieval (2021)
- Understanding The Effect Of Using Semantically Meaningful Tokens For Visual Representation Learning (2024)
- Siamese Vision Transformers Are Scalable Audio-visual Learners (2024)
- Analyzing Local Representations Of Self-supervised Vision Transformers (2023)
- Visual Representation Learning With Self-supervised Attention For Low-label High-data Regime (2022)
- 12-in-1: Multi-task Vision And Language Representation Learning (2019)