Self-supervised Vision Transformers For Writer Retrieval
2024 · Tim Raven, Arthur Matei, Gernot A. Fink
Abstract
While methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains, they have not yet been applied successfully in the domain of writer retrieval. The field is dominated by methods using handcrafted features or features extracted from Convolutional Neural Networks. In this work, we bridge this gap and present a novel method that extracts features from a ViT and aggregates them using VLAD encoding. The model is trained in a self-supervised fashion without any need for labels. We show that extracting local foreground features is superior to using the ViT's class token in the context of writer retrieval. We evaluate our method on two historical document collections. We set a new state of the art on the Historical-WI dataset (83.1% mAP) and the HisIR19 dataset (95.0% mAP). Additionally, we demonstrate that our ViT feature extractor can be directly applied to modern datasets such as the CVL database (98.6% mAP) without any fine-tuning.
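The aggregation step named in the abstract can be illustrated with a minimal sketch of VLAD encoding. This is not the authors' implementation: the abstract does not give the details, so the hard nearest-centroid assignment, the power normalization, and the final L2 normalization below are common VLAD choices assumed for illustration. The function name `vlad_encode` and the toy inputs are hypothetical, and the centroids are assumed to come from a clustering step (for example k-means) run beforehand on local descriptors.

```python
import math


def vlad_encode(descriptors, centroids):
    """Aggregate local descriptors into one VLAD vector of length k * d.

    descriptors: list of d-dimensional local feature vectors from one page
    centroids:   list of k d-dimensional cluster centers (assumed precomputed)
    """
    k, d = len(centroids), len(centroids[0])
    residual_sums = [[0.0] * d for _ in range(k)]

    for x in descriptors:
        # Hard assignment: pick the centroid with the smallest squared distance.
        nearest = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])),
        )
        # Accumulate the residual between the descriptor and its centroid.
        for i in range(d):
            residual_sums[nearest][i] += x[i] - centroids[nearest][i]

    # Power normalization (signed square root), an assumed but common choice.
    flat = [
        math.copysign(math.sqrt(abs(v)), v)
        for row in residual_sums
        for v in row
    ]

    # L2 normalization so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy example: three 2-D descriptors, two centroids (all values hypothetical).
page_descriptors = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
page_vector = vlad_encode(page_descriptors, codebook)
```

In a retrieval setting, one such vector would be computed per document page, and pages would be ranked by the dot product between the query vector and every other page vector.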
Related papers
- Analyzing Local Representations Of Self-supervised Vision Transformers (2023)
- Boosting Vision Transformers For Image Retrieval (2022)
- VITR: Augmenting Vision Transformers With Relation-focused Learning For Cross-modal Information Retrieval (2023)
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021)
- Vision Transformer Hashing For Image Retrieval (2021)
- Writer Identification And Writer Retrieval Based On NetVLAD With Re-ranking (2020)
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021)
- Training Vision Transformers For Image Retrieval (2021)