Towards Implicit Aggregation: Robust Image Representation For Place Recognition In The Transformer Era
2025 · Feng Lu, Tong Jin, Canming Ye, et al.
Abstract
Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful
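The token mechanism described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: a single-head self-attention layer with a residual connection stands in for "a particular transformer block", the weights are random rather than learned, and all names (`global_descriptor`, `agg_tokens`, and so on) are hypothetical. The readout step (concatenating the output aggregation tokens and L2-normalizing them) is an assumption, since the abstract text above is cut off before it describes how the descriptor is formed.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    attn = [softmax(r) for r in scores]
    return matmul(attn, v)


def random_matrix(rows, cols, rng):
    """Stand-in for learned parameters or backbone features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def global_descriptor(patch_tokens, agg_tokens, w_q, w_k, w_v):
    # Prepend the aggregation tokens to the patch tokens.
    tokens = agg_tokens + patch_tokens
    # All tokens interact globally through self-attention (plus a residual).
    attended = self_attention(tokens, w_q, w_k, w_v)
    out = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    # Assumed readout: keep only the aggregation tokens, concatenate, normalize.
    agg_out = out[:len(agg_tokens)]
    flat = [v for tok in agg_out for v in tok]
    return l2_normalize(flat)


rng = random.Random(0)
dim, num_patches, num_agg = 8, 16, 4
patches = random_matrix(num_patches, dim, rng)  # pretend backbone patch tokens
agg = random_matrix(num_agg, dim, rng)          # "learnable" aggregation tokens
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
desc = global_descriptor(patches, agg, w_q, w_k, w_v)
print(len(desc))  # descriptor length is num_agg * dim
```

In this sketch no separate aggregator module exists: the only extra parameters are the aggregation tokens themselves, and the descriptor size is set by how many of them are prepended.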
Related papers
- R²Former: Unified Retrieval And Reranking Transformer For Place Recognition (2023) · 18.31
- VLAD-BuFF: Burst-aware Fast Feature Aggregation For Visual Place Recognition (2024) · 10.46
- UniPR-3D: Towards Universal Visual Place Recognition With Visual Geometry Grounded Transformer (2025) · 2.95
- PlaceFormer: Transformer-based Visual Place Recognition Using Multi-scale Patch Selection And Fusion (2024) · 7.81
- MultiRes-NetVLAD: Augmenting Place Recognition Training With Low-resolution Imagery (2022) · 16.01
- Optimal Transport Aggregation For Visual Place Recognition (2023) · 20.51
- Regressing Transformers For Data-efficient Visual Place Recognition (2024) · 3.58
- Attention-based Pyramid Aggregation Network For Visual Place Recognition (2018) · 14.11