Attention-based Multimodal Image Matching
2021 · Aviad Moreshet, Yosi Keller
Abstract
We propose an attention-based approach for multimodal image patch matching using a Transformer encoder attending to the feature maps of a multiscale Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues. We also introduce an attention-residual architecture, using a residual connection bypassing the encoder. This additional learning signal facilitates end-to-end training from scratch. Our approach is experimentally shown to achieve new state-of-the-art accuracy on both multimodal and single-modality benchmarks, illustrating its general applicability. To the best of our knowledge, this is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
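To make the data flow described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It makes several assumptions that the abstract does not state: the multiscale CNN feature maps are passed in as precomputed lists of vectors, the encoder is reduced to one single-head self-attention layer with identity projections (no learned weights), tokens are mean-pooled, the bypass is added to the encoder output after pooling, and patches are compared by Euclidean distance. All function names (`self_attention`, `embed`, `patch_distance`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections); a trained encoder would learn these.
    """
    d = len(tokens[0])
    attended = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        )
    return attended


def mean_pool(tokens):
    """Average a list of equal-length vectors into one vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def embed(multiscale_maps):
    """Embed one patch from its (hypothetical) multiscale feature maps.

    Each feature map is a list of d-dimensional vectors; all scales are
    concatenated into a single token sequence for the encoder.
    """
    tokens = [tok for fmap in multiscale_maps for tok in fmap]
    encoded = mean_pool(self_attention(tokens))  # path through the encoder
    bypass = mean_pool(tokens)                   # residual path around it
    return [e + b for e, b in zip(encoded, bypass)]


def patch_distance(maps_a, maps_b):
    """Siamese comparison: the same embed() is applied to both patches."""
    ea, eb = embed(maps_a), embed(maps_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ea, eb)))
```

In this sketch the bypass term means the embedding still carries the pooled CNN features even when the attention path contributes little, which is one plausible reading of how a residual connection around the encoder could provide the additional learning signal the abstract mentions.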