SPAN: Learning Similarity Between Scene Graphs And Images With Transformers
2023 · Yuren Cong, Wentong Liao, Bodo Rosenhahn, et al.
Abstract
Learning similarity between scene graphs and images aims to estimate a similarity score given a scene graph and an image. There is currently no research dedicated to this task, although it is critical for scene graph generation and downstream applications. Scene graph generation is conventionally evaluated by Recall@K and mean Recall@K, which measure the ratio of predicted triplets that appear in the human-labeled triplet set. However, such triplet-oriented metrics fail to capture the overall semantic difference between a scene graph and an image, and they are sensitive to annotation bias and noise. The use of generated scene graphs in downstream applications is therefore limited. To address this issue, for the first time, we propose a Scene graPh-imAge coNtrastive learning framework, SPAN, that can measure the similarity between scene graphs and images. Our novel framework consists of a graph Transformer and an image Transformer to align scene graphs and their corresponding image
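To make the limitation of triplet-oriented metrics concrete, here is a minimal sketch of Recall@K over exact-match triplets. It assumes the common convention in which Recall@K is the fraction of human-labeled triplets recovered among the top-K predictions. The function name `recall_at_k` and the toy triplets are illustrative assumptions and are not taken from the paper.

```python
def recall_at_k(predicted, labeled, k):
    """Fraction of labeled triplets found among the top-k predictions.

    `predicted` is a list of (subject, predicate, object) tuples sorted by
    descending confidence; `labeled` is an iterable of ground-truth tuples.
    Matching is exact string equality, which is the point of the example.
    """
    labeled = set(labeled)
    if not labeled:
        return 0.0
    top_k = set(predicted[:k])
    return len(top_k & labeled) / len(labeled)


# Toy data (hypothetical): two labeled triplets, three ranked predictions.
labeled = [("person", "on", "horse"), ("horse", "on", "grass")]
predicted = [
    ("man", "riding", "horse"),   # semantically close, but not an exact match
    ("horse", "on", "grass"),
    ("person", "on", "horse"),
]

r_at_2 = recall_at_k(predicted, labeled, k=2)
r_at_3 = recall_at_k(predicted, labeled, k=3)
```

In this toy case the top-ranked prediction describes the scene plausibly but earns no credit because it does not match the annotation word for word, which is the kind of annotation sensitivity the abstract points to.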
Related papers
- Image-to-image Retrieval By Learning Similarity Between Scene Graphs (2020) 12.02
- Scene Graph Embeddings Using Relative Similarity Supervision (2021) 7.50
- Zero-shot Sketch Based Image Retrieval Using Graph Transformer (2022) 6.77
- Triplet-aware Scene Graph Embeddings (2019) 7.81
- Scene Text Retrieval Via Joint Text Detection And Similarity Learning (2021) 16.16
- Video-language Alignment Via Spatio-temporal Graph Transformer (2024) 0.00
- SCENIR: Visual Semantic Clarity Through Unsupervised Scene Graph Retrieval (2025) 0.00
- Vista: Vision And Scene Text Aggregation For Cross-modal Retrieval (2022) 14.31