Rethinking Benchmarks For Cross-modal Image-text Retrieval
2023 · Weijing Chen, Linli Yao, Qin Jin
Abstract
Image-text retrieval, a fundamental branch of information retrieval, has attracted extensive research attention. The main challenge of this task is cross-modal semantic understanding and matching, and some recent works focus on fine-grained cross-modal semantic matching. With the prevalence of large-scale multimodal pretraining models, several state-of-the-art models (e.g., X-VLM) have achieved near-perfect performance on the widely used image-text retrieval benchmarks MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review these two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching, because a large number of the images and texts in the benchmarks are coarse-grained. Based on this observation, we renovate the coarse-grained images and texts in the old benchmarks and establish improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image sid
Related papers
- Benchmark Granularity And Model Robustness For Image-text Retrieval (2024)
- Revisiting Cross Modal Retrieval (2018)
- Benchmarking Robustness Of Text-image Composed Retrieval (2023)
- Image-text Retrieval Via Preserving Main Semantics Of Vision (2023)
- Revisiting Oxford And Paris: Large-scale Image Retrieval Benchmarking (2018)
- Scene-centric Vs. Object-centric Image-text Cross-modal Retrieval: A Reproducibility Study (2023)
- Where Does The Performance Improvement Come From? -- A Reproducibility Concern About Image-text Retrieval (2022)
- Few Shots Text To Image Retrieval: New Benchmarking Dataset And Optimization Methods (2026)