Event-retriever: Event-aware Multimodal Image Retrieval For Realistic Captions
2025 · Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, et al.
Abstract
Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating t
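The fusion step named in the abstract, Reciprocal Rank Fusion, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the image identifiers, and the constant k = 60 (a commonly used default for RRF) are assumptions, since the abstract does not state how the fusion was configured. Each item receives a score equal to the sum of 1 / (k + rank) over every ranked list in which it appears, and items are returned in descending order of that score.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse several ranked lists into one using Reciprocal Rank Fusion.

    `rankings` holds one ranked list per retrieval configuration, best item
    first. `k` is the smoothing constant; 60 is an assumed default here,
    not a value taken from the paper.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top item contributes 1 / (k + 1).
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by the item's string form
    # so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], str(item)))


if __name__ == "__main__":
    # Hypothetical image IDs returned by three retrieval configurations.
    runs = [
        ["img_a", "img_b", "img_c"],
        ["img_b", "img_a", "img_d"],
        ["img_b", "img_c", "img_a"],
    ]
    print(reciprocal_rank_fusion(runs))
```

Because only rank positions enter the score, lists produced by models with incomparable score scales can be combined without calibration, which is one plausible reason to prefer this scheme when fusing several configurations.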
Related papers
- Recap: Event-aware Image Captioning With Article Retrieval And Semantic Gaussian Normalization (2025) 1.56
- Event-enriched Image Analysis Grand Challenge At ACM Multimedia 2025 (2025) 4.52
- Towards Retrieval-augmented Architectures For Image Captioning (2024) 9.41
- Zse-cap: A Zero-shot Ensemble For Image Retrieval And Prompt-guided Captioning (2025) 0.00
- Scene Graph Based Image Retrieval -- A Case Study On The CLEVR Dataset (2019) 0.00
- Knowledge Completes The Vision: A Multimodal Entity-aware Retrieval-augmented Generation Framework For News Image Captioning (2025) 0.00
- Multivent 2.0: A Massive Multilingual Benchmark For Event-centric Video Retrieval (2024) 3.58
- Revising Image-text Retrieval Via Multi-modal Entailment (2022) 0.00