ArtSeek: Deep Artwork Understanding Via Multimodal In-context Reasoning And Late Interaction Retrieval
2025 · Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
Abstract
Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, a situation common in most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-ground
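The abstract names late interaction retrieval without spelling out the scoring rule. As a rough illustration only, the sketch below implements the MaxSim score commonly associated with late interaction retrievers: each query vector is matched to its most similar fragment vector, and those maxima are summed. The function names, the toy two-dimensional embeddings, and the fragment labels are all assumptions made for this example; nothing here is taken from the ArtSeek implementation.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late interaction (MaxSim) score.

    For every query vector, take the highest dot product against any
    document vector, then sum those maxima over the query vectors.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_fragments(
    query_vecs: Sequence[Vector],
    fragments: Dict[str, Sequence[Vector]],
) -> List[Tuple[str, float]]:
    """Return (fragment name, score) pairs sorted by descending MaxSim score."""
    scored = [(name, maxsim_score(query_vecs, vecs)) for name, vecs in fragments.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, hand-made embeddings (hypothetical; a real system would use a learned encoder).
query = [[1.0, 0.0], [0.0, 1.0]]
fragments = {
    "fragment_a": [[1.0, 0.0], [0.0, 1.0]],  # covers both query vectors
    "fragment_b": [[1.0, 0.0], [1.0, 0.0]],  # covers only the first query vector
}

ranking = rank_fragments(query, fragments)
print(ranking)
```

Because every query vector keeps its own best match, a fragment that covers all parts of the query (here `fragment_a`) outranks one that matches only part of it.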
Related papers
- How To Read Paintings: Semantic Art Understanding With Multi-modal Retrieval (2018) 13.93
- Context-aware Embeddings For Automatic Art Analysis (2019) 12.54
- Iart: A Search Engine For Art-historical Images To Support Research In The Humanities (2021) 8.35
- Deepimagesearch: Benchmarking Multimodal Agents For Context-aware Image Retrieval In Visual Histories (2026) 0.00
- RAVENEA: A Benchmark For Multimodal Retrieval-augmented Visual Culture Understanding (2025) 0.00
- Visual Link Retrieval And Knowledge Discovery In Painting Datasets (2020) 12.25
- Object Retrieval And Localization In Large Art Collections Using Deep Multi-style Feature Fusion And Iterative Voting (2021) 7.50
- Artistmus: A Globally Diverse, Artist-centric Benchmark For Retrieval-augmented Music Question Answering (2025) 0.00