Formalized Information Needs Improve Large-language-model Relevance Judgments
2026 · Jüri Keller, Maik Fröbe, Björn Engelmann, et al.
Abstract
Cranfield-style retrieval evaluations with too few or too many relevant documents or with low inter-assessor agreement on relevance can reduce the reliability of observations. In evaluations with human assessors, information needs are often formalized as retrieval topics to avoid an excessive number of relevant documents while maintaining good agreement. However, emerging evaluation setups that use Large Language Models (LLMs) as relevance assessors often use only queries, potentially decreasing reliability. To study whether LLM relevance assessors benefit from formalized information needs, we synthetically formalize information needs with LLMs into topics that follow the established structure from previous human relevance assessments (i.e., descriptions and narratives). We compare assessors using synthetically formalized topics against the LLM-default query-only assessor on Robust04 and the 2019/2020 editions of TREC Deep Learning. We find that assessors without formalization judg…
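The abstract contrasts a query-only LLM assessor with one that also sees a topic's description and narrative, and it motivates the comparison through inter-assessor agreement. The Python sketch below illustrates both ideas under stated assumptions. The `Topic` class, `build_assessor_prompt`, and the prompt wording are hypothetical illustrations, not the authors' prompts. Agreement is measured with Cohen's kappa, a standard chance-corrected agreement statistic; the abstract does not say which agreement measure the paper uses.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Topic:
    """Hypothetical TREC-style topic: a short query plus optional formalization."""
    query: str
    description: Optional[str] = None
    narrative: Optional[str] = None


def build_assessor_prompt(topic: Topic, passage: str, use_formalization: bool) -> str:
    """Assemble a binary relevance prompt (illustrative wording only).

    With use_formalization=False, only the query is shown (query-only assessor);
    otherwise the description and narrative are added when present.
    """
    parts = [f"Query: {topic.query}"]
    if use_formalization:
        if topic.description:
            parts.append(f"Description: {topic.description}")
        if topic.narrative:
            parts.append(f"Narrative: {topic.narrative}")
    parts.append(f"Passage: {passage}")
    parts.append("Is the passage relevant to the information need? "
                 "Answer 1 (relevant) or 0 (not relevant).")
    return "\n".join(parts)


def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa between two assessors' labels on the same documents."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(a)
    # Observed agreement: share of documents with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both assessors labeled independently at their own rates.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    if expected == 1.0:
        # Both assessors used one identical label throughout; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy topic and labels, invented for illustration only.
    topic = Topic(query="example query",
                  description="What the searcher wants to find.",
                  narrative="Which documents count as relevant and which do not.")
    print(build_assessor_prompt(topic, "Some passage text.", use_formalization=True))
    human = [1, 1, 0, 0]
    llm = [1, 0, 0, 0]
    print(cohen_kappa(human, llm))  # → 0.5
```

Kappa is used here rather than raw percent agreement because it discounts the agreement two assessors would reach by chance, which matters when relevant documents are rare.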
Related papers
- Rebol: Retrieval Via Bayesian Optimization With Batched LLM Relevance Observations And Query Reformulation (2026) · 0.00
- One-shot Labeling For Automatic Relevance Estimation (2023) · 12.25
- Hard Negatives, Hard Lessons: Revisiting Training Data Quality For Robust Information Retrieval With LLMs (2025) · 2.26
- Toward Automatic Relevance Judgment Using Vision-Language Models For Image-Text Retrieval Evaluation (2024) · 0.00
- The Overlooked Role Of Graded Relevance Thresholds In Multilingual Dense Retrieval (2026) · 0.00
- A Comparative Study Of Specialized LLMs As Dense Retrievers (2025) · 2.26
- EnrichIndex: Using LLMs To Enrich Retrieval Indices Offline (2025) · 0.00
- Scaling Sparse And Dense Retrieval In Decoder-only LLMs (2025) · 6.34