A Reference Architecture For Agentic Hybrid Retrieval In Dataset Search
2026 · Riccardo Terrenzi, Phongsakon Mark Konrad, Tim Lukas Adam, et al.
Abstract
Ad hoc dataset search requires matching underspecified natural-language queries against sparse, heterogeneous metadata records, a task where typical lexical or dense retrieval alone falls short. We reposition dataset search as a software-architecture problem and propose a bounded, auditable reference architecture for agentic hybrid retrieval that combines BM25 lexical search with dense-embedding retrieval via reciprocal rank fusion (RRF), orchestrated by a large language model (LLM) agent that repeatedly plans queries, evaluates the sufficiency of results, and reranks candidates. To reduce the vocabulary mismatch between user intent and provider-authored metadata, we introduce an offline metadata augmentation step in which an LLM generates pseudo-queries for each dataset record, augmenting both retrieval indexes before query time. Two architectural styles are examined: a single ReAct agent and a multi-agent horizontal architecture with Feedback Control. Their quality-attribute tradeoff
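To illustrate the fusion step named in the abstract, the following is a minimal sketch of reciprocal rank fusion over a lexical and a dense ranking. It is not the authors' implementation: the function name, the dataset identifiers, and the smoothing constant k = 60 (a commonly used default) are assumptions made for the example, and the two input rankings are hard-coded stand-ins for real BM25 and embedding results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives the sum of 1 / (k + rank) over every list in
    which it appears, with ranks starting at 1. k = 60 is an assumed
    default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever (illustrative IDs only).
bm25_ranking = ["ds_climate", "ds_traffic", "ds_census"]
dense_ranking = ["ds_census", "ds_climate", "ds_health"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only rank positions, it needs no calibration between BM25 scores and embedding similarities, which is one plausible reason to prefer it when combining heterogeneous retrievers.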