Task-adaptive Retrieval Over Agentic Multi-modal Web Histories Via Learned Graph Memory

Abstract

Retrieving relevant observations from long multi-modal web interaction histories is challenging because relevance depends on the evolving task state, modality (screenshots, HTML text, structured signals), and temporal distance. Prior approaches typically rely on static similarity thresholds or fixed-capacity buffers, which fail to adapt relevance to the current task context. We propose \textbf{ACGM}, a learned graph-memory retriever that constructs *task-adaptive* relevance graphs over agent histories using policy-gradient optimization from downstream task success. ACGM captures heterogeneous temporal dynamics with modality-specific decay (visual decays \(4.3\times\) faster than text: \(\lambda_v{=}0.47\) vs.\ \(\lambda_x{=}0.11\)) and learns sparse connectivity (3.2 edges/node), enabling efficient \(O(\log T)\) retrieval. Across WebShop, VisualWebArena, and Mind2Web, ACGM improves retrieval quality to \textbf{82.7 nDCG@10} (+9.3 over GPT-4o, \(p{<}0.001\)) and \textbf\{89.2%
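
As a rough illustration of the modality-specific decay quoted above, the following minimal sketch checks that the two rates are consistent with the stated \(4.3\times\) ratio. It assumes an exponential decay of the form \(\exp(-\lambda \Delta t)\) with \(\Delta t\) measured in interaction steps; the abstract gives only the rates, so the functional form, the time unit, and the names `decay_weight` and `half_life` are hypothetical and not taken from the paper.

```python
import math

# Decay rates quoted in the abstract.
LAMBDA_VISUAL = 0.47  # screenshots
LAMBDA_TEXT = 0.11    # HTML text


def decay_weight(rate: float, steps_ago: float) -> float:
    """Temporal weight under an ASSUMED exponential form exp(-rate * steps_ago)."""
    return math.exp(-rate * steps_ago)


def half_life(rate: float) -> float:
    """Number of steps after which the assumed exponential weight halves."""
    return math.log(2) / rate


# Ratio of the two rates: about 4.27, which rounds to the quoted 4.3x.
ratio = LAMBDA_VISUAL / LAMBDA_TEXT

if __name__ == "__main__":
    print(f"rate ratio: {ratio:.2f}")
    print(f"visual half-life: {half_life(LAMBDA_VISUAL):.2f} steps")
    print(f"text half-life:   {half_life(LAMBDA_TEXT):.2f} steps")
    for steps in (1, 5, 10):
        print(steps, decay_weight(LAMBDA_VISUAL, steps), decay_weight(LAMBDA_TEXT, steps))
```

Under this assumption, an observation's visual weight halves in roughly one and a half steps while its text weight takes a little over six, which is one way to read the claim that visual evidence goes stale faster than textual evidence.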
