Abstract

Retrieving relevant observations from long multi-modal web interaction histories is challenging because relevance depends on the evolving task state, modality (screenshots, HTML text, structured signals), and temporal distance. Prior approaches typically rely on static similarity thresholds or fixed-capacity buffers, which fail to adapt relevance to the current task context. We propose \textbf{ACGM}, a learned graph-memory retriever that constructs *task-adaptive* relevance graphs over agent histories using policy-gradient optimization from downstream task success. ACGM captures heterogeneous temporal dynamics with modality-specific decay (visual decays \(4.3\times\) faster than text: \(\lambda_v = 0.47\) vs.\ \(\lambda_x = 0.11\)) and learns sparse connectivity (3.2 edges/node), enabling efficient \(O(\log T)\) retrieval. Across WebShop, VisualWebArena, and Mind2Web, ACGM improves retrieval quality to \textbf{82.7 nDCG@10} (+9.3 over GPT-4o, \(p < 0.001\)) and \textbf{89.2%
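The modality-specific decay rates quoted in the abstract can be made concrete with a small sketch. The abstract gives only the two rates, so the exponential form \(w = e^{-\lambda \Delta t}\), the unit of \(\Delta t\) (agent steps), and the names `LAMBDA`, `decay_weight`, and `half_life` are all assumptions made for illustration, not the paper's stated method.

```python
import math

# Decay rates quoted in the abstract (lambda_v for visual, lambda_x for text).
LAMBDA = {"visual": 0.47, "text": 0.11}


def decay_weight(modality: str, steps_ago: float) -> float:
    """Hypothetical temporal weight exp(-lambda * dt) for a past observation.

    The exponential form and the step-based time unit are assumptions;
    the abstract only reports the rates themselves.
    """
    if steps_ago < 0:
        raise ValueError("steps_ago must be non-negative")
    return math.exp(-LAMBDA[modality] * steps_ago)


def half_life(modality: str) -> float:
    """Time after which the assumed exponential weight drops to one half."""
    return math.log(2) / LAMBDA[modality]


# Ratio of the two rates, which the abstract rounds to 4.3x.
ratio = LAMBDA["visual"] / LAMBDA["text"]

if __name__ == "__main__":
    print(f"rate ratio: {ratio:.2f}")
    for modality in LAMBDA:
        print(f"{modality}: half-life = {half_life(modality):.2f} steps")
```

Under this assumed form, a screenshot loses half its weight in roughly one and a half steps while HTML text takes a little over six, which is one way to read the claim that visual observations decay faster than text.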

Tags

  • Image Retrieval
