AppWorld

Emerging

15 papers using it

First seen: 2025

AppWorld is a benchmark consisting of a collection of applications and their associated contexts. It is used to evaluate how well language agents ground user instructions in the relevant environmental information.


Papers using AppWorld (15)

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws (2026)

ACCORD: Action-Conditioned Contextual Grounding for Language Agents (2026)

Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents (2026)

ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents (2026)

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation (2026)

Metis: Bridging Text and Code Memory for Self-Evolving Agents (2026)

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents (2026)

Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents (2026)

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents (2026)

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution (2026)

Three Roles, One Model: Role Orchestration At Inference Time To Close The Performance Gap Between Small And Large Agents (2026)

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents (2026)

Reinforcement Learning for Self-Improving Agent with Skill Library (2025)

ACON: Optimizing Context Compression for Long-horizon LLM Agents (2025)

Prost: Progressive Sub-task Training For Pareto-optimal Multi-agent Systems Using Small Language Models (2025)