AppWorld
Emerging
15 papers using it
First seen: 2025
AppWorld is a benchmark comprising a collection of applications and their associated environments. It is used to evaluate how well language agents understand and execute user instructions in complex contexts.
Papers using AppWorld (15)
- From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents
- From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws
- When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms
- ACCORD: Action-Conditioned Contextual Grounding for Language Agents
- Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents
- ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents
- Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation
- HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents
- Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents
- Three Roles, One Model: Role Orchestration At Inference Time To Close The Performance Gap Between Small And Large Agents
- CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
- Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
- Reinforcement Learning for Self-Improving Agent with Skill Library
- ACON: Optimizing Context Compression for Long-horizon LLM Agents
- Prost: Progressive Sub-task Training For Pareto-optimal Multi-agent Systems Using Small Language Models