OSWorld
Canonical · 28 papers using it
First seen: 2024
OSWorld is a benchmark of multi-application environment configurations for evaluating how well computer-use agents (CUAs) interact with graphical desktops.
Papers using OSWorld (28)
- Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
- Mobile-agent-v3.5: Multi-platform Fundamental GUI Agents
- OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
- Multi-Agent Computer Use
- MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
- Windows Agent Arena: Evaluating Multi-modal OS Agents At Scale
- iOSWorld: A Benchmark for Personally Intelligent Phone Agents
- OSGuard: A Benchmark for Safety in Computer-Use Agents
- Infantagent-next: A Multimodal Generalist Agent For Automated Computer Interaction
- ProCUA-SFT Technical Report
- OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
- Infiniteweb: Scalable Web Environment Synthesis For GUI Agent Training
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
- Agent Alpha: Tree Search Unifying Generation, Exploration and Evaluation for Computer-Use Agents
- Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
- CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents
- BEAP-Agent: Backtrackable Execution and Adaptive Planning for GUI Agents
- Grounding Computer Use Agents on Human Demonstrations
- Agentic Lybic: Multi-Agent Execution System with Tiered Reasoning and Orchestration
- CoAct-1: Computer-using Multi-Agent System with Coding Actions
- Surfer 2: The Next Generation Of Cross-platform Computer Use Agents
- Mano Technical Report
- Scaling Agents For Computer Use
- Instruction Agent: Enhancing Agent With Expert Demonstration
- UI-TARS-2 Technical Report: Advancing GUI Agent With Multi-turn Reinforcement Learning
- Ui-evol: Automatic Knowledge Evolving For Computer Use Agents
- Osworld: Benchmarking Multimodal Agents For Open-ended Tasks In Real Computer Environments
- Agent S: An Open Agentic Framework that Uses Computers Like a Human