dataset June 26, 2026

AgentWorldBench: Benchmarking Language World Models Across Real‑World Environments

AgentWorldBench, released by Qwen, is a curated evaluation benchmark for language world models that simulate environments such as APIs, search engines, terminals, IDEs, Android UIs, web browsers, and desktop OSes. The dataset comprises 2,170 multi‑turn trajectories collected from real executions of frontier models on benchmarks such as Tool Decathlon, Terminal‑Bench 1.0/2.0, and OSWorld‑Verified. Each turn includes the original action prompt, the ground‑truth environment observation, and a system prompt that defines the world‑model role.
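To make the per-turn structure concrete, here is a minimal sketch of how a turn might be represented and parsed from JSON lines. The field names `system_prompt`, `action_prompt`, and `observation` are assumptions chosen to mirror the three components described above; the released files may use different keys, so check the dataset schema before relying on them. The sample record is invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Turn:
    # Field names are hypothetical; they mirror the three per-turn
    # components named in the text, not a confirmed schema.
    system_prompt: str   # defines the world-model role
    action_prompt: str   # the agent action sent to the environment
    observation: str     # ground-truth environment output


def parse_turns(lines: Iterable[str]) -> Iterator[Turn]:
    """Yield one Turn per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield Turn(
            system_prompt=record["system_prompt"],
            action_prompt=record["action_prompt"],
            observation=record["observation"],
        )


# Invented sample record, not taken from the dataset.
sample_lines = [
    json.dumps({
        "system_prompt": "You simulate a Linux terminal.",
        "action_prompt": "ls /tmp",
        "observation": "a.txt\nb.txt",
    }),
    "",
]
turns = list(parse_turns(sample_lines))
print(len(turns), turns[0].action_prompt)
```

In practice the same function could be fed an open file handle, since iterating over a text file yields one line at a time.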

The benchmark scores model predictions on five dimensions (Format, Factuality, Consistency, Realism, and Quality), giving a reference‑grounded assessment of the reasoning, knowledge, and long‑context capabilities that faithful environment simulation requires. The repository supplies a standalone evaluation script that runs model inference, scores the outputs with an LLM judge (e.g., OpenAI's GPT‑5.2), and aggregates the results. The dataset is stored in per‑domain JSONL files and is licensed under Apache‑2.0.
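The aggregation step can be sketched as a per-domain mean over the five dimensions. This is not the repository's evaluation script: the record layout (`domain`, `scores`), the numeric scale, and the sample values below are all assumptions made up for illustration, and the actual script may weight or combine scores differently.

```python
from collections import defaultdict
from statistics import mean

# The five dimensions named in the benchmark description.
DIMENSIONS = ("Format", "Factuality", "Consistency", "Realism", "Quality")


def aggregate(scored_turns):
    """Return {domain: {dimension: mean score}} for judged turns.

    Each item is assumed (hypothetically) to look like
    {"domain": str, "scores": {dimension: number}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for item in scored_turns:
        for dim in DIMENSIONS:
            buckets[item["domain"]][dim].append(item["scores"][dim])
    return {
        domain: {dim: mean(values) for dim, values in dims.items()}
        for domain, dims in buckets.items()
    }


# Placeholder judge scores, not real benchmark results.
judged = [
    {"domain": "Terminal", "scores": {"Format": 5, "Factuality": 4,
                                      "Consistency": 4, "Realism": 3, "Quality": 4}},
    {"domain": "Terminal", "scores": {"Format": 3, "Factuality": 2,
                                      "Consistency": 4, "Realism": 5, "Quality": 2}},
    {"domain": "Web", "scores": {"Format": 4, "Factuality": 4,
                                 "Consistency": 3, "Realism": 3, "Quality": 3}},
]
summary = aggregate(judged)
for domain, dims in summary.items():
    print(domain, dims)
```

Keeping the per-dimension means separate, rather than collapsing them into one number, preserves the distinction between, say, a well-formatted but factually wrong observation and the reverse.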

AgentWorldBench spans seven domains (MCP, Search, Terminal, SWE, Android, Web, OS) and grounds its evaluation in real‑world observations, which makes it useful for researchers developing or benchmarking language models that act as world simulators for autonomous agents. The accompanying technical report (arXiv:2606.24597) and blog post give further context on the evaluation methodology.

Project Ideas

Fine‑tune a language model on the AgentWorldBench prompts and observations to improve its ability to predict realistic environment states.
Build a benchmarking pipeline that runs multiple LLMs on the dataset and visualizes their scores across the five evaluation dimensions.
Create a domain‑specific LLM judge that replaces the OpenAI API, using the provided system prompts to evaluate model outputs locally.
Develop an interactive dashboard that lets users explore individual trajectories, compare model predictions with ground‑truth observations, and filter by domain or turn index.
Train a classifier that predicts the score for a given evaluation dimension (e.g., Factuality) of a generated observation, using the scores from the benchmark as supervision.
