DeepPlanning: Benchmarking Long‑Horizon Agentic Planning with Constraints
DeepPlanning is a newly released dataset from Qwen that serves as a benchmark for evaluating the long‑horizon planning abilities of large language models (LLMs). It focuses on agentic tasks where models must satisfy verifiable constraints across multiple steps, such as budget limits, time windows, and product availability. The benchmark provides realistic multi‑day travel planning and multi‑product shopping scenarios that require proactive information acquisition (e.g., API calls), local constrained reasoning, and global constrained optimization.
The dataset is bilingual (English and Chinese) and is distributed in the WebDataset format, with between 1,000 and 10,000 examples. It is tagged for text-generation, reasoning, planning, and autonomous-agents, and is linked to arXiv paper 2601.18137, which details the benchmark design. It targets the evaluation of complex, multi-step decision making rather than isolated step-level reasoning, and it has been trending on Hugging Face.
Researchers and developers can use DeepPlanning to fine-tune or probe LLMs, compare planning strategies, or build new agents that must respect global constraints. Its two domains, travel and shopping, cover both temporal scheduling and combinatorial optimization, which makes it a useful testbed for AI assistants and autonomous agents.
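To make the idea of verifiable constraints concrete, here is a minimal Python sketch of a checker for a travel plan. It is only an illustration: the `Activity` fields, the `check_plan` function, and the `opening_hours` mapping are hypothetical names chosen for this example and do not reflect the dataset's actual schema or the benchmark's official evaluator. It checks one budget limit, per-venue time windows, and same-day overlaps.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    # Hypothetical schema for illustration; not the dataset's real format.
    name: str
    day: int           # 1-based day of the trip
    start_hour: float  # 24h clock, e.g. 13.5 means 13:30
    end_hour: float
    cost: float


def check_plan(activities, budget, opening_hours):
    """Return a list of human-readable constraint violations (empty if none)."""
    violations = []

    # Global constraint: total cost must stay within the budget.
    total = sum(a.cost for a in activities)
    if total > budget:
        violations.append(f"budget exceeded: {total:.2f} > {budget:.2f}")

    # Local constraint: each activity must fit inside its venue's time window.
    for a in activities:
        window = opening_hours.get(a.name)
        if window is not None:
            opens, closes = window
            if a.start_hour < opens or a.end_hour > closes:
                violations.append(f"{a.name} outside opening hours on day {a.day}")

    # Global constraint: activities on the same day must not overlap.
    by_day = {}
    for a in activities:
        by_day.setdefault(a.day, []).append(a)
    for day, items in by_day.items():
        items.sort(key=lambda a: a.start_hour)
        for prev, nxt in zip(items, items[1:]):
            if nxt.start_hour < prev.end_hour:
                violations.append(f"{prev.name} overlaps {nxt.name} on day {day}")

    return violations


if __name__ == "__main__":
    plan = [
        Activity("museum", 1, 9.0, 12.0, 20.0),
        Activity("boat tour", 1, 11.0, 13.0, 45.0),
        Activity("night market", 2, 18.0, 20.0, 30.0),
    ]
    for v in check_plan(plan, budget=80.0, opening_hours={"museum": (10.0, 17.0)}):
        print(v)
```

The budget and overlap checks depend on the whole plan, while the opening-hours check looks at one activity at a time, which mirrors the distinction between global and local constraints drawn above.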
Project Ideas
- Fine‑tune a multilingual LLM on DeepPlanning to improve its ability to generate budget‑aware travel itineraries.
- Create an evaluation pipeline that measures how well different LLMs satisfy global constraints in the shopping planning tasks.
- Develop a simulated autonomous agent that uses API calls to acquire real‑time information and solves the travel planning scenarios from the dataset.
- Benchmark various prompting strategies (zero‑shot, few‑shot, chain‑of‑thought) on DeepPlanning to identify the most effective approach for long‑horizon reasoning.
- Build a dashboard that visualizes constraint violations and optimization metrics for models tested on the DeepPlanning benchmark.
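As a starting point for the evaluation pipeline and dashboard ideas above, the following Python sketch aggregates per-task results into per-model statistics. The record keys (`model`, `task_id`, `violations`) and the `summarize` function are assumptions made for this example, and the strict "all constraints satisfied" pass rate is one possible metric, not necessarily the one defined in the paper.

```python
from collections import Counter, defaultdict


def summarize(results):
    """Aggregate per-task results into per-model statistics.

    `results` is an iterable of dicts with hypothetical keys:
      "model" (str), "task_id" (str),
      "violations" (list of constraint-type strings, empty if the plan is valid).
    """
    per_model = defaultdict(lambda: {"tasks": 0, "passed": 0, "violations": Counter()})
    for r in results:
        stats = per_model[r["model"]]
        stats["tasks"] += 1
        # A task counts as passed only if no constraint was violated.
        if not r["violations"]:
            stats["passed"] += 1
        stats["violations"].update(r["violations"])

    summary = {}
    for model, stats in per_model.items():
        summary[model] = {
            "pass_rate": stats["passed"] / stats["tasks"],
            "violations": dict(stats["violations"]),
        }
    return summary


if __name__ == "__main__":
    demo = [
        {"model": "model-a", "task_id": "t1", "violations": []},
        {"model": "model-a", "task_id": "t2", "violations": ["budget", "time_window"]},
        {"model": "model-b", "task_id": "t1", "violations": []},
    ]
    for model, stats in summarize(demo).items():
        print(model, stats)
```

The per-constraint counts in `violations` are the kind of breakdown a dashboard could plot directly, for example as one bar per constraint type per model.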