Dataset · February 25, 2026

ToolMind-Web-QA: Synthetic Multi‑Hop Web‑Search QA for Long‑Horizon Agents

ToolMind-Web-QA is a publicly released, synthetic dataset created by Nanbeige for research on search‑augmented and long‑horizon search agents. It contains roughly 6,000 complex question‑answer pairs generated from Wikipedia entity‑relation knowledge graphs, together with full interaction trajectories that average more than 100 turns per conversation. The dataset emphasizes multi‑hop reasoning, evidence‑grounded answers, and the use of external tools such as Serper and Jina for web search and scraping.
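To make the shape of the data concrete, the sketch below shows one plausible record layout and a small helper for summarizing a trajectory. The field names (`question`, `trajectory`, `critical`, the tool labels) are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record layout for one ToolMind-Web-QA trajectory; the field
# names below are assumptions for illustration, not the dataset's schema.
example = {
    "question": "Which country ...?",
    "answer": "Poland",
    "trajectory": [
        {"role": "assistant", "tool": "search_and_scrape", "critical": True},
        {"role": "tool", "tool": "search_and_scrape", "critical": False},
        {"role": "assistant", "tool": "jina_scrape", "critical": True},
    ],
}

def turn_stats(record):
    """Count total turns, critical turns, and per-tool action counts."""
    turns = record["trajectory"]
    tools = {}
    for turn in turns:
        tools[turn["tool"]] = tools.get(turn["tool"], 0) + 1
    critical = sum(1 for t in turns if t["critical"])
    return {"turns": len(turns), "critical": critical, "tools": tools}
```

A helper like this would reproduce the per-trajectory statistics reported below (turn counts, action counts per tool) from raw records.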

The data construction pipeline selects head entities updated within the last six months to ensure temporal relevance, then synthesizes questions by performing random walks over the entity‑relation graph and converting the resulting paths into natural‑language queries. Trajectories are generated with the Mirothinker framework, and a judging step identifies and retains critical turns, yielding an average of 7.25 critical turns per trajectory despite a much larger total turn count. On average, each trajectory contains 45 "Search and Scrape Webpage" actions and 20 "Jina Scrape" actions, underscoring the dataset's focus on tool‑driven search.
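The random‑walk synthesis step can be sketched as follows. The toy graph, relation names, and question templating are illustrative assumptions; the authors' actual pipeline is not specified at this level of detail.

```python
import random

# Toy entity-relation graph: entity -> list of (relation, tail_entity).
# The entities and relations here are illustrative, not from the dataset.
GRAPH = {
    "Marie Curie": [("born_in", "Warsaw"), ("field", "Physics")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("continent", "Europe")],
}

def random_walk(graph, head, hops, rng):
    """Walk up to `hops` edges from `head`, returning the traversed triples."""
    path, node = [], head
    for _ in range(hops):
        edges = graph.get(node)
        if not edges:
            break
        relation, tail = rng.choice(edges)
        path.append((node, relation, tail))
        node = tail
    return path

def path_to_question(path):
    """Verbalize a relation path into a nested multi-hop question."""
    head = path[0][0]
    phrase = f"the {path[0][1].replace('_', ' ')} of {head}"
    for _, relation, _ in path[1:]:
        phrase = f"the {relation.replace('_', ' ')} of {phrase}"
    return f"What is {phrase}?"

rng = random.Random(0)
walk = random_walk(GRAPH, "Marie Curie", hops=3, rng=rng)
question = path_to_question(walk)
answer = walk[-1][2]  # the final tail entity is the gold answer
```

A 3‑hop walk such as Marie Curie → Warsaw → Poland → Europe would verbalize to "What is the continent of the capital of of the born in of Marie Curie?" before rewriting; a production pipeline would use an LLM or templates to smooth such paths into fluent queries.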

Benchmark results included in the README compare several large language models on benchmarks such as GAIA, BrowseComp, and xBench‑Deepsearch. The authors’ own models (ToolMind‑Web‑3B variants and Nanbeige4.1‑3B) achieve competitive scores, particularly on xBench‑Deepsearch‑05, where the full ToolMind‑Web‑3B reaches 0.751. An ablation study shows that retaining non‑critical turns in context (while excluding them from the loss) improves performance (0.60 vs. 0.33), underscoring the importance of long‑context information for tool‑use scaling.
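The ablation's "keep in context, exclude from loss" setup is typically implemented by label masking. A minimal sketch, assuming the HF/PyTorch convention of −100 ignore labels and hypothetical per‑turn token spans:

```python
IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy loss

def build_labels(token_ids, turn_spans, critical_turns):
    """Copy token ids into labels, masking tokens outside critical turns.

    token_ids: full tokenized trajectory (all turns stay in the input).
    turn_spans: list of (start, end) token index ranges, one per turn.
    critical_turns: set of turn indices whose tokens contribute to the loss.
    Non-critical turns remain visible as context but are excluded from loss.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for turn_idx, (start, end) in enumerate(turn_spans):
        if turn_idx in critical_turns:
            labels[start:end] = token_ids[start:end]
    return labels

# Hypothetical 10-token trajectory with three turns; only turn 1 is critical.
token_ids = list(range(100, 110))
turn_spans = [(0, 3), (3, 7), (7, 10)]
labels = build_labels(token_ids, turn_spans, critical_turns={1})
```

Because the ignored tokens still occupy positions in the input, the model attends to the full long context while gradients flow only through the critical turns.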

Overall, ToolMind-Web-QA provides a rich, validated resource for training and evaluating agents that must plan, search, and reason over extended interactions with web‑based tools. The dataset is licensed under Apache‑2.0 and can be cited via the provided arXiv reference (2602.13367).

Project Ideas

  1. Fine‑tune a transformer model on ToolMind-Web-QA to build a search‑augmented QA system that can retrieve and cite Wikipedia evidence.
  2. Develop a benchmark suite for long‑context reasoning by measuring model performance on the critical vs. non‑critical turn splits in the trajectories.
  3. Create a tool‑use curriculum that gradually introduces Serper and Jina actions using the dataset's multi‑hop search sequences.
  4. Train a lightweight 3B‑parameter agent that mimics the Mirothinker framework and evaluate it on the xBench‑Deepsearch‑05 benchmark.
  5. Design a visualization dashboard that maps the multi‑hop entity‑relation paths underlying each QA pair to help interpret model reasoning.