Dataset · February 27, 2026

OpenResearcher Dataset: Structured LLM Interaction Traces for Tool‑Use Research

The OpenResearcher/OpenResearcher-Dataset is a curated collection of 6,102 multi‑turn conversational examples, each tied to a unique question (qid), a reference answer, and a detailed message log. The logs capture the full interaction flow, including channel metadata, timestamps, knowledge cut‑off dates, model identities, reasoning effort, and tool invocations (e.g., browser actions with parameters such as query, top‑n, and view_source). Each record also stores latency, status, and error information, making it a rich source for both performance analysis and training data.
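Concretely, a single record might be shaped like the following sketch. Every field name and value here is illustrative, inferred from the description above rather than taken from the dataset itself:

```python
# Hypothetical record shape; all field names and values are assumptions
# based on the dataset description, not the actual schema.
example = {
    "qid": "q-000001",
    "question": "...",            # unique question text
    "answer": "...",              # reference answer
    "latency": 4.7,               # end-to-end latency (unit unspecified)
    "attempts": 1,
    "status": "ok",
    "error": None,
    "messages": [                 # full multi-turn interaction log
        {
            "role": "assistant",
            "channel": "analysis",
            "timestamp": "2026-02-27T12:00:00Z",
            "tools": {
                "browser": {"query": "...", "top_n": 5, "view_source": False},
            },
        },
    ],
}

# The mix of flat scalars (qid, latency) and nested logs (messages) is what
# makes the records usable for both analytics and training.
print(sorted(example))
```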

The dataset is provided in optimized Parquet format and can be loaded via the Hugging Face datasets library, as well as Dask, Polars, and mlcroissant, supporting efficient tabular and text processing at the ~1 GB scale. Its seeded configurations (seed_42 through seed_48) expose an identical schema with slight variations; all are released under an MIT license and hosted in the US region. Created by the OpenResearcher team, it has quickly gained traction, with over 13,000 downloads and a trending score of 37 reflecting community interest in tool‑augmented LLM research.
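Assuming the config names and a `train` split exist as described on the card, a minimal streaming-load sketch with the Hugging Face datasets library might look like:

```python
from typing import Iterator


def iter_examples(config: str = "seed_42") -> Iterator[dict]:
    """Stream records from one seeded configuration.

    The config name and split are assumptions based on the dataset card.
    Streaming avoids materializing ~1 GB of Parquet in memory; each yielded
    dict carries the tabular fields (qid, latency, attempts) alongside the
    nested messages log.
    """
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset(
        "OpenResearcher/OpenResearcher-Dataset",
        name=config,
        split="train",
        streaming=True,
    )
    yield from ds
```

The same Parquet files could equally be scanned lazily with Polars or partitioned with Dask when the workload is purely tabular.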

Because the data includes explicit tool usage structures and latency metrics, it is especially valuable for developers building retrieval‑augmented generation systems, evaluating reasoning speed versus quality, or studying how LLMs interact with external APIs. The mixture of tabular fields (qid, latency, attempts) and nested text fields (messages) enables flexible downstream pipelines for supervised fine‑tuning, prompt engineering, or analytics.
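One such downstream pipeline is flattening the nested message logs into tabular rows for analytics. A minimal sketch, with all field names assumed from the description above:

```python
def flatten(records):
    """Flatten nested records into one row per browser invocation.

    Field names (messages, tools, browser, query, top_n) are assumptions
    based on the dataset description, not the verified schema.
    """
    rows = []
    for rec in records:
        for msg in rec.get("messages", []):
            browser = (msg.get("tools") or {}).get("browser")
            if browser:
                rows.append({
                    "qid": rec["qid"],
                    "latency": rec.get("latency"),
                    "query": browser.get("query"),
                    "top_n": browser.get("top_n"),
                })
    return rows


# Tiny illustrative sample: one record, two messages, one browser call.
sample = [{
    "qid": "q1",
    "latency": 2.5,
    "messages": [
        {"role": "assistant", "tools": {"browser": {"query": "foo", "top_n": 5}}},
        {"role": "assistant", "tools": {}},
    ],
}]

print(flatten(sample))  # one flat row combining qid, latency, and tool parameters
```

Rows in this shape drop straight into Polars or pandas for aggregation, or into a supervised fine-tuning preprocessor.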

Project Ideas

  1. Fine‑tune a conversational model to predict tool‑use actions by training on the 'messages.tools.browser' structures.
  2. Analyze latency vs. reasoning effort across examples to benchmark the trade‑off between speed and answer quality.
  3. Create a retrieval‑augmented QA system that uses the 'question' and 'answer' pairs as a reference corpus and evaluates against the provided answers.
  4. Build a visualization dashboard (using, e.g., Polars for the aggregation layer) that shows tool invocation frequencies and parameter distributions across the dataset.
  5. Develop a synthetic chat simulator that reproduces the multi‑turn dialogue flow using the 'messages' hierarchy for prompt engineering experiments.
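Idea 2 could start from a grouping helper like this sketch; the `reasoning_effort` and `latency` field names are assumptions based on the description, and the sample values are illustrative:

```python
from collections import defaultdict
from statistics import mean


def latency_by_effort(records):
    """Mean latency per reasoning-effort label (field names assumed)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec.get("reasoning_effort", "unknown")].append(rec["latency"])
    return {effort: mean(vals) for effort, vals in buckets.items()}


# Illustrative sample records, not real dataset values.
sample = [
    {"reasoning_effort": "low", "latency": 1.0},
    {"reasoning_effort": "low", "latency": 3.0},
    {"reasoning_effort": "high", "latency": 8.0},
]

print(latency_by_effort(sample))  # {'low': 2.0, 'high': 8.0}
```

Pairing these means with an answer-quality score against the reference answers would give the speed-versus-quality benchmark the idea describes.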