Dataset · April 14, 2026

Hermes Agent Reasoning Traces: Real Tool-Calling Trajectories for AI Agent Training

The **Hermes Agent Reasoning Traces** dataset, released by Lambda, provides multi‑turn tool‑calling trajectories captured from two powerful LLMs: Moonshot AI's Kimi‑K2.5 and ZhipuAI's GLM‑5.1‑FP8. Each sample contains a full conversation in ShareGPT format, enriched with step‑by‑step chain‑of‑thought reasoning (`<think>` blocks), explicit function invocations (`<tool_call>`), and the actual results of those calls (`<tool_response>`). The dataset is split into two configs, **kimi** (7,646 samples) and **glm-5.1** (7,055 samples), and is stored in optimized Parquet files, making it easy to load with the `datasets` library.
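As a quick illustration, one config can be loaded like this; the repository id below is a placeholder, not the dataset's published path:

```python
from datasets import load_dataset

# Placeholder repository id -- swap in the dataset's actual Hugging Face path.
ds = load_dataset("your-org/hermes-agent-reasoning-traces", "kimi", split="train")

print(len(ds))       # expected: 7,646 rows for the kimi config
print(ds[0].keys())  # inspect the per-sample schema
```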

The traces cover a diverse nine‑category taxonomy spanning terminal and coding tasks, browser automation, repository work, specialized agent tool usage, file operations, and planning. The Kimi config averages 24.3 turns and 13.9 tool calls per sample, reflecting rich interaction depth. All tool executions are real (e.g., Playwright‑driven browsing, file manipulation, code compilation), not synthetic, which makes the data valuable for training agents that must both reason and act reliably.
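Those numbers can be sanity-checked with a minimal sketch like the one below, assuming each sample stores its conversation under a ShareGPT-style `conversations` field of `{"from": ..., "value": ...}` turns (the field names are assumptions, not the published schema):

```python
import re

def trace_stats(sample, conv_key="conversations"):
    # Assumed ShareGPT layout: a list of {"from": role, "value": text} turns.
    turns = sample[conv_key]
    text = " ".join(t.get("value", "") or "" for t in turns)
    return {
        "turns": len(turns),
        "tool_calls": len(re.findall(r"<tool_call>", text)),
        "think_blocks": len(re.findall(r"<think>", text)),
    }

# Average turns / tool calls across the config, using `ds` from the loading
# example above; compare against the reported 24.3 and 13.9.
stats = [trace_stats(s) for s in ds]
print(sum(x["turns"] for x in stats) / len(stats))
print(sum(x["tool_calls"] for x in stats) / len(stats))
```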

Generated with the Hermes‑agent harness and the `hermes-agent-generator` pipeline, the dataset includes the exact tool definitions available to the agents, enabling reproducible fine‑tuning or evaluation of new models on realistic tool‑use scenarios. Licensed under Apache‑2.0, it is ready for research and commercial use.
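For fine‑tuning, one plausible preprocessing step is mapping each ShareGPT trace onto chat-format messages before applying a tokenizer's chat template; the role mapping below is an assumption about the schema, not documented behavior:

```python
# Assumed ShareGPT role names; adjust to whatever the actual traces use.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

def to_messages(sample, conv_key="conversations"):
    # Convert a ShareGPT-style trace into OpenAI-style chat messages.
    return [
        {"role": ROLE_MAP.get(turn["from"], "user"), "content": turn["value"]}
        for turn in sample[conv_key]
    ]

# With any Hugging Face tokenizer that defines a chat template:
# text = tokenizer.apply_chat_template(to_messages(ds[0]), tokenize=False)
```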

Because the dataset captures both reasoning and execution, it can serve as a benchmark for assessing an agent's ability to plan, invoke tools, and incorporate tool feedback—a critical capability for emerging autonomous AI assistants.
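As a sketch of what such a benchmark check might look like, the helper below compares the tool calls a model emits against the reference trace, assuming each call is serialized as a JSON object inside `<tool_call>` tags (an assumption about the format):

```python
import json
import re

TOOL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def tool_calls(text: str) -> list:
    # Parse every <tool_call>{...}</tool_call> span; skip malformed JSON.
    calls = []
    for match in TOOL_RE.findall(text):
        try:
            calls.append(json.loads(match))
        except json.JSONDecodeError:
            pass
    return calls

def exact_match(reference: str, prediction: str) -> bool:
    # True only if the model reproduced the same calls, in the same order.
    return tool_calls(reference) == tool_calls(prediction)
```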

Project Ideas

  1. Fine‑tune a small LLM to become a tool‑calling agent using the Kimi or GLM‑5.1 traces as supervised data.
  2. Create an evaluation suite that measures how well a model reproduces the `<think>` reasoning depth and tool‑call accuracy on the dataset.
  3. Build a retrieval‑augmented system that selects relevant past traces from the dataset to guide new agent conversations.
  4. Develop a synthetic data generator that expands the existing categories by recombining conversation segments while preserving tool execution consistency.
  5. Train a classifier to predict the high‑level task category (e.g., Terminal & Coding, Browser Automation) from the initial user prompt using the provided `category` field (see the sketch after this list).
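For idea 5, a minimal baseline could pair TF‑IDF features with logistic regression, assuming each sample exposes a `category` label and that the first conversation turn holds the initial user prompt (both are schema assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

prompts = [s["conversations"][0]["value"] for s in ds]  # assumed prompt location
labels = [s["category"] for s in ds]                    # assumed label field

X_tr, X_te, y_tr, y_te = train_test_split(
    prompts, labels, test_size=0.2, stratify=labels, random_state=0
)
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```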