Complete FABLE.5 Traces 2M: A Massive Deduped Agentic Trace Corpus
Glint-Research/Complete-FABLE.5-traces-2M
The Complete FABLE.5 Traces 2M dataset, curated by Glint-Research, aggregates just over 2 million cleaned rows of agentic interaction traces from the FABLE.5 / Mythos family of corpora. A post‑closure curation pass removed rows containing the session‑limit assistant‑answer pattern, leaving 2,006,487 clean rows stored in both a columnar Parquet file (1.94 GiB) and a gzip‑compressed JSONL mirror. Each row retains provenance metadata (the original source dataset, config, split, and row index), so researchers can trace duplicate entries back to their origins.
The dataset is fully synthetic, generated by language models, and is annotated with tags such as "chain‑of‑thought", "tool‑use", "coding‑agents", and "llm‑traces", indicating that the rows capture complex reasoning, tool‑calling behavior, and code‑generation scenarios. Its primary task category is text generation for language modeling, making it suitable for training or fine‑tuning large language models on agentic workflows, reasoning chains, and programming tasks. The collection draws from 17 upstream Hugging Face datasets, providing a rich, deduplicated snapshot of the broader FABLE.5 / Mythos ecosystem.
Because the data is provided in a clean, canonical schema, with fields like `row_hash`, `seen_count`, and the original `row_json`, it can be loaded directly via the Hugging Face `datasets` library (`load_dataset("Glint-Research/Complete-FABLE.5-traces-2M", split="train")`). Researchers can use the provenance fields for source‑aware filtering, deduplication analysis, or cross‑dataset comparison. The MIT license permits broad reuse, including modification and commercial use, provided the copyright and license notice are retained, which encourages community‑driven exploration of agentic trace modeling and tool‑use behavior in LLMs.
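As a rough illustration of working with these fields, the sketch below streams the train split and unpacks `row_json` for each row. It assumes that `row_json` holds a JSON-encoded string, which the description above does not confirm. The helper names `stream_rows` and `decode_rows` and the sample values are made up for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator

REPO_ID = "Glint-Research/Complete-FABLE.5-traces-2M"


def stream_rows() -> Iterable[Dict[str, Any]]:
    """Stream the train split lazily instead of downloading it in full."""
    # Imported here so the helper below also works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=True)


def decode_rows(rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield each row with `row_json` parsed into a Python object.

    Assumption: `row_json` is a JSON-encoded string. Values that are not
    strings are passed through unchanged.
    """
    for row in rows:
        payload = row.get("row_json")
        if isinstance(payload, str):
            try:
                payload = json.loads(payload)
            except json.JSONDecodeError:
                payload = None  # mark malformed payloads as None rather than fail
        yield {
            "row_hash": row.get("row_hash"),
            "seen_count": row.get("seen_count"),
            "payload": payload,
        }


# Tiny in-memory stand-in for real rows (all values are invented).
SAMPLE_ROWS = [
    {"row_hash": "a1", "seen_count": 1, "row_json": '{"messages": []}'},
    {"row_hash": "b2", "seen_count": 3, "row_json": "not json"},
]

decoded = list(decode_rows(SAMPLE_ROWS))
```

Against the real dataset, the same helper could be applied as `decode_rows(stream_rows())`. Streaming avoids pulling the whole 1.94 GiB Parquet file before inspecting the first rows.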
Project Ideas
- Fine‑tune a language model on the dataset to improve chain‑of‑thought reasoning and tool‑use capabilities.
- Create an interactive explorer that visualizes provenance metadata, showing how rows originated from the 17 source datasets.
- Benchmark different code‑generation models on the coding‑agent traces to assess their ability to follow tool‑use instructions.
- Develop a deduplication analysis pipeline that uses `row_hash` and `seen_count` to study redundancy patterns across source corpora.
- Train a specialized agentic LLM that can generate multi‑step problem‑solving traces, using the dataset as supervised training data.
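
The deduplication-analysis idea above could start from something like the following sketch. It assumes that `seen_count` records how many times a row's content appeared across the upstream sources. It also assumes the provenance column is named `source_dataset`, a hypothetical name that should be checked against the real schema. The sample rows are invented.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Tuple

SOURCE_FIELD = "source_dataset"  # hypothetical column name; check the real schema


def redundancy_summary(
    rows: Iterable[Dict[str, Any]],
    source_field: str = SOURCE_FIELD,
) -> Tuple[Counter, Dict[str, Dict[str, int]]]:
    """Summarise redundancy from `seen_count`.

    Returns a histogram of `seen_count` values and, per source, the number
    of rows and how many of them were seen more than once.
    """
    seen_hist: Counter = Counter()
    per_source: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"rows": 0, "duplicated": 0}
    )
    for row in rows:
        # Treat a missing or empty count as a single occurrence (assumption).
        seen = int(row.get("seen_count") or 1)
        seen_hist[seen] += 1
        stats = per_source[row.get(source_field, "unknown")]
        stats["rows"] += 1
        if seen > 1:
            stats["duplicated"] += 1
    return seen_hist, dict(per_source)


# Invented sample rows standing in for the real data.
sample = [
    {"row_hash": "a1", "seen_count": 1, "source_dataset": "upstream-a"},
    {"row_hash": "b2", "seen_count": 3, "source_dataset": "upstream-a"},
    {"row_hash": "c3", "seen_count": 2, "source_dataset": "upstream-b"},
]

hist, by_source = redundancy_summary(sample)
```

Because the function makes a single pass over an iterable, it should also work on a streamed split without holding all rows in memory.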