One Million Reasoning Traces: KIMI‑K2.5‑1000000x Dataset
The **KIMI‑K2.5‑1000000x** dataset, authored by *ianncity* and released in March 2026, contains one million distilled reasoning traces generated from the KIMI‑K2.5 model on high‑level reasoning tasks. Packaged as a JSON Lines (JSONL) collection of roughly 20 GB (about 5 billion tokens), it is listed in the 100 K < n < 1 M size category and is released under the permissive Apache‑2.0 license, which allows commercial use, modification, and redistribution provided the license and notice files are retained.
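Given the JSONL packaging and the roughly 20 GB footprint, streaming is the practical way to explore the data. Below is a minimal sketch using the Hugging Face `datasets` library; the repo id `ianncity/KIMI-K2.5-1000000x` and the record schema are assumptions for illustration, not details confirmed by the card.

```python
from datasets import load_dataset

# Assumed Hugging Face repo id; adjust to the dataset's actual path.
ds = load_dataset(
    "ianncity/KIMI-K2.5-1000000x",
    split="train",
    streaming=True,  # stream rather than download the full ~20 GB
)

# Inspect the first record to see the schema.
print(next(iter(ds)))
```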
The dataset is organized into four configurations, each focusing on a different domain: General‑Distillation, PHD‑Science, General‑Math, and MultilingualSTEM. Its content breaks down as follows:

- Coding: 50 % of entries, covering languages such as Python, C++, and Java
- Science: 20 %, with a dedicated PHD‑Science subset
- Mathematics: 15 %, including a 200 k‑entry math subset
- Computer science, logical questions, and creative writing
- A multilingual STEM subset

The tags and task categories indicate that the data is suitable for **text generation**, **question answering**, **instruction tuning**, **chain‑of‑thought reasoning**, and **supervised fine‑tuning (SFT)**.
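Individual configurations should be loadable by name through the same API. A sketch, assuming the configuration strings on the Hub match the names listed above:

```python
from datasets import load_dataset

# Configuration names are taken from the card; the exact strings on
# the Hub may differ (e.g. "PHD-Science" vs. "phd_science").
science = load_dataset(
    "ianncity/KIMI-K2.5-1000000x",  # assumed repo id
    name="PHD-Science",
    split="train",
    streaming=True,
)

# Print a few science records for a quick sanity check.
for _, row in zip(range(3), science):
    print(row)
```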
The traces were generated over roughly 80 hours using a modified version of the Datagen tool from the TeichAI community. Because they are distilled from a strong reasoning model, the traces serve as high‑quality examples for training or evaluating large language models on reasoning‑heavy tasks, especially coding assistance, scientific QA, and multilingual STEM problem solving.
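For the SFT use case, a minimal fine‑tuning sketch with TRL's `SFTTrainer` might look like the following. The repo id, the `prompt`/`response` field names, and the small base model are all assumptions chosen for illustration:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed repo id and field names ("prompt"/"response"); check the card.
ds = load_dataset("ianncity/KIMI-K2.5-1000000x", split="train[:10000]")

def to_text(example):
    # Collapse each trace into the single "text" field SFTTrainer
    # expects by default.
    return {"text": example["prompt"] + "\n" + example["response"]}

ds = ds.map(to_text, remove_columns=ds.column_names)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # any small causal LM works for a smoke test
    train_dataset=ds,
    args=SFTConfig(output_dir="kimi-k2.5-sft"),
)
trainer.train()
```

Collapsing each trace into a single `text` field matches the trainer's default expectation; for chat‑style training you would instead map records into a `messages` list.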
Project Ideas
- Fine‑tune a code‑completion model using the 50 % coding subset to improve multi‑language programming assistance.
- Create a chain‑of‑thought question‑answering bot for scientific topics by leveraging the PHD‑Science configuration.
- Build a multilingual STEM tutor that answers math and science queries using the MultilingualSTEM and General‑Math subsets.
- Develop a logical‑reasoning benchmark suite by extracting logical‑question traces for evaluating reasoning capabilities (a filtering sketch follows this list).
- Train an instruction‑tuned conversational agent that can perform creative writing and logical reasoning using the General‑Distillation config.
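As a concrete starting point for the benchmark idea above, the sketch below filters streamed records into a small evaluation file. The `category` field and its `"logic"` value are assumed labels; substitute whatever tagging the actual schema uses.

```python
import json
from datasets import load_dataset

# Assumed repo id; adjust to the dataset's actual path.
ds = load_dataset("ianncity/KIMI-K2.5-1000000x", split="train", streaming=True)

with open("logic_benchmark.jsonl", "w", encoding="utf-8") as f:
    kept = 0
    for row in ds:
        # "category" == "logic" is an assumed label for logical questions.
        if row.get("category") == "logic":
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            kept += 1
        if kept >= 1000:  # cap a small evaluation set
            break
```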