KIMI-K2.5-450000x: 450K High‑Quality Reasoning Traces for LLM Tuning
The **KIMI-K2.5-450000x** dataset, authored by *ianncity*, contains 450,000 distilled reasoning traces generated by the KIMI‑K2.5 model under a "high" reasoning setting. Totaling 1.8 B tokens, the data spans six domains: coding (60%), science (15%), math (10%), computer science (5%), logical questions (5%), and creative writing (5%). Each entry is provided in JSON format and is annotated for chain‑of‑thought and instruction‑tuning, making the dataset suitable for both text‑generation and question‑answering tasks.
The dataset was collected in roughly 20 hours using a modified Datagen pipeline from **TeichAI** and is released under the Apache‑2.0 license. Tags such as *reasoning*, *chain-of-thought*, *instruction-tuning*, and *sft* indicate that the traces are intended for supervised fine‑tuning of large language models (LLMs) to improve their logical, step‑by‑step problem‑solving abilities. Compatibility with libraries such as **datasets**, **pandas**, **polars**, and **mlcroissant** makes it easy to slot the data into common data‑processing pipelines.
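As a minimal loading sketch with the **datasets** library, assuming the Hub repository ID is `ianncity/KIMI-K2.5-450000x` and the data ships as a single `train` split (verify both against the dataset card):

```python
from datasets import load_dataset

# Assumed Hub ID and split name; check the dataset card before running.
ds = load_dataset("ianncity/KIMI-K2.5-450000x", split="train")

print(ds)     # row count and column names
print(ds[0])  # inspect a single reasoning trace
```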
Because the collection emphasizes high‑quality reasoning across programming, scientific, and mathematical topics, it has quickly become a trending resource for researchers and developers looking to improve LLM performance on complex, multi‑step tasks. The dataset’s balanced domain distribution also supports building specialized QA systems or creative writing assistants that reason transparently.
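To confirm that distribution yourself, a quick check continuing from the `ds` object loaded above (the `domain` column name is an assumption; inspect `ds.column_names` first):

```python
# "domain" is an assumed column name; the released schema may differ.
df = ds.to_pandas()
print(df["domain"].value_counts(normalize=True))
```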
Project Ideas
- Fine‑tune a compact LLM to generate chain‑of‑thought solutions for coding interview questions using the 60% coding subset (see the sketch after this list).
- Create a domain‑specific QA bot for science and math by training on the science (15%) and math (10%) traces.
- Build a benchmark suite that measures an LLM's logical reasoning accuracy across the 5% logical‑question segment.
- Develop a prompt‑engineering tool that extracts step‑by‑step instruction patterns from the dataset for better instruction tuning.
- Generate synthetic instruction‑tuning data for creative writing assistants by adapting the 5% creative‑writing traces.
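As a starting point for the first idea, here is a minimal sketch that filters the coding subset and flattens each trace into a single SFT training string. The field names `domain`, `prompt`, `reasoning`, and `answer` are hypothetical placeholders; substitute the actual schema from the dataset card.

```python
from datasets import load_dataset

ds = load_dataset("ianncity/KIMI-K2.5-450000x", split="train")  # assumed Hub ID
coding = ds.filter(lambda ex: ex["domain"] == "coding")         # hypothetical field

def to_sft_text(ex):
    # Concatenate instruction, chain-of-thought, and final answer into one
    # training string; "prompt"/"reasoning"/"answer" are assumed field names.
    return {
        "text": (
            f"### Instruction:\n{ex['prompt']}\n\n"
            f"### Reasoning:\n{ex['reasoning']}\n\n"
            f"### Answer:\n{ex['answer']}"
        )
    }

sft_data = coding.map(to_sft_text, remove_columns=coding.column_names)
```

The resulting `text` column can then be tokenized and fed to any standard supervised fine‑tuning loop, such as TRL's `SFTTrainer`.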