GLM‑5.1‑Reasoning‑1M‑Cleaned: A Curated Corpus for Chain‑of‑Thought Fine‑Tuning
Jackrong/GLM-5.1-Reasoning-1M-Cleaned ↗
The **GLM‑5.1‑Reasoning‑1M‑Cleaned** dataset, released by user **Jackrong**, is a cleaned and reformatted derivative of the original *Kassadin88/GLM-5.1-1000000x* collection. It contains 746,321 high‑quality single‑turn reasoning examples distilled from the GLM‑5.1 teacher model, organized into four exact subsets—*main*, *PHD‑Science*, *Multilingual‑STEM*, and *Math*. Each record follows a unified SFT‑ready schema with explicit `conversations`, `input`, `output`, `domain`, and `meta` fields, making it straightforward to plug into Hugging Face training pipelines.
The dataset targets text‑generation and question‑answering tasks, emphasizing chain‑of‑thought reasoning, instruction tuning, and multilingual STEM content in English and Chinese. The cleaning pipeline removed incomplete answers, duplicated paragraphs, refusal‑style replies, and unparseable outputs, resulting in a more consistent format where the model response is wrapped in `<think>...</think>` tags followed by the final answer. Token statistics show substantial reasoning depth, especially in the *Math* subset where median output tokens exceed 24 000.
With a total size under 20 GB and support for the `datasets`, `pandas`, `polars`, and `mlcroissant` libraries, users can load any subset via `load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "subset_name")`. The dataset is licensed under Apache‑2.0, enabling commercial and research use, and it is suitable for fine‑tuning large language models, training chain‑of‑thought prompting strategies, or benchmarking reasoning capabilities across multilingual and scientific domains.
The release also provides clear provenance and citation guidance, encouraging users to acknowledge both the original *Kassadin88* dataset and this cleaned derivative. Its emphasis on well‑structured reasoning traces makes it a valuable resource for anyone looking to improve model interpretability, answer correctness, or to develop specialized assistants for STEM and academic contexts.
Project Ideas
- Fine‑tune a GLM‑5.1 model on the *Math* subset to create a high‑precision mathematical reasoning assistant.
- Build a multilingual STEM Q&A chatbot that leverages the *Multilingual‑STEM* subset for English and Chinese explanations.
- Train a chain‑of‑thought prompting model using the *main* subset to improve answer justification in open‑domain question answering.
- Create a benchmark suite that evaluates reasoning depth by comparing model outputs against the `<think>` tagged traces across all subsets.
- Distill the cleaned dataset into a smaller, instruction‑tuned model for lightweight on‑device reasoning tasks.