dataset June 04, 2026

UltraData‑SFT‑2605: 15M+ High‑Quality SFT Samples for MiniCPM5‑1B and Beyond

UltraData‑SFT‑2605, released by OpenBMB, is the full core‑domain supervised fine‑tuning (SFT) dataset used to post‑train MiniCPM5‑1B‑SFT, the first 1‑billion‑parameter model in the MiniCPM5 series. The dataset follows the UltraData L0‑L4 tiered data management framework and contains over 15 million curated samples across seven capability domains: Math, Code, Knowledge, Chinese‑general, Instruction‑following (IF), multilingual Math, and multilingual Knowledge. Each domain provides two splits, **Deep Thinking** (`think`) and **Non‑thinking** (`no_think`), which teach models multi‑step reasoning or planning and rapid response behavior, respectively.
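As a minimal sketch of the layout described above, the seven configs and two splits form a 14‑entry grid. The config names `Knowledge` and `Multi-lang-Math` are assumptions inferred by analogy with the names that appear elsewhere in this report (`Math`, `Code`, `Chinese-general`, `IF`, `Multi-lang-Knowledge`), so check the dataset card before relying on them. `load_pair` is a hypothetical helper, not part of any library.

```python
from itertools import product

REPO_ID = "openbmb/UltraData-SFT-2605"

# "Knowledge" and "Multi-lang-Math" are ASSUMED names (not confirmed by this
# report); the other five config names are quoted in the report itself.
CONFIGS = [
    "Math",
    "Code",
    "Knowledge",
    "Chinese-general",
    "IF",
    "Multi-lang-Math",
    "Multi-lang-Knowledge",
]
SPLITS = ["think", "no_think"]


def config_split_grid():
    """Return every (config, split) pair as a list of tuples."""
    return list(product(CONFIGS, SPLITS))


def load_pair(config, split, streaming=True):
    """Hypothetical helper: load one config/split pair from the Hub.

    Streaming avoids downloading millions of samples up front.
    """
    # Imported lazily so the grid above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split, streaming=streaming)


if __name__ == "__main__":
    for config, split in config_split_grid():
        print(f"{config}/{split}")
```

Calling `load_pair("Math", "think")` would then return a streaming dataset that can be iterated sample by sample, assuming network access and that the config name exists.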

The data pipeline has six stages: open‑source query filtering, internal query construction, high‑quality L3 pre‑training format filtering, answer quality checks, single‑data validation with token‑budgeted SFT runs, and benchmark decontamination. Together these stages are designed to keep samples clean and correct and to remove overlap with benchmark test sets. The final release offers config‑specific splits that can be loaded directly via the Hugging Face `datasets` library, e.g., `load_dataset("openbmb/UltraData-SFT-2605", "Math", split="think")`.
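This report does not say how the benchmark decontamination stage is implemented. One common generic approach is word n‑gram overlap: a training sample is dropped if it shares any long n‑gram with a benchmark item. The sketch below illustrates that idea only. The function names, the n‑gram length of 8, and the toy strings are all assumptions, not the OpenBMB pipeline.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_benchmark_index(benchmark_texts, n=8):
    """Collect every n-gram that occurs in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(sample_text, benchmark_index, n=8):
    """True if the sample shares at least one n-gram with the benchmark."""
    return not ngrams(sample_text, n).isdisjoint(benchmark_index)


def decontaminate(samples, benchmark_texts, n=8):
    """Keep only samples with no n-gram overlap against the benchmark."""
    index = build_benchmark_index(benchmark_texts, n)
    return [s for s in samples if not is_contaminated(s, index, n)]


if __name__ == "__main__":
    # Toy data, invented for illustration.
    benchmark = [
        "A train leaves the station at 3 pm traveling at 60 miles per hour toward the city"
    ]
    samples = [
        "Question: a train leaves the station at 3 pm traveling at 60 miles per hour. When does it arrive?",
        "Compute the derivative of x squared plus three x with respect to x.",
    ]
    for kept in decontaminate(samples, benchmark):
        print(kept)
```

Texts shorter than `n` words produce no n‑grams and are therefore never flagged, so a real pipeline would likely pair a check like this with other filters.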

With 2.5M deep‑thinking and 3M non‑thinking math samples, 2.8M deep‑thinking and 3M non‑thinking code samples, and comparable volumes for knowledge and instruction‑following, the dataset is a useful resource for training compact LLMs, conducting domain‑specific fine‑tuning, or researching the impact of reasoning‑rich versus fast‑answer data. Its recent release, download count, and direct link to the MiniCPM5‑1B model have drawn attention in the LLM community.
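The Math and Code figures quoted above can be tallied to see how much of the release they account for. This is plain arithmetic on the numbers in this report. Because the total is only stated as "over 15 million", 15,000,000 is treated as a lower bound, and the computed remainder is a floor rather than an exact count.

```python
# Sample counts quoted in this report.
COUNTS = {
    ("Math", "think"): 2_500_000,
    ("Math", "no_think"): 3_000_000,
    ("Code", "think"): 2_800_000,
    ("Code", "no_think"): 3_000_000,
}

# The report says "over 15 million", so this is a lower bound, not a total.
TOTAL_LOWER_BOUND = 15_000_000


def domain_total(domain):
    """Sum the think and no_think counts for one domain."""
    return sum(n for (d, _split), n in COUNTS.items() if d == domain)


def remaining_lower_bound():
    """Minimum number of samples left for the other five domains."""
    return TOTAL_LOWER_BOUND - sum(COUNTS.values())


if __name__ == "__main__":
    print("Math:", domain_total("Math"))
    print("Code:", domain_total("Code"))
    print("Other domains (at least):", remaining_lower_bound())
```

Math and Code together come to 11.3M samples, which leaves at least 3.7M for the remaining five domains under the stated lower bound.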

Project Ideas

  1. Fine‑tune a 500M‑parameter Chinese LLM on the `Chinese-general` config to improve conversational fluency and reasoning in Mandarin.
  2. Create a comparative study of deep‑thinking vs. non‑thinking data ratios by training two MiniCPM5‑style models on different mixes of the `Math` think/no_think splits.
  3. Build a code‑assistant prototype by fine‑tuning on the `Code` think split, then evaluate on real‑world debugging tasks.
  4. Develop a multilingual factual‑QA system using the `Multi-lang-Knowledge` non‑thinking split to cover English and Chinese knowledge queries.
  5. Design an instruction‑following benchmark by extracting samples from the `IF` config and testing how well small LLMs adhere to multi‑constraint prompts.
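
For the data‑ratio study in idea 2, one way to build the mixes is to fix a total sample budget and vary the fraction drawn from the think split. The sketch below works on plain Python lists standing in for the two splits. `mix_by_ratio`, the budget, and the ratios are hypothetical choices for illustration. With the real dataset, `datasets.interleave_datasets` with its `probabilities` argument is a probabilistic alternative.

```python
import random


def mix_by_ratio(think, no_think, think_ratio, budget, seed=0):
    """Draw `budget` samples, with a `think_ratio` fraction from `think`.

    Sampling is without replacement and seeded for reproducibility.
    """
    if not 0.0 <= think_ratio <= 1.0:
        raise ValueError("think_ratio must be between 0 and 1")
    n_think = round(budget * think_ratio)
    n_no_think = budget - n_think
    if n_think > len(think) or n_no_think > len(no_think):
        raise ValueError("budget exceeds the available samples")

    rng = random.Random(seed)
    mixed = rng.sample(think, n_think) + rng.sample(no_think, n_no_think)
    rng.shuffle(mixed)  # avoid all think samples appearing first
    return mixed


if __name__ == "__main__":
    # Toy stand-ins for the Math think / no_think splits.
    think = [{"split": "think", "id": i} for i in range(1000)]
    no_think = [{"split": "no_think", "id": i} for i in range(1000)]

    for ratio in (0.25, 0.75):
        mix = mix_by_ratio(think, no_think, ratio, budget=400)
        n_think = sum(1 for s in mix if s["split"] == "think")
        print(f"ratio={ratio}: {n_think} think, {len(mix) - n_think} no_think")
```

Holding the budget fixed across runs keeps the comparison about the ratio itself rather than about total training data. Since think samples are likely longer, a token budget may be a fairer control than a sample count.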