Dataset · March 17, 2026

FinePhrase: 1.35B Synthetic Samples for FAQ, Math, Tables & Tutorials

FinePhrase is a large-scale synthetic dataset generated with DataTrove using the SmolLM2-1.7B-Instruct model. It rewrites source documents from the FineWeb-Edu corpus into four instructional formats (FAQ, mathematical word problems, structured tables with QA, and step-by-step tutorials) through carefully crafted prompts. The dataset contains over 1.35 billion generated samples (≈486 billion completion tokens) across its four configurations, making it one of the largest publicly available instruction-style corpora.

The dataset is organized into separate configs (faq, math, table, tutorial), each providing a `text` field with the original document and a `rollout_results` list containing the model-generated transformations. It is intended for tasks such as text generation, language modeling, and instruction fine-tuning, and can be loaded directly via the `datasets` library. Because the content is fully machine-generated and released under the permissive ODC-BY license, it can be used broadly for research and development, subject to attribution.
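Loading a config and pulling the generated text out of a row can be sketched as follows. This is a minimal sketch under two assumptions not confirmed by the description above: the exact Hub repository id, and that each `rollout_results` entry stores its generation under a key such as `"completion"`; check the actual schema before relying on either.

```python
# Sketch: reading one FinePhrase config and extracting the generations.
# Assumptions: the repo id "HuggingFaceTB/FinePhrase" and the "completion"
# key inside rollout_results entries are both hypothetical placeholders.

def extract_generations(record):
    """Return the source document and the model-generated rewrites of one row."""
    source = record["text"]
    rollouts = record.get("rollout_results", [])
    generations = [r.get("completion", "") for r in rollouts]
    return source, generations

# Streaming load (requires the `datasets` library; repo id is an assumption):
# from datasets import load_dataset
# ds = load_dataset("HuggingFaceTB/FinePhrase", "faq", split="train", streaming=True)
# for record in ds:
#     source, generations = extract_generations(record)
#     break

# Local demo with a mock record shaped like the documented fields:
sample = {
    "text": "Photosynthesis converts light energy into chemical energy.",
    "rollout_results": [{"completion": "Q: What does photosynthesis do? ..."}],
}
src, gens = extract_generations(sample)
print(len(gens))  # → 1
```

Streaming mode avoids downloading hundreds of billions of tokens up front, which matters at this dataset's scale.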

FinePhrase is trending due to its scale, the diversity of transformation tasks it supports, and its utility for training or evaluating models that need to perform complex text re‑formatting or instructional generation. Researchers can leverage the dataset to improve model abilities in generating FAQs, math problems, tabular data, or tutorials, while also studying the impact of large‑scale synthetic data on downstream performance.

Project Ideas

  1. Fine‑tune a language model on the `faq` config to create a specialized FAQ‑generation assistant.
  2. Train a model using the `math` config to automatically convert explanatory text into multi‑step math word problems with solutions.
  3. Develop a table‑extraction pipeline that learns from the `table` config to generate markdown tables and corresponding QA pairs from raw documents.
  4. Build a tutorial‑authoring system by fine‑tuning on the `tutorial` config, enabling step‑by‑step instructional content creation.
  5. Benchmark instruction‑following performance across the four prompt families to analyze how models handle different transformation styles.
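As a starting point for idea 1, rows can be flattened into prompt/response pairs for supervised fine-tuning. This is a minimal sketch, not the dataset's prescribed recipe: the instruction template and the `"completion"` key inside `rollout_results` are assumptions about the schema.

```python
# Sketch: flatten FinePhrase `faq` rows into prompt/response pairs for SFT.
# The instruction template and the "completion" key are assumptions;
# verify against the actual schema before training.

FAQ_INSTRUCTION = "Rewrite the following document as an FAQ:\n\n"

def to_sft_pairs(record):
    """Produce one training pair per generated rollout in the record."""
    prompt = FAQ_INSTRUCTION + record["text"]
    return [
        {"prompt": prompt, "response": rollout.get("completion", "")}
        for rollout in record.get("rollout_results", [])
    ]

# Demo with a mock record shaped like the documented fields:
record = {
    "text": "The water cycle moves water between oceans, air, and land.",
    "rollout_results": [
        {"completion": "Q: What is the water cycle?\nA: ..."},
        {"completion": "Q: Where does water move?\nA: ..."},
    ],
}
pairs = to_sft_pairs(record)
print(len(pairs))  # → 2
```

The same pattern extends to the other configs by swapping the instruction text (e.g. "Convert the following document into a step-by-step tutorial:").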