dataset April 29, 2026

GSM8K: 8K Grade‑School Math Problems Powering LLM Reasoning

The GSM8K dataset, released by OpenAI, contains 8,792 English‑language grade‑school math word problems split into a training set of 7,473 examples and a test set of 1,319 examples. Each entry provides a *question* and a natural‑language *answer* that walks through two to eight arithmetic steps, using calculator annotations (e.g., `<<48/2=24>>`) to mark intermediate computations. Two configurations are offered: the standard **main** config with plain solutions, and the **socratic** config, which adds intermediate Socratic sub‑questions to guide reasoning.
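To make the record layout concrete, here is a minimal sketch of working with one entry. The sample problem below is invented for illustration; real entries follow the same shape, with `<<...>>` calculator annotations in the solution and the final answer after a `####` marker.

```python
import re

# A made-up record in the GSM8K "main" layout: a question plus a
# natural-language solution with <<...>> calculator annotations and
# a final "#### <number>" line. (Invented example, not from the dataset.)
example = {
    "question": "Ava has 48 stickers and gives half to her brother. "
                "How many stickers does she have left?",
    "answer": "Ava gives away 48/2=<<48/2=24>>24 stickers, "
              "so she keeps 48-24=<<48-24=24>>24 stickers.\n#### 24",
}

def strip_calculator_annotations(answer: str) -> str:
    """Remove <<expr=result>> calculator annotations from a solution."""
    return re.sub(r"<<[^>]*>>", "", answer)

def final_answer(answer: str) -> str:
    """Extract the final numeric answer after the '####' marker."""
    return answer.split("####")[-1].strip()

print(strip_calculator_annotations(example["answer"]))
print(final_answer(example["answer"]))  # → 24
```

Stripping the annotations yields a clean human-readable explanation, while the `####` marker gives a machine-checkable final answer.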

Designed as a benchmark for multi‑step logical and arithmetic reasoning, GSM8K targets tasks in the text‑generation category, especially question answering and chain‑of‑thought prompting. The dataset is monolingual English, licensed under MIT, and distributed as Parquet files compatible with the 🤗 datasets library, pandas, polars, and mlcroissant. Its provenance is documented in the paper "Training Verifiers to Solve Math Word Problems" (arXiv:2110.14168) and the OpenAI blog on grade‑school math.

Because the answers are provided in full natural‑language form rather than abstract equations, GSM8K is particularly useful for evaluating how large language models generate step‑by‑step explanations and for fine‑tuning models to produce human‑readable reasoning traces. The Socratic variant further supports research on guided prompting and interactive tutoring scenarios.
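A common way to score such reasoning traces is to extract the last number in the model's generated explanation and compare it to the reference answer after the `####` marker. The sketch below assumes that convention; `grade` and `last_number` are hypothetical helpers, not part of any official harness, and the model output shown is invented.

```python
import re

def last_number(text: str):
    """Return the last integer or decimal appearing in the text, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def grade(model_output: str, reference_answer: str) -> bool:
    """Exact-match the model's final number against the gold answer
    that follows the '####' marker in a GSM8K reference solution."""
    gold = reference_answer.split("####")[-1].strip().replace(",", "")
    pred = last_number(model_output)
    return pred is not None and float(pred) == float(gold)

# Invented model output for illustration:
print(grade("She keeps 48 - 24 = 24 stickers. The answer is 24.",
            "She gives away 24, keeping 24.\n#### 24"))  # → True
```

Averaging `grade` over the 1,319 test examples gives the headline GSM8K accuracy figure commonly reported for LLMs.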

Project Ideas

  1. Fine‑tune a small LLM on the `main` config to improve its ability to generate step‑by‑step arithmetic solutions.
  2. Create an evaluation suite that measures a model's chain‑of‑thought accuracy by comparing its outputs on the GSM8K test split to the reference answers.
  3. Build an interactive tutoring chatbot that uses the `socratic` config to ask learners intermediate sub‑questions before revealing the final solution.
  4. Develop a data‑augmentation pipeline that rewrites GSM8K questions in varied phrasing while preserving the original solution steps.
  5. Benchmark different prompting strategies (few‑shot, zero‑shot, chain‑of‑thought) on GSM8K to analyze which yields the highest numeric accuracy.
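As a starting point for the prompting-strategy benchmark in idea 5, here is a minimal sketch of assembling a few-shot chain-of-thought prompt. The demonstration pair below is invented; in practice the demonstrations would be drawn from the `main` training split, and `build_fewshot_prompt` is a hypothetical helper.

```python
def build_fewshot_prompt(demos, question):
    """Assemble a few-shot chain-of-thought prompt: each demonstration
    shows a worked solution ending in '#### <answer>', followed by the
    new question the model should solve in the same style."""
    parts = []
    for q, a in demos:
        parts.append(f"Question: {q}\nAnswer: {a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

# Invented demonstration for illustration:
demos = [
    ("Tom has 3 apples and buys 5 more. How many does he have?",
     "Tom has 3 + 5 = 8 apples.\n#### 8"),
]
prompt = build_fewshot_prompt(
    demos, "Mia reads 12 pages a day for 4 days. How many pages does she read?")
print(prompt)
```

Swapping the number of demonstrations (zero-shot vs. few-shot) or removing the worked steps from the answers lets the same harness compare prompting strategies on equal footing.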