dataset May 13, 2026

FineWeb‑Edu: 1.3 T Tokens of Curated Educational Web Data

FineWeb‑Edu is an English‑language dataset of 1.3 trillion tokens drawn from the public CommonCrawl web and filtered to retain only high‑quality educational content. It was created by HuggingFaceFW and released alongside an educational quality classifier trained on synthetic annotations generated by Llama 3‑70B‑Instruct. The classifier scores each page on a 0‑5 scale, and the authors kept pages scoring at or above a threshold of 3, which selects content aimed at grade‑school and middle‑school knowledge levels. Models trained on the resulting collection outperform models trained on the original FineWeb on standard benchmarks.
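The thresholding step described above can be illustrated with a minimal sketch. It assumes the classifier emits a raw regression score that is clamped to the 0‑5 range and rounded before being compared with the threshold; whether the authors compare the raw or the rounded score is an assumption here, and the function names and example scores are hypothetical.

```python
def to_int_score(raw_score: float) -> int:
    """Clamp a raw score to [0, 5] and round it to the nearest integer."""
    return int(round(max(0.0, min(raw_score, 5.0))))


def is_educational(raw_score: float, threshold: int = 3) -> bool:
    """Return True when the rounded score meets the threshold (3 in the text)."""
    return to_int_score(raw_score) >= threshold


# Made-up classifier outputs for four pages, for illustration only.
raw_scores = {"page_a": 0.4, "page_b": 2.7, "page_c": 3.9, "page_d": 6.2}

# Keep only the pages whose rounded score is at least 3.
kept = [page for page, score in raw_scores.items() if is_educational(score)]
print(kept)
```

With these invented scores, `page_a` is dropped and the other three pages are kept; `page_d` is kept because its out-of-range score is clamped to 5 first.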

The dataset is provided in Parquet format and can be accessed via the `datasets` library, `datatrove`, or directly with a Parquet reader. It includes a default configuration containing all filtered dumps from 2013 onward, as well as snapshot‑specific configs (e.g., `CC-MAIN-2025-05`) and smaller sampled versions (`sample-10BT`, `sample-100BT`, `sample-350BT`) that contain roughly 10 B, 100 B, and 350 B GPT‑2‑equivalent tokens respectively. The README details loading instructions, sample pipelines using `datatrove`, and links to the associated paper (arXiv:2406.17557) and classifier repository.
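As a rough sketch of the `datasets` route, the snippet below opens one of the sampled configs in streaming mode and previews a few documents. The Hub ID `HuggingFaceFW/fineweb-edu` and the `text` field name are assumptions not spelled out in the text above, and the helper names are hypothetical. The import is deferred so the preview helper can be tried offline on in-memory rows.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

REPO_ID = "HuggingFaceFW/fineweb-edu"  # assumed Hub repository ID


def stream_config(config_name: str = "sample-10BT"):
    """Open one config in streaming mode (needs `pip install datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so `preview` works offline

    # streaming=True iterates over the Parquet shards without downloading them all first
    return load_dataset(REPO_ID, name=config_name, split="train", streaming=True)


def preview(records: Iterable[Dict[str, Any]], n: int = 3, width: int = 80) -> List[str]:
    """Return the first `width` characters of the assumed `text` field for the first `n` records."""
    return [rec["text"][:width] for rec in islice(records, n)]


# Offline demonstration with made-up rows shaped like dataset records.
fake_rows = [
    {"text": "Photosynthesis converts light into chemical energy."},
    {"text": "A prime number has exactly two divisors."},
]
print(preview(fake_rows, n=1, width=14))
```

Against the real dataset, the equivalent call would be `preview(stream_config("sample-10BT"))`; a snapshot‑specific config name can be passed in the same way.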

Since its initial v1.0.0 release in June 2024, FineWeb‑Edu has been updated several times, most recently with v1.4.0 (July 2025), which added six new snapshots covering January to June 2025. A deduplicated version is also available in the SmolLM‑Corpus; the authors report that deduplication does not noticeably affect downstream model performance. The dataset, classifier code, and inference scripts are all open‑source, so researchers can reproduce the curation pipeline or adapt the classifier for other domain‑specific filters.

Project Ideas

  1. Fine‑tune a small LLM on FineWeb‑Edu to build a grade‑school level question‑answer assistant.
  2. Create an educational content recommender that matches curriculum topics with relevant web passages from the dataset.
  3. Benchmark existing LLMs on educational tasks by evaluating them against the FineWeb‑Edu filtered corpus.
  4. Develop a searchable educational knowledge base using the Parquet files and Polars for fast retrieval.
  5. Generate curriculum‑aligned reading passages or practice exercises by sampling and prompting the dataset.
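
For the sampling step in idea 5, one possible approach is reservoir sampling, which draws a uniform random sample from a stream without knowing its length or holding it in memory. The sketch below is only an illustration: the prompt template, the `build_prompt` helper, and the stand‑in passages are hypothetical, and in practice the stream would be an iterator over dataset records.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Draw up to k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def build_prompt(passage: str, grade: str = "middle school") -> str:
    """Wrap a passage in a hypothetical exercise-generation prompt."""
    return (
        f"Write three practice questions for a {grade} student "
        f"based on the following passage:\n\n{passage}"
    )


# Stand-in passages; a real run would stream documents from the dataset.
passages = (f"Passage number {n} about a school topic." for n in range(1000))
prompts = [build_prompt(p) for p in reservoir_sample(passages, k=5, seed=0)]
print(len(prompts))
```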