Dataset · April 06, 2026

FineWeb: 18.5 T Tokens of High‑Quality English Web Text Now Open

The FineWeb dataset, released by HuggingFaceFW, provides over 18.5 trillion tokens of cleaned and deduplicated English web data sourced from CommonCrawl. Licensed under ODC‑By 1.0, it is intended for text generation and is positioned as an open improvement over RefinedWeb, reporting superior downstream performance on a suite of benchmark tasks. The dataset spans crawls from 2013 through the latest 2025 dumps and is organized into many configurations: the full dataset, per‑dump snapshots (e.g., CC‑MAIN‑2025‑05), and sampled subsets of roughly 10 B, 100 B, and 350 B GPT‑2 tokens for more manageable experimentation.

FineWeb is distributed as a collection of Parquet files and can be accessed via the `datasets` library, `huggingface_hub`, or directly through the large‑scale data‑processing library Datatrove. The README includes ready‑to‑run Python snippets showing how to stream documents, filter them with custom lambda functions, and write results to JSONL. In addition to the raw data, the release bundles the full processing code (a Datatrove script) and a set of small ablation models trained with Nanotron, together with evaluation results and LightEval task definitions to enable reproducible research.
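As a minimal sketch of the streaming-and-filtering workflow described above (assuming the `HuggingFaceFW/fineweb` repository name and the `sample-10BT` config, and with an illustrative predicate whose length and language-score thresholds are not values from the FineWeb pipeline):

```python
import json


def keep_doc(doc: dict) -> bool:
    """Example predicate: keep longer, confidently-English documents.

    The 500-character minimum and 0.65 language-score cutoff are
    illustrative choices, not thresholds used by FineWeb itself.
    """
    return len(doc.get("text", "")) >= 500 and doc.get("language_score", 0.0) >= 0.65


def export_sample(out_path: str = "fineweb_filtered.jsonl", limit: int = 100) -> None:
    """Stream FineWeb, filter it, and write matching documents to JSONL.

    Requires the `datasets` library and network access, so the import is
    kept local to this function.
    """
    from datasets import load_dataset

    # streaming=True iterates over the remote Parquet shards without
    # downloading the full dataset first.
    ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

    with open(out_path, "w", encoding="utf-8") as f:
        for i, doc in enumerate(ds.filter(keep_doc)):
            if i >= limit:  # stop after `limit` matching documents
                break
            f.write(json.dumps({"text": doc["text"], "url": doc.get("url")}) + "\n")
```

Because `keep_doc` is a plain function on a document dict, the same predicate can be swapped into other tooling (for example a Datatrove filter stage) without change.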

Since its initial v1.0.0 launch in April 2024, FineWeb has seen regular updates, most recently v1.4.0 (July 2025), which added six new 2025 snapshots. The smaller sample configs—`sample-10BT`, `sample-100BT`, and `sample-350BT`—make it feasible to experiment on modest hardware while still benefiting from the dataset's high‑quality filtering pipeline. The extensive documentation, changelog, and open codebase make FineWeb a valuable resource for anyone building, fine‑tuning, or evaluating large language models.

Project Ideas

  1. Fine‑tune a small language model on the `sample-10BT` subset to create a domain‑specific text generator for technical documentation.
  2. Build a custom data‑cleaning pipeline with Datatrove that re‑filters FineWeb for a specialized corpus (e.g., medical articles) and streams the results to JSONL.
  3. Use the provided LightEval task file to benchmark a newly trained model against FineWeb’s ablation models and other web datasets.
  4. Create a summarization benchmark by extracting the first paragraph of each FineWeb document and pairing it with a model‑generated summary.
  5. Develop a content‑moderation classifier that leverages FineWeb as background data to train a detector for personal or sensitive information.
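Idea 2 can be sketched as a Datatrove pipeline in the style of the README's snippets. Here the `hf://` path mirrors the form the FineWeb README uses for reading Parquet shards from the Hub, while the keyword list, document limit, and output directory are illustrative assumptions rather than a real medical classifier:

```python
# Illustrative keyword list; a real re-filtering job would use a proper classifier.
MEDICAL_TERMS = ("clinical", "diagnosis", "patient", "treatment")


def looks_medical(text: str) -> bool:
    """Crude keyword test for medical-style articles (illustrative only)."""
    lowered = text.lower()
    return any(term in lowered for term in MEDICAL_TERMS)


def build_medical_pipeline(output_dir: str = "fineweb-medical"):
    """Assemble a Datatrove pipeline; running it requires datatrove and network,
    so the imports are kept local to this function."""
    from datatrove.executor import LocalPipelineExecutor
    from datatrove.pipeline.filters import LambdaFilter
    from datatrove.pipeline.readers import ParquetReader
    from datatrove.pipeline.writers.jsonl import JsonlWriter

    return LocalPipelineExecutor(
        pipeline=[
            # Read FineWeb Parquet shards from the Hub; limit keeps the run small.
            ParquetReader("hf://datasets/HuggingFaceFW/fineweb/data", limit=10_000),
            # Datatrove passes Document objects with a .text attribute.
            LambdaFilter(lambda doc: looks_medical(doc.text)),
            # Stream the surviving documents out as JSONL.
            JsonlWriter(output_dir),
        ],
        tasks=4,
    )
```

Calling `build_medical_pipeline().run()` would execute the read–filter–write stages locally across four tasks; the same pipeline definition can be handed to Datatrove's Slurm executor for larger jobs.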