dataset June 09, 2026

Ultra-FineWeb-L3: Massive Bilingual Synthetic QA & Style‑Rich Corpus for LLM Pre‑training

Ultra-FineWeb-L3, released by the openbmb team, is the L3 tier of the UltraData tiered data management framework. It refines the trillion-token Ultra-FineWeb web corpus through two synthesis steps, Q&A pair generation and multi-style rewriting, using MiniCPM4 and Qwen-3 models. The result is a synthetic dataset of over 400 B English tokens and 200 B Chinese tokens; the Chinese portion makes it the largest publicly available synthetic Chinese pre-training corpus to date. Four configurations are provided (English QA, English multi-style, Chinese QA, Chinese multi-style), each stored as Parquet files and accessible via the 🤗 datasets library (with optional Dask or Polars back-ends).
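As a rough illustration of how one of the four configurations might be read with the 🤗 datasets library, here is a minimal sketch. The repository ID, the configuration names, and the "train" split are assumptions made for illustration only; the dataset card lists the actual values. Streaming mode is used so that nothing has to be downloaded in full before the first record is read.

```python
from itertools import islice
from typing import Any, Dict, Iterator

# ASSUMPTION: repository ID and configuration names are placeholders,
# not confirmed by this report. Check the dataset card for the real ones.
REPO_ID = "openbmb/Ultra-FineWeb-L3"
CONFIGS = {
    ("en", "qa"): "en_qa",
    ("en", "multi_style"): "en_multi_style",
    ("zh", "qa"): "zh_qa",
    ("zh", "multi_style"): "zh_multi_style",
}


def config_name(language: str, kind: str) -> str:
    """Map a (language, kind) pair to a (hypothetical) configuration name."""
    try:
        return CONFIGS[(language, kind)]
    except KeyError:
        raise ValueError(f"unknown configuration: {language!r}, {kind!r}") from None


def stream_records(language: str, kind: str, limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Yield the first `limit` records of one configuration without a full download."""
    # Imported lazily so the helper above works even without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(
        REPO_ID,
        config_name(language, kind),
        split="train",   # ASSUMPTION: split name
        streaming=True,  # iterate over the Parquet shards lazily
    )
    yield from islice(dataset, limit)
```

Calling `stream_records("zh", "qa")` and printing each record's keys is a quick way to discover the real column names before writing any training code.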

The dataset targets text‑generation pre‑training for large language models, especially the decay phase of MiniCPM5‑1B. By converting declarative web text into structured "original text + multiple Q&A pairs" and rewriting the same content in diverse styles (encyclopedia, textbook, blog, abstract), Ultra-FineWeb-L3 supplies richer reasoning signals and stylistic diversity than raw web data. Tags such as "llm", "pretraining", "data‑synthesis", and "high-quality" reflect its intended use for improving general knowledge, reasoning, and style robustness in LLMs.
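To make the "original text + multiple Q&A pairs" structure concrete, the sketch below joins a passage and its generated Q&A pairs into a single pre-training sample. The `QAPair` type, the `format_qa_sample` helper, and the "Question:" / "Answer:" layout are hypothetical choices for illustration; the report does not specify the dataset's actual schema or template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """One synthesized question with its answer (hypothetical field names)."""
    question: str
    answer: str


def format_qa_sample(original_text: str, qa_pairs: List[QAPair]) -> str:
    """Concatenate the source passage and its Q&A pairs into one training text.

    The passage comes first, followed by each pair as a
    "Question: ... / Answer: ..." block, separated by blank lines.
    """
    parts = [original_text.strip()]
    for pair in qa_pairs:
        parts.append(
            f"Question: {pair.question.strip()}\nAnswer: {pair.answer.strip()}"
        )
    return "\n\n".join(parts)


if __name__ == "__main__":
    passage = "Water boils at 100 C at sea level."
    pairs = [QAPair("At what temperature does water boil at sea level?", "100 C")]
    print(format_qa_sample(passage, pairs))
```

Keeping the passage in front of the questions means the model sees the declarative source before the derived Q&A, which matches the ordering the description suggests.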

Ultra-FineWeb-L3 is licensed under Apache-2.0 and is linked to several related resources, including the Ultra-FineWeb classifier for data filtering and the UltraData platform that documents the L0-L4 tiered framework. Researchers can cite the associated arXiv reports (2505.05427 and 2602.09003) and the dataset’s own citation entry. The dataset’s popularity is evident from its 58 k downloads, 277 likes, and a trending score of 34 shortly after release.

Project Ideas

Fine‑tune a bilingual LLM on the English and Chinese QA configurations to improve zero‑shot question answering across both languages.
Train a style‑transfer model using the multi‑style configurations, enabling conversion of arbitrary text into encyclopedia, textbook, blog, or abstract styles.
Benchmark the impact of L3-refined data by pre-training two identical LLMs, one on Ultra-FineWeb-L3 and one on the original Ultra-FineWeb, then comparing reasoning and style diversity on downstream tasks.
Build a streaming data pipeline with Dask or Polars to efficiently feed the massive Parquet files into large‑scale LLM training clusters.
Create a multilingual knowledge‑graph extraction system that leverages the structured Q&A pairs to populate entity‑relation triples for downstream retrieval‑augmented generation.
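
For the bilingual fine-tuning idea, one practical question is how to mix the English and Chinese streams. The sketch below derives sampling weights proportional to the approximate token counts reported for the dataset (over 400 B English, 200 B Chinese) and interleaves two record streams accordingly. Proportional mixing is only one possible policy and is an assumption here, not the authors' recipe; the function names are hypothetical.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple

# Approximate token counts from the dataset description (treated as lower bounds).
TOKEN_COUNTS = {"en": 400e9, "zh": 200e9}


def sampling_weights(token_counts: Dict[str, float]) -> Dict[str, float]:
    """Return per-language sampling probabilities proportional to token counts."""
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")
    return {lang: count / total for lang, count in token_counts.items()}


def interleave(
    streams: Dict[str, Iterable[str]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Randomly interleave streams by weight until every stream is exhausted."""
    rng = random.Random(seed)
    iterators = {lang: iter(stream) for lang, stream in streams.items()}
    while iterators:
        langs = list(iterators)
        lang = rng.choices(langs, weights=[weights[l] for l in langs], k=1)[0]
        try:
            yield lang, next(iterators[lang])
        except StopIteration:
            # This language has run out; keep drawing from the remaining ones.
            del iterators[lang]


if __name__ == "__main__":
    weights = sampling_weights(TOKEN_COUNTS)
    toy_streams = {"en": ["a", "b", "c", "d"], "zh": ["x", "y"]}
    for lang, text in interleave(toy_streams, weights):
        print(lang, text)
```

With real data, the toy lists would be replaced by streamed records from the English and Chinese QA configurations; the 🤗 datasets library also offers `interleave_datasets` with a `probabilities` argument for the same purpose.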
