dataset · February 26, 2026

Common Corpus: A 2.3T-Token Open Multilingual Text Dataset

Common Corpus, released by PleIAs and a network of partners, is currently the largest openly licensed text collection, containing 2.27 trillion tokens across more than a dozen languages. The dataset aggregates books, newspapers, scientific articles, government and legal documents, source code, and semantic statements, all stored in Parquet files and accessible via the HuggingFace `datasets` library (with pandas, polars, and mlcroissant support). Its six curated sub‑collections—OpenCulture, OpenGovernment, OpenSource, OpenScience, OpenWeb, and Open Semantic—provide domain‑specific slices ranging from public‑domain literature to high‑quality open‑source code and structured knowledge from Wikidata.

The corpus is distinguished by three core principles: (1) **True openness** – every document is either uncopyrighted or released under a permissive license, allowing unrestricted commercial and non‑commercial use; (2) **Traceability** – each record includes provenance metadata such as original URL, licensing, creation date, and creator, enabling fine‑grained filtering; (3) **Multilingual diversity** – while English and French dominate, the dataset includes over 10 billion tokens for German, Spanish, Italian, Polish, Greek, and Latin, and more than 1 billion tokens for 33 additional languages, including Chinese, Japanese, Arabic, and Korean. The accompanying ICLR 2026 oral paper (arXiv:2410.22587) details the curation pipeline, OCR correction, toxicity filtering (using PleIAs' Celadon classifier), and PII removal (via Microsoft Presidio).
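The traceability metadata supports exactly this kind of fine-grained filtering. A minimal sketch, with fabricated records and a hypothetical license allow-list standing in for real corpus data:

```python
# Sketch: provenance-aware filtering using the per-record metadata fields
# described above (license, language, date). Field names follow the
# documented schema; the records and allow-list are made-up stand-ins.
records = [
    {"identifier": "a1", "license": "Public Domain", "language": "de", "date": "1887"},
    {"identifier": "b2", "license": "CC-BY-4.0", "language": "en", "date": "2019"},
    {"identifier": "c3", "license": "Proprietary", "language": "fr", "date": "2015"},
]

# Hypothetical allow-list; a real pipeline would enumerate every license
# it accepts for its intended use.
ALLOWED_LICENSES = {"Public Domain", "CC-BY-4.0"}

def keep(record):
    """Keep only records whose license is on the allow-list."""
    return record["license"] in ALLOWED_LICENSES

filtered = [r for r in records if keep(r)]
print([r["identifier"] for r in filtered])  # → ['a1', 'b2']
```

Because every record carries its own provenance, the same pattern extends to filtering by creation date, creator, or source URL without any external lookup.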

Common Corpus is positioned as a reference resource for training open‑source large language models that satisfy the Open Source Initiative’s Open Source AI Definition. Its rich metadata—identifier, collection, license, date, title, creator, language, word and token counts, and raw text—makes it suitable for a wide range of research and product development tasks, from multilingual language modeling and domain‑specific fine‑tuning to provenance‑aware data governance studies.
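The per-record token counts make simple corpus audits straightforward; a minimal sketch of aggregating tokens by language, with made-up records standing in for real ones:

```python
# Sketch: aggregating per-record token counts by language, the kind of
# corpus audit the metadata enables. Records are illustrative stand-ins;
# field names follow the documented schema.
from collections import Counter

records = [
    {"language": "en", "token_count": 1200},
    {"language": "fr", "token_count": 800},
    {"language": "en", "token_count": 300},
]

tokens_by_language = Counter()
for r in records:
    tokens_by_language[r["language"]] += r["token_count"]

print(dict(tokens_by_language))  # → {'en': 1500, 'fr': 800}
```

Grouping by `collection` or `license` instead of `language` gives the per-slice and per-license breakdowns a governance audit would report.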

Project Ideas

  1. Fine‑tune a multilingual LLM on the OpenGovernment slice to build a legally aware question‑answering assistant.
  2. Create a code generation model by training on the OpenSource collection, leveraging the high‑quality filtered GitHub code.
  3. Develop a historical OCR correction benchmark using the OpenCulture documents, comparing PleIAs OCR tools against baseline models.
  4. Build a cross‑lingual semantic search engine that indexes the Open Semantic statements and retrieves relevant passages across the dozens of languages in the corpus.
  5. Design a data‑governance dashboard that visualizes licensing and provenance metadata across the corpus to help organizations audit their training data pipelines.