March 27, 2026

AutoMathText-V2 Dataset Overview Report

AutoMathText-V2 is a curated collection of 52 premium data sources spanning web content, mathematics, code, reasoning, formal proofs, and bilingual translation. It aggregates over 1.5 trillion tokens, organized by domain and by quality tier (0-100% token ranges). The dataset undergoes a three‑tier deduplication process (exact, fuzzy via MinHash LSH, and semantic via GTE embeddings) that removes roughly 40% of the material as redundant. Quality is assessed with a Qwen2‑based regression model, and all text is cleaned with Ultimate Data Cleaner v7.5, which provides robust LaTeX and code protection. Contamination detection safeguards against leakage of evaluation test sets (e.g., GSM8K). While most domains include full text, the RefineCode domain supplies only metadata; the actual code can be retrieved from AWS S3 using the provided blob IDs. The dataset is accessible via Hugging Face's `load_dataset`, with configurations for high‑quality subsets (medium‑high, high, ultra) and for individual domains.
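The fuzzy stage of the deduplication pipeline described above can be sketched with a minimal pure-Python MinHash LSH: documents are shingled into word n-grams, compressed into min-hash signatures, and banded so that near-duplicates collide in at least one bucket. The shingle size, hash count, and band count below are illustrative choices, not the dataset's actual parameters:

```python
import hashlib
from itertools import combinations

def shingles(text, k=3):
    """Split text into overlapping word k-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """Approximate a set by num_hashes values: for each seed, keep the
    smallest hash of any shingle under that seeded hash function."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(num_hashes)
    ]

def lsh_candidate_pairs(signatures, bands=16):
    """Split each signature into bands; any two documents sharing an
    identical band land in the same bucket and become a candidate pair."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river",
    "b": "the quick brown fox jumps over the lazy dog near the water",
    "c": "gradient descent minimizes a loss function by iterative updates",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = lsh_candidate_pairs(sigs)  # near-duplicates "a" and "b" should collide
```

Candidate pairs would then be verified (e.g., by estimated Jaccard similarity from the signatures) before removal; the semantic tier catches paraphrases that share few literal shingles.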

Project Ideas

  1. Use the high‑quality token subsets (90‑100%) for fine‑tuning large language models targeting math and code reasoning tasks.
  2. Combine the domain‑specific subsets (e.g., math_web, megamath) to create a multi‑modal training corpus that interleaves natural language, LaTeX, and code snippets.
  3. Leverage the RefineCode metadata to retrieve full source code from S3, enabling code‑centric pre‑training or retrieval‑augmented generation experiments.
  4. Apply the three‑tier deduplication pipeline as a reusable framework for cleaning and de‑duplicating other large web‑scale corpora.
  5. Develop benchmark suites using the percentile‑based quality splits to evaluate model performance across varying data quality levels.
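For idea 5, a percentile-based quality split can be sketched as follows; the quality scores and tier cutoffs here are hypothetical stand-ins for the dataset's precomputed token-percentile ranges, purely to illustrate the partitioning logic:

```python
def quality_tiers(scored_docs, cutoffs=(0.5, 0.9)):
    """Partition documents into tiers by quality-score percentile.
    cutoffs=(0.5, 0.9) yields low (0-50%), medium (50-90%), high (90-100%)."""
    ranked = sorted(scored_docs, key=lambda d: d["score"])
    n = len(ranked)
    tiers = {"low": [], "medium": [], "high": []}
    for rank, doc in enumerate(ranked):
        pct = rank / n  # fraction of documents scoring below this one
        if pct < cutoffs[0]:
            tiers["low"].append(doc["id"])
        elif pct < cutoffs[1]:
            tiers["medium"].append(doc["id"])
        else:
            tiers["high"].append(doc["id"])
    return tiers

# Hypothetical regression-model scores for ten documents.
docs = [{"id": i, "score": s}
        for i, s in enumerate([0.1, 0.9, 0.4, 0.7, 0.95, 0.2, 0.6, 0.3, 0.8, 0.5])]
tiers = quality_tiers(docs)  # tiers["high"] holds the top-10% document ids
```

Training identical models on each tier and comparing downstream benchmark scores would isolate the contribution of data quality from data quantity.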