dataset June 27, 2026

Unlocking arXiv’s LaTeX Treasure: A Monthly Parquet Dataset for Researchers

The **arXiv LaTeX Source Dataset** (ID: `scholarweave/arxiv-latex`) offers the complete, pre‑parsed LaTeX source of every arXiv paper, aligned with the official metadata and stored in ready‑to‑query Parquet partitions. Each row contains identifiers, authors, titles, abstracts, version histories, and a `latex` field that aggregates all source files (`.tex`, `.bib`, `.sty`, etc.) into a single, readable string. The dataset holds between 1 M and 10 M records and is updated monthly, with a manifest that tracks file hashes, timestamps, and the original S3 tar archives each release was generated from.

Created by the community contributor **scholarweave**, this collection addresses two major pain points for large‑scale scholarly research: (1) the high egress fees for downloading arXiv’s bulk S3 bucket directly (>$450 for the full 5 TB), and (2) the heavy CPU load required to unpack, parse, and align nested tar‑gz archives with their JSON metadata. By ingesting the raw data inside AWS `us-east-1`, where transfer is free, and publishing the cleaned Parquet files on Hugging Face, the dataset lets researchers avoid both the network costs and the preprocessing pipeline.
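The egress figure can be sanity‑checked with a little arithmetic. The sketch below assumes a flat rate of $0.09 per GB, which is not stated in this report and is used here only to illustrate how a total above $450 for 5 TB might arise; `ASSUMED_RATE_USD_PER_GB` and `egress_cost_usd` are hypothetical names.

```python
# Rough egress estimate for a bulk download of the size quoted above.
# ASSUMPTION: a flat rate of $0.09 per GB; the report only gives the total.
ASSUMED_RATE_USD_PER_GB = 0.09


def egress_cost_usd(size_tb: float, gb_per_tb: int = 1000) -> float:
    """Return the estimated transfer cost in USD for `size_tb` terabytes."""
    return size_tb * gb_per_tb * ASSUMED_RATE_USD_PER_GB


decimal_estimate = egress_cost_usd(5)        # 5 TB counted as 5000 GB
binary_estimate = egress_cost_usd(5, 1024)   # 5 TiB counted as 5120 GiB
print(f"decimal: ${decimal_estimate:.2f}, binary: ${binary_estimate:.2f}")
```

Under that assumed rate, the decimal and binary readings of "5 TB" bracket the quoted figure, and request charges would come on top.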

The dataset is tagged for **text‑generation** and **feature‑extraction** tasks, making it suitable for training or fine‑tuning language models that understand scientific LaTeX, building retrieval or recommendation systems, and extracting structured information such as equations, citations, or author networks. It is compatible with the `datasets` library as well as Dask, Polars, and mlcroissant, offering flexible data‑processing options for Python‑centric workflows.
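As a minimal sketch of the `datasets` route, the rows can be streamed rather than downloaded in full. The split name `"train"` is an assumption (check the dataset card), and `stream_rows` and `preview` are hypothetical helpers written for this example, not part of the dataset.

```python
from itertools import islice


def stream_rows(repo_id: str = "scholarweave/arxiv-latex", split: str = "train"):
    """Return a lazily evaluated iterable over the dataset rows.

    ASSUMPTION: the split is called "train"; adjust if the dataset card differs.
    """
    # Imported lazily so the helper below works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def preview(rows, n: int = 3, width: int = 80):
    """Return the first `n` rows with the `latex` field cut to `width` characters."""
    out = []
    for row in islice(rows, n):
        short = dict(row)
        latex = short.get("latex") or ""  # tolerate a missing or empty field
        short["latex"] = latex[:width]
        out.append(short)
    return out
```

Calling `preview(stream_rows())` would then fetch only as much data as the first few rows require, which is a convenient way to inspect the actual column names before committing to a full download.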

The original arXiv license of each paper is preserved in the `license` column, and users must respect these individual licenses when redistributing or publishing derived works. Overall, the dataset provides a cost‑effective, scalable foundation for any project that needs large‑scale access to scholarly LaTeX content.
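One way to act on the `license` column is to filter rows before building anything redistributable. The sketch below assumes the column holds the license URL as arXiv publishes it (or nothing at all); the allowlist is purely illustrative and should be replaced with whatever licenses a given project can actually accept. `ALLOWED_LICENSE_MARKERS`, `has_allowed_license`, and `filter_by_license` are hypothetical names.

```python
# ASSUMPTION: `license` is a URL string such as
# "http://creativecommons.org/licenses/by/4.0/", or None when unset.
# Illustrative allowlist only; choose the licenses your use case permits.
ALLOWED_LICENSE_MARKERS = (
    "creativecommons.org/licenses/by/",
    "creativecommons.org/licenses/by-sa/",
    "creativecommons.org/publicdomain/zero/",
)


def has_allowed_license(row: dict) -> bool:
    """True if the row's license URL contains one of the allowlisted markers."""
    url = row.get("license") or ""
    return any(marker in url for marker in ALLOWED_LICENSE_MARKERS)


def filter_by_license(rows):
    """Lazily yield only the rows whose license is on the allowlist."""
    return (row for row in rows if has_allowed_license(row))
```

Rows with no license value are dropped by this filter, which errs on the side of caution.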

Project Ideas

  1. Fine‑tune a generative language model on the LaTeX source to produce draft scientific papers or sections in authentic arXiv style.
  2. Build a searchable index over the `latex` field to enable fast full‑text retrieval of equations and code snippets across millions of papers.
  3. Create a citation‑recommendation system that leverages abstracts, categories, and version histories to suggest relevant prior work.
  4. Train a model to extract and normalize mathematical expressions from the LaTeX source for downstream symbolic reasoning tasks.
  5. Analyze temporal trends in research topics by aggregating the `categories` and `update_date` columns to visualize emerging fields.
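
Ideas 4 and 5 can be prototyped with the standard library alone. The sketch below is a naive starting point, not a full LaTeX parser: it only finds display math in a few common environments and ignores comments, macros, and inline math. It also assumes that `categories` is a space‑separated string and that `update_date` is formatted as `YYYY-MM-DD`; both are assumptions to verify against the real data. `extract_display_math` and `category_counts_by_year` are hypothetical helpers.

```python
import re
from collections import Counter

# Display math in equation/align environments (starred or not) or \[ ... \].
# The lookbehind skips line breaks with spacing such as "\\[2pt]".
DISPLAY_MATH = re.compile(
    r"\\begin\{(equation\*?|align\*?)\}(.*?)\\end\{\1\}|(?<!\\)\\\[(.*?)\\\]",
    re.DOTALL,
)


def extract_display_math(latex: str) -> list[str]:
    """Return the bodies of display-math blocks found in a LaTeX string."""
    found = []
    for match in DISPLAY_MATH.finditer(latex):
        # Group 2 is set for environments, group 3 for \[ ... \].
        body = match.group(2) if match.group(2) is not None else match.group(3)
        found.append(body.strip())
    return found


def category_counts_by_year(rows) -> Counter:
    """Count papers per (year, category) pair.

    ASSUMPTION: `categories` is space-separated, `update_date` is "YYYY-MM-DD".
    """
    counts = Counter()
    for row in rows:
        year = row["update_date"][:4]
        for category in row["categories"].split():
            counts[(year, category)] += 1
    return counts
```

Applied to the `latex` field and to the metadata columns respectively, these two functions yield a list of equation bodies per paper and a counter that can be plotted as a per‑year category trend.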