Wikimedia Structured Wikipedia: Ready‑to‑Query Multilingual Knowledge in Parquet
wikimedia/structured-wikipedia
The Wikimedia Structured Wikipedia dataset provides pre‑parsed English and French Wikipedia articles in a unified Parquet format, totalling 44.42 GiB and over 10 million rows. Built by the Wikimedia Enterprise team, the dataset captures article titles, identifiers, URLs, abstracts, descriptions, main images, infoboxes, sections, tables, references, citations, and credibility signals such as `referenceneed` and `referencerisk`. Each article also carries its Wikidata QID, so records can be joined against external knowledge graphs.
Designed for analytical workloads, the dataset can be queried directly with DuckDB, pandas, Polars, Spark, or Dask without additional preprocessing. The pinned schema keeps the data compatible across these tools, and JSON‑encoded nested fields (e.g., `sections`, `infoboxes`, `tables`) can be decoded on the fly. This makes the data well suited to tasks like retrieval‑augmented generation, fine‑tuning language models, building knowledge bases, and benchmarking information extraction pipelines.
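As a minimal sketch of what decoding the nested fields on the fly could look like, the snippet below parses JSON‑encoded string columns in a single row using only the Python standard library. The helper `decode_nested`, the sample row, and the keys inside the JSON payloads are hypothetical and chosen for illustration; the actual column contents should be checked against the dataset's published schema.

```python
import json

# Columns the text names as JSON-encoded; adjust if the real schema differs.
NESTED_FIELDS = ("sections", "infoboxes", "tables")


def decode_nested(row: dict) -> dict:
    """Return a copy of `row` with JSON-encoded string fields parsed."""
    decoded = dict(row)
    for field in NESTED_FIELDS:
        value = decoded.get(field)
        if isinstance(value, str):
            try:
                decoded[field] = json.loads(value)
            except json.JSONDecodeError:
                # Leave malformed payloads as-is rather than failing the whole row.
                pass
    return decoded


# Hypothetical row: the keys inside the JSON payloads are invented for illustration.
sample_row = {
    "name": "Example article",
    "sections": '[{"name": "History", "text": "Some text."}]',
    "infoboxes": "[]",
    "tables": None,
}

decoded_row = decode_nested(sample_row)
print(decoded_row["sections"][0]["name"])
```

The same function could be applied per record after loading rows with any of the tools listed above; missing (`None`) fields pass through unchanged, and the original row is not mutated.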
The release is notable because it moves beyond raw wiki markup to a structured, machine‑readable format that includes parsed tables and references, features that were previously difficult to extract at scale. Its coverage of two languages (English and French) and its open CC‑BY‑SA‑4.0 license have attracted attention from the AI community seeking high‑quality, openly licensed knowledge sources. The dataset is in beta and invites feedback, positioning it as a living resource that is expected to expand to more languages and richer annotations over time.
Project Ideas
- Create a searchable knowledge‑graph service that links Wikipedia articles via their Wikidata QIDs and infobox fields for rapid entity lookup.
- Develop a credibility‑analysis dashboard that visualises `referenceneed` and `referencerisk` scores across topics to identify under‑sourced content.
- Build a table‑query engine that extracts structured data from Wikipedia tables and allows SQL‑style queries using DuckDB or Polars.
- Fine‑tune a retrieval‑augmented generation model that answers user questions by pulling relevant article sections, tables, and references from the dataset.
- Assemble a multilingual summarisation benchmark by pairing article abstracts with their one‑sentence descriptions for English and French.
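To make one of these ideas concrete, here is a rough sketch of the summarisation‑benchmark idea: pairing each article's abstract with its short description and grouping the pairs by language. The field names `abstract`, `description`, and `lang`, the helper `build_pairs`, and the sample rows are all assumptions made for illustration; they are not taken from the dataset's schema.

```python
from collections import defaultdict


def build_pairs(rows):
    """Group (abstract, description) pairs by language, skipping incomplete rows."""
    pairs = defaultdict(list)
    for row in rows:
        abstract = (row.get("abstract") or "").strip()
        description = (row.get("description") or "").strip()
        if abstract and description:
            pairs[row.get("lang", "unknown")].append(
                {"source": abstract, "target": description}
            )
    return dict(pairs)


# Hypothetical rows for illustration only; not taken from the dataset.
rows = [
    {"lang": "en", "abstract": "Paris is the capital of France.",
     "description": "Capital of France"},
    {"lang": "fr", "abstract": "Paris est la capitale de la France.",
     "description": "Capitale de la France"},
    {"lang": "en", "abstract": "An article without a description.",
     "description": None},
]

benchmark = build_pairs(rows)
for lang, items in benchmark.items():
    print(lang, len(items))
```

Rows missing either field are dropped so that every benchmark item has both a source text and a target summary. A real pipeline would likely also deduplicate and filter by length.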