olmOCR-bench: The New Standard for PDF‑to‑Markdown OCR Evaluation
olmOCR-bench, released by AllenAI, is a benchmark dataset comprising 1,403 PDF files and 7,010 unit test cases that capture the properties a high-quality OCR system should preserve when converting PDFs to markdown. The dataset is split into seven categories (arXiv math, headers/footers, long tiny text, multi-column layouts, old scans with math, old scans without math, and tables), each reflecting a distinct challenge such as reading order, table structure, or mathematical formula rendering. The README outlines detailed evaluation criteria, including text presence/absence checks, natural reading order verification, table cell accuracy, and math formula layout matching, all designed to assess both textual fidelity and structural integrity.
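To make the unit-test idea concrete, here is a minimal sketch of two such checks: a whitespace-normalized text-presence test and a reading-order test. The function names and normalization strategy are my own illustration, not the benchmark's actual harness.

```python
import re

def _norm(s: str) -> str:
    """Collapse whitespace so line wrapping in the OCR output doesn't cause false failures."""
    return re.sub(r"\s+", " ", s).strip()

def check_text_present(markdown: str, snippet: str) -> bool:
    """Text-presence check: pass if the expected snippet appears in the OCR output."""
    return _norm(snippet) in _norm(markdown)

def check_reading_order(markdown: str, first: str, then: str) -> bool:
    """Reading-order check: pass if `first` occurs before `then` in the output,
    approximating a natural-reading-order test for multi-column pages."""
    text = _norm(markdown)
    i, j = text.find(_norm(first)), text.find(_norm(then))
    return i != -1 and j != -1 and i < j
```

A real harness would also need absence checks and fuzzier matching (e.g. edit distance) to tolerate minor OCR noise, but the pass/fail shape is the same.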
The benchmark provides a comprehensive results table comparing a range of OCR models, from open‑source systems like GOT OCR and MinerU to commercial APIs such as GPT‑4o and Gemini Flash 2. Scores are reported per document source, highlighting strengths and weaknesses across different content types. Notably, the "Ours (Anchored)" configuration achieves the highest overall score (77.4 ± 1.0), demonstrating the dataset's utility for tracking state‑of‑the‑art performance. Licensed under ODC‑BY‑1.0, olmOCR-bench is intended for research and educational use, encouraging the community to develop and evaluate OCR pipelines that can reliably produce markdown outputs from diverse PDF sources.
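Per-category reporting like the table above can be reproduced from raw pass/fail results with a simple aggregation. This sketch assumes each unit-test outcome is tagged with its category; the exact averaging and confidence-interval procedure used in the README (e.g. bootstrapping for the ± values) is not shown here.

```python
from statistics import mean

def category_pass_rates(results):
    """results: iterable of (category, passed) pairs from running the unit tests.
    Returns (per-category pass rates in %, unweighted mean over categories)."""
    by_cat = {}
    for cat, passed in results:
        by_cat.setdefault(cat, []).append(1.0 if passed else 0.0)
    rates = {cat: 100.0 * mean(outcomes) for cat, outcomes in by_cat.items()}
    overall = mean(rates.values())
    return rates, overall
```

Reporting the unweighted mean over categories keeps a large split (like table tests) from dominating the headline number.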
Project Ideas
- Build an end‑to‑end OCR pipeline that converts PDFs to markdown and benchmark it on each olmOCR‑bench split.
- Create a post‑processing model that corrects table and math formatting errors identified by the dataset's evaluation criteria.
- Analyze performance gaps across document categories to guide targeted improvements for specific OCR challenges like multi‑column layouts.
- Develop a visualization dashboard that plots OCR accuracy metrics (text presence, reading order, table accuracy) for different models using the benchmark results.
- Fine‑tune an existing OCR model on the 1,403 PDFs and evaluate the impact using the 7,010 unit tests to measure gains in markdown fidelity.
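As a starting point for the gap-analysis idea above, one can diff two models' per-category scores and sort by the worst deficits; the category keys below are illustrative placeholders, not the benchmark's exact split names.

```python
def performance_gaps(model_scores, baseline_scores):
    """Return (category, model - baseline) pairs sorted worst-first,
    so the biggest regressions surface at the top of the list."""
    shared = set(model_scores) & set(baseline_scores)
    gaps = {c: model_scores[c] - baseline_scores[c] for c in shared}
    return sorted(gaps.items(), key=lambda kv: kv[1])
```

Feeding these gaps into the dashboard idea gives a per-category view of where a fine-tuned model actually improved.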