model February 15, 2026

GLM-OCR: Multilingual, High‑Performance OCR for Complex Documents

GLM-OCR is a multimodal OCR model built on the GLM-V encoder-decoder architecture. It combines the CogViT visual encoder, a lightweight cross-modal connector, and a GLM-0.5B language decoder. The model is tagged as image-to-text (pipeline_tag: image-to-text) and supports eight languages: Chinese, English, French, Spanish, Russian, German, Japanese, and Korean (zh, en, fr, es, ru, de, ja, ko). It can be loaded through the Transformers library and is released under the MIT license.

GLM-OCR achieves state-of-the-art results on the OmniDocBench V1.5 benchmark, scoring 94.62 and ranking first overall, and performs strongly on formula recognition, table recognition, and information extraction. Its two-stage pipeline first runs PP-DocLayout-V3 for layout analysis and then recognizes the detected regions in parallel, which helps it handle complex tables, code-heavy documents, seals, and other challenging real-world layouts. With only 0.9B parameters, GLM-OCR can be served efficiently through vLLM, SGLang, Ollama, or the official SDK, reaching up to 1.86 pages/second on PDFs.
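The two-stage idea (layout analysis first, then parallel recognition of the detected regions) can be sketched in plain Python. This is a minimal illustration, not the GLM-OCR SDK: `Region`, `detect_regions`, `recognize_region`, and `parse_page` are hypothetical names, and both stages are stubbed with dummy data so the control flow can be run without any model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    """A layout region produced by stage one (hypothetical structure)."""
    kind: str    # e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page pixels
    order: int   # reading order assigned by the layout stage


def detect_regions(page_id: str) -> list[Region]:
    """Stand-in for layout analysis; returns fixed dummy regions."""
    return [
        Region("table", (0, 220, 800, 600), order=1),
        Region("text", (0, 0, 800, 200), order=0),
        Region("formula", (0, 620, 800, 700), order=2),
    ]


def recognize_region(page_id: str, region: Region) -> str:
    """Stand-in for recognition; a real system would call the OCR model here."""
    return f"<{region.kind} from {page_id} at {region.bbox}>"


def parse_page(page_id: str, max_workers: int = 4) -> list[str]:
    # Stage 1: find the regions on the page.
    regions = detect_regions(page_id)
    # Stage 2: recognize regions concurrently; map() keeps input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(lambda r: recognize_region(page_id, r), regions))
    # Reassemble the results in reading order.
    ordered = sorted(zip(regions, texts), key=lambda pair: pair[0].order)
    return [text for _, text in ordered]


if __name__ == "__main__":
    for chunk in parse_page("page-1"):
        print(chunk)
```

Splitting the page into regions before recognition is what makes the second stage parallelizable: each region is an independent request, so a serving back-end can batch or overlap them.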

GLM-OCR ships with an SDK that wraps layout analysis and structured output generation, which simplifies integration into production pipelines. The README provides installation steps for the supported serving back-ends and example code for calling the model through the Transformers API. The project acknowledges PP-DocLayout-V3, PaddleOCR, and MinerU. It targets two use cases: document parsing (raw content extraction) and information extraction (structured output driven by a JSON schema).
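For the information-extraction use case, a caller typically needs to do two things: tell the model which fields to return, and check that the returned JSON actually matches. The sketch below shows one way to do both with the standard library only. The schema format (field name mapped to a Python type), the prompt wording, and the function names are assumptions made for illustration; they are not the SDK's actual interface.

```python
import json

# Hypothetical target schema for an invoice: field name -> expected Python type.
INVOICE_SCHEMA = {"vendor": str, "total": float, "line_items": list}


def build_prompt(schema: dict) -> str:
    """Build an instruction asking for JSON with exactly the schema's keys."""
    template = {key: "" for key in schema}
    return "Extract the following fields as JSON:\n" + json.dumps(template, indent=2)


def validate_response(raw: str, schema: dict) -> dict:
    """Parse the model's reply and check required keys and value types."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in schema if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for key, expected in schema.items():
        value = data[key]
        # JSON has a single number type, so accept ints where floats are expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
        data[key] = value
    return data


if __name__ == "__main__":
    print(build_prompt(INVOICE_SCHEMA))
    reply = '{"vendor": "Acme", "total": 120, "line_items": [{"name": "bolt", "qty": 3}]}'
    print(validate_response(reply, INVOICE_SCHEMA))
```

Validating on the caller's side keeps malformed or incomplete replies from reaching downstream systems, whatever prompt format the model expects.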

Project Ideas

  1. Create a multilingual document digitization service that extracts raw text, tables, and formulas from scanned PDFs using the GLM-OCR SDK.
  2. Build an automated invoice processing pipeline that parses line items, totals, and vendor information and outputs structured JSON for accounting systems.
  3. Develop a scientific paper assistant that recognizes formulas in scanned papers and converts them into editable LaTeX for researchers.
  4. Implement a real‑time OCR chatbot that accepts image uploads and returns extracted content or answers questions about the document.
  5. Design a table extraction tool for financial reports that outputs tables as CSV files, leveraging GLM-OCR's high accuracy on complex layouts.
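As a starting point for the table extraction idea, the post-processing step could look like the sketch below. It assumes the recognized table arrives as a Markdown pipe table; that is an assumption for illustration, and the model's actual table format (for example HTML) may differ. The function name `markdown_table_to_csv` is hypothetical, and escaped pipes inside cells are not handled.

```python
import csv
import io


def markdown_table_to_csv(markdown: str) -> str:
    """Convert a Markdown pipe table to CSV text."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |------|---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    buffer = io.StringIO()
    # The csv module quotes cells that contain commas, such as "1,200".
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    table = """
    | Item    | Q1    | Q2    |
    |---------|------:|------:|
    | Revenue | 1,200 | 1,350 |
    | Cost    | 800   | 900   |
    """
    print(markdown_table_to_csv(table))
```

Using the `csv` module rather than joining cells with commas matters for financial data, where thousands separators inside a cell would otherwise split one value into two columns.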