DiffusionGemma 26B: Fast Multimodal Text Generation with Vision and Reasoning
google/diffusiongemma-26B-A4B-it ↗
DiffusionGemma-26B-A4B-it is an open‑weights multimodal model from Google DeepMind, built on the 26‑billion‑parameter A4B Mixture‑of‑Experts (MoE) Gemma 4 architecture. It follows a novel discrete diffusion approach that denoises blocks of tokens ("canvas") in parallel, delivering generation speeds of over 1100 tokens per second on a single accelerator. The model supports the **image‑text‑to‑text** pipeline, accepting interleaved text, image, and video inputs and producing textual outputs.
The architecture combines an encoder that caches the prompt context with a bidirectional decoder that processes each canvas. Only 8 of 128 experts are active at inference time, keeping the active parameter count to 3.8 B and allowing deployment on a range of hardware. Features include a configurable "thinking mode" for step‑by‑step reasoning, native system‑prompt handling, function‑calling support, and a context window up to 256 K tokens, making it suitable for long‑form tasks.
Vision capabilities cover object detection, OCR (including multilingual), document parsing, UI understanding, chart comprehension, and video analysis (up to 60 s at 1 fps). The model is multilingual out‑of‑the‑box, supporting 35+ languages and pre‑trained on data from over 140 languages. Benchmarks show strong performance on reasoning, coding, and vision‑language tasks, while safety evaluations report improvements over previous Gemma releases.
Getting started is straightforward with the latest Transformers library: `DiffusionGemmaForBlockDiffusion` and `AutoProcessor` can be loaded with a single `from_pretrained` call, and generation follows standard chat‑template usage. The model is released under the Apache 2.0 license and is hosted on Hugging Face under the tags `transformers`, `safetensors`, and `image-text-to-text`.
Project Ideas
- Create a multimodal chatbot that answers user queries by processing both text and uploaded images, leveraging the model's thinking mode for detailed explanations.
- Build an OCR‑driven document summarizer that extracts text from scanned PDFs or photos and generates concise summaries in multiple languages.
- Develop a video description tool that takes short video clips (≤60 seconds) and produces narrated captions or scene‑by‑scene explanations.
- Implement a code‑assistant that accepts screenshots of code, performs visual code understanding, and returns debugging suggestions or completions.
- Design a low‑latency knowledge‑base query system that indexes large text corpora (up to 256 K tokens) and retrieves answers with fast diffusion‑based generation.