model March 09, 2026

Phi-4 Reasoning Vision 15B: Multimodal AI with Chain‑of‑Thought Power

Phi-4-Reasoning-Vision-15B is an open‑weight multimodal model released by Microsoft on March 4, 2026. It pairs the Phi‑4‑Reasoning language backbone (5B–15B parameters) with a SigLIP‑2 vision encoder in a mid‑fusion architecture, so it can ingest both text and images and generate text responses. The model supports the `image-text-to-text` pipeline tag and is built for English‑only vision‑language tasks. Training ran on 240 NVIDIA B200 GPUs for four days and combined supervised fine‑tuning on a curated mix of reasoning and perception data with a safety post‑training step that teaches refusal behavior for harmful content.
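Conceptually, mid‑fusion means the vision encoder's patch embeddings are projected into the language model's hidden space and spliced into the token sequence, rather than fused only at a final output stage. A minimal sketch of that projection‑and‑splice step, with toy dimensions and random weights standing in for the real SigLIP‑2 encoder and Phi‑4 backbone (all names and sizes here are illustrative, not the model's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)

D_VISION = 64   # toy patch-embedding width from the vision encoder
D_MODEL = 128   # toy language-model hidden width

# Projection from vision space into the LM hidden space
# (random here; learned in the real model).
W_proj = rng.standard_normal((D_VISION, D_MODEL)) * 0.02

def fuse_image_tokens(text_hidden: np.ndarray,
                      patch_embeds: np.ndarray,
                      insert_at: int) -> np.ndarray:
    """Project image patch embeddings and splice them into the
    text token sequence at position `insert_at`."""
    image_hidden = patch_embeds @ W_proj  # (num_patches, D_MODEL)
    return np.concatenate(
        [text_hidden[:insert_at], image_hidden, text_hidden[insert_at:]],
        axis=0,
    )

text_hidden = rng.standard_normal((10, D_MODEL))    # 10 text tokens
patch_embeds = rng.standard_normal((16, D_VISION))  # 16 image patches

fused = fuse_image_tokens(text_hidden, patch_embeds, insert_at=4)
print(fused.shape)  # (26, 128): 10 text tokens + 16 image tokens
```

After the splice, the combined sequence flows through the remaining transformer layers as ordinary tokens, which is what lets text attend to image content during generation.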

The model excels at visual question answering and reasoning over complex visual inputs. Reported benchmark accuracies are 84.8% on AI2D (diagram understanding), 83.3% on ChartQA, 75.2% on MathVista‑MINI, 88.2% on ScreenSpot‑V2 (GUI grounding), and 76% on OCRBench. It can switch between a fast "NOTHINK" mode for straightforward perception tasks and a structured "THINK" mode that emits chain‑of‑thought reasoning inside `<think>...</think>` blocks, making it suitable for both quick captioning and multi‑step problem solving.
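In THINK mode, downstream code usually needs to separate the chain‑of‑thought from the user‑facing answer. A small parser for the `<think>...</think>` convention described above (the output format beyond that tag is an assumption, and the sample completion is made up):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a THINK-mode completion.

    Text inside <think>...</think> is treated as chain-of-thought;
    everything outside it is the final answer. A NOTHINK completion
    has no think block, so reasoning comes back empty.
    """
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(output))
    answer = THINK_RE.sub("", output).strip()
    return reasoning, answer

completion = (
    "<think>The chart's y-axis is in thousands, so the 2024 bar "
    "at 4.2 means 4200 units.</think>\n"
    "The 2024 sales figure is 4,200 units."
)
reasoning, answer = split_reasoning(completion)
print(answer)  # The 2024 sales figure is 4,200 units.
```

The same helper degrades gracefully on NOTHINK output: `split_reasoning("A red bus.")` returns `("", "A red bus.")`, so one code path can handle both modes.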

Phi-4‑Reasoning‑Vision‑15B is positioned for resource‑constrained environments while still delivering high‑quality multimodal reasoning. Its primary use cases include scientific and mathematical problem solving from images (handwritten equations, charts, diagrams), computer‑use agents that interpret screen content and locate GUI elements, and general tasks such as image captioning, OCR, and object grounding. The model is released under an MIT license, distributed via Hugging Face, GitHub, and Azure AI Foundry, and requires PyTorch ≥ 2.7.1 and Transformers ≥ 4.57.1, with vLLM optional for inference.
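Before loading the model it is worth verifying the stated minimum dependency versions. A small check against the PyTorch ≥ 2.7.1 and Transformers ≥ 4.57.1 requirements using plain version‑string comparison (the package names are the standard ones; the helper itself is illustrative, and a real project might use `packaging.version` instead):

```python
def version_tuple(v: str) -> tuple[int, ...]:
    """Parse a dotted version like '4.57.1' into a comparable tuple,
    ignoring local suffixes such as '2.7.1+cu128'."""
    parts = []
    for piece in v.split("+")[0].split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

# Minimum versions stated in the model's requirements.
MINIMUMS = {"torch": "2.7.1", "transformers": "4.57.1"}

def check_requirements(installed: dict[str, str]) -> list[str]:
    """Return a list of packages that are missing or too old."""
    problems = []
    for name, minimum in MINIMUMS.items():
        have = installed.get(name)
        if have is None or version_tuple(have) < version_tuple(minimum):
            problems.append(f"{name} >= {minimum} required, found {have}")
    return problems

# Example: a CUDA build of torch passes, an older transformers fails.
print(check_requirements({"torch": "2.7.1+cu128", "transformers": "4.56.0"}))
```

In practice the `installed` dict would be filled from `torch.__version__` and `transformers.__version__` at runtime.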

Project Ideas

  1. Build an educational app that solves math problems from photographed worksheets using the model's THINK mode for step‑by‑step explanations.
  2. Create a desktop assistant that reads screen captures, identifies buttons or menus with GUI grounding, and suggests next actions for users with accessibility needs.
  3. Develop a data‑analysis tool that extracts quantitative information from charts and graphs in PDFs, returning structured tables via the model's visual question answering capability.
  4. Implement an OCR‑enhanced document search engine that indexes scanned documents and answers natural‑language queries about their content using the model's image‑text‑to‑text pipeline.
  5. Design a research assistant that generates detailed captions and reasoning summaries for scientific figures, supporting both quick NOTHINK descriptions and deeper THINK analyses.