dataset June 10, 2026

MR‑RATE: Massive 3D MRI‑Report Dataset Fuels Multimodal Medical AI

The MR‑RATE dataset, released by Forithmus in partnership with the University of Zurich and NVIDIA, provides a comprehensive collection of 705,254 non‑contrast and contrast‑enhanced brain and spine MRI volumes from 98,334 imaging studies of 83,425 unique patients. Each study is paired with an anonymized radiology report that has been structured into clinical information, technique, findings, and impression sections using an LLM pipeline. The data are stored in NIfTI format with accompanying CSV metadata, pathology labels (37 categories), brain masks, and defacing masks, all released under a CC‑BY‑NC‑SA license.
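Because the volumes ship as NIfTI files, a quick sanity check of shape and voxel spacing needs nothing beyond the fixed 348-byte NIfTI-1 header. The following is a minimal sketch using only the Python standard library; the function name `read_nifti1_header` is hypothetical, the sketch assumes NIfTI-1 (not NIfTI-2) files, and in practice a dedicated library such as nibabel would be the usual choice.

```python
import gzip
import struct


def read_nifti1_header(path):
    """Return shape, voxel spacing, and datatype info from a NIfTI-1 file."""
    # .nii.gz files are gzip-compressed; plain .nii files are not.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as fh:
        header = fh.read(348)  # a NIfTI-1 header is always 348 bytes
    if len(header) < 348:
        raise ValueError("file too short to hold a NIfTI-1 header")

    # sizeof_hdr (offset 0) must equal 348; its byte order gives the endianness.
    if struct.unpack("<i", header[0:4])[0] == 348:
        endian = "<"
    elif struct.unpack(">i", header[0:4])[0] == 348:
        endian = ">"
    else:
        raise ValueError("not a NIfTI-1 header")

    # The magic string (offset 344) is "n+1" for single-file NIfTI.
    if header[344:348] not in (b"n+1\x00", b"ni1\x00"):
        raise ValueError("unexpected NIfTI magic string")

    dim = struct.unpack(endian + "8h", header[40:56])      # dim[0] = number of axes
    datatype, bitpix = struct.unpack(endian + "2h", header[70:74])
    pixdim = struct.unpack(endian + "8f", header[76:108])  # pixdim[1..] = spacing

    ndim = dim[0]
    if not 1 <= ndim <= 7:
        raise ValueError("invalid number of dimensions")
    return {
        "shape": dim[1:1 + ndim],
        "spacing": pixdim[1:1 + ndim],
        "datatype": datatype,
        "bitpix": bitpix,
    }
```

Calling `read_nifti1_header` on a downloaded volume returns a small dictionary that can be compared against the CSV metadata before any voxel data are loaded.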

Beyond the native‑space volumes, MR‑RATE offers several derivative sets: (1) co‑registered volumes where all modalities within a study are aligned to a T1‑weighted centre image, (2) atlas‑registered volumes aligned to the MNI152 template, (3) voxel‑wise multi‑label segmentations for brain and body generated with NVIDIA’s NV‑Segment‑CTMR model, and (4) automatically derived pathology labels obtained via a large language model. Patient‑level train/validation/test splits are provided to enable reproducible benchmarking across tasks such as image‑to‑text, text‑to‑image, image classification, visual question answering, and zero‑shot classification.
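Patient-level splitting means that every study from a given patient lands in the same split, so no patient appears in both training and evaluation data. The following is a minimal sketch of a leakage check over split metadata; the column names `patient_id`, `study_id`, and `split` and the toy rows are assumptions for illustration, not the dataset's actual CSV schema.

```python
import csv
import io
from collections import defaultdict


def find_leaked_patients(rows, patient_col="patient_id", split_col="split"):
    """Return {patient: [splits]} for patients that appear in more than one split."""
    splits_by_patient = defaultdict(set)
    for row in rows:
        splits_by_patient[row[patient_col]].add(row[split_col])
    return {
        patient: sorted(splits)
        for patient, splits in splits_by_patient.items()
        if len(splits) > 1
    }


# Toy metadata standing in for the real CSV (hypothetical columns and IDs).
SAMPLE = """patient_id,study_id,split
P001,S001,train
P001,S002,train
P002,S003,val
P003,S004,test
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
leaks = find_leaked_patients(rows)
print(leaks)  # an empty dict means no patient crosses a split boundary
```

An empty result is the expected outcome for a correct patient-level split; any non-empty entry names a patient whose studies would leak between splits.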

The dataset is distributed across four Hugging Face repositories to accommodate its 8.1 TB of native‑space data and the additional derivative volumes (up to 17.6 TB for the co‑registered data). Access is gated by a terms‑of‑use form that requires academic‑only, non‑commercial usage and compliance with privacy regulations (GDPR, HIPAA). The companion vision‑language foundation model weights are announced as “coming soon,” but the dataset can already be used to train and evaluate multimodal medical AI systems and to support synthetic data generation with NVIDIA’s NV‑Generate‑MR‑Brain model.
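For storage planning, the published counts give some rough averages. The following back-of-the-envelope sketch assumes decimal terabytes (10^12 bytes) and a uniform volume size, neither of which is stated in the release; `subset_size_tb` is a hypothetical helper, and real studies will vary widely in size.

```python
VOLUMES = 705_254    # MRI volumes
STUDIES = 98_334     # imaging studies
PATIENTS = 83_425    # unique patients
NATIVE_TB = 8.1      # native-space size in TB

TB = 10**12          # assumption: decimal terabytes
MB = 10**6

volumes_per_study = VOLUMES / STUDIES         # roughly 7.2
studies_per_patient = STUDIES / PATIENTS      # roughly 1.18
mean_volume_mb = NATIVE_TB * TB / VOLUMES / MB  # roughly 11.5 MB per volume


def subset_size_tb(n_studies):
    """Estimate native-space storage for n_studies, assuming average-sized studies."""
    return NATIVE_TB * n_studies / STUDIES


print(f"{volumes_per_study:.2f} volumes per study")
print(f"{studies_per_patient:.2f} studies per patient")
print(f"{mean_volume_mb:.1f} MB per volume on average")
print(f"{subset_size_tb(1_000) * 1000:.0f} GB for a 1,000-study subset")
```

Under these assumptions a 1,000-study pilot subset needs on the order of 80 GB, which is a more practical starting point than mirroring the full release.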

Project Ideas

  1. Train a multimodal encoder‑decoder model to generate radiology reports directly from 3D MRI volumes using the native‑space images and structured reports.
  2. Develop a visual question answering system that answers clinical queries (e.g., “Is there evidence of hemorrhage?”) by leveraging the MRI volumes and associated report findings.
  3. Build a zero‑shot classifier for the 37 pathology labels that tags new MRI studies without task‑specific training, and use the provided labels to measure its accuracy.
  4. Create a pipeline that compares native‑space, co‑registered, and atlas‑registered volumes to evaluate registration quality and improve group‑level analyses.
  5. Use the NV‑Segment‑CTMR segmentations to build an automated ROI extraction tool for quantitative analysis of brain and spine structures.
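
As a starting point for the ROI extraction idea above, per-structure volumes follow from counting the voxels assigned to each label and multiplying by the voxel volume. The following is a minimal sketch on a flattened toy label map; the label IDs and the function name `roi_volumes_ml` are hypothetical, and a real pipeline would load the segmentation and its spacing from the NIfTI files.

```python
from collections import Counter


def roi_volumes_ml(labels, spacing_mm, background=0):
    """Return {label: volume in millilitres} for a flattened label map.

    labels     -- iterable of integer label IDs, one per voxel
    spacing_mm -- (x, y, z) voxel spacing in millimetres
    """
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    counts = Counter(labels)
    counts.pop(background, None)  # ignore background voxels
    # 1 millilitre = 1000 cubic millimetres
    return {label: n * voxel_mm3 / 1000.0 for label, n in counts.items()}


# Toy example: six voxels, 1 x 1 x 2 mm spacing, two hypothetical structures.
toy_labels = [0, 0, 1, 1, 1, 2]
volumes = roi_volumes_ml(toy_labels, spacing_mm=(1.0, 1.0, 2.0))
for label, ml in sorted(volumes.items()):
    print(f"label {label}: {ml:.3f} mL")
```

The same counting approach extends to tracking a structure across a patient's studies, provided the segmentations share a label convention and the spacing is read from each volume rather than assumed.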