Dataset · March 28, 2026

Michael Hafftka Catalog Raisonné: 3.8K Paintings with Rich Metadata

The **Michael Hafftka – Catalog Raisonné** dataset is a curated collection of roughly 3,800 digitized paintings by the American expressionist Michael Hafftka, spanning the period from the 1970s through 2025. Each entry pairs a high‑resolution image (stored in optimized Parquet format) with structured metadata fields such as title, year, medium, dimensions, collection, and a unique catalog number. The dataset totals about 53 GB and is distributed under a CC‑BY‑NC‑4.0 license, permitting non‑commercial use with attribution.

What makes this dataset stand out is its **single‑artist consistency** and **longitudinal coverage**. Unlike generic art collections, every work originates from the same hand, providing a rare opportunity to study stylistic evolution, technique shifts, and thematic changes across five decades. The inclusion of museum‑provenance pieces from institutions like the Met, MoMA, and SFMOMA adds scholarly value, while multiple photographic views (front, detail, side panels) enrich visual analysis.

The dataset is tagged for multimodal, computer‑vision, and digital‑humanities tasks, and the README highlights several research avenues: training LoRA or style‑transfer models, building image‑plus‑metadata retrieval (RAG) systems, and exploring generative‑art experiments. Its compatibility with the Hugging Face `datasets` library, as well as Dask, Polars, and mlcroissant, makes it straightforward to load, filter, and process for both exploratory analysis and large‑scale model training.
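A typical load-and-filter workflow can be sketched as follows. This is a minimal illustration, not the dataset's documented API surface: the repository id is left as a placeholder (check the dataset card for the real one), and the field names (`title`, `year`, `medium`) are assumed from the metadata description above.

```python
# Sketch: stream the dataset and filter rows by metadata, assuming the
# field names described above. Streaming avoids downloading all ~53 GB.
#
# from datasets import load_dataset
# ds = load_dataset("<repo-id-from-dataset-card>", split="train", streaming=True)
# oils_1980s = (ex for ex in ds if is_1980s_oil(ex))

def is_1980s_oil(example: dict) -> bool:
    """Keep works from the 1980s whose medium mentions oil."""
    year = example.get("year")
    medium = example.get("medium") or ""  # medium may be missing on older works
    return year is not None and 1980 <= year <= 1989 and "oil" in medium.lower()

# Demonstration on stand-in records shaped like the assumed metadata fields.
records = [
    {"title": "Untitled", "year": 1984, "medium": "Oil on canvas"},
    {"title": "Study", "year": 2003, "medium": "Watercolor"},
    {"title": "Portrait", "year": 1987, "medium": None},
]
print([r["title"] for r in records if is_1980s_oil(r)])  # → ['Untitled']
```

The same predicate works unchanged whether the rows come from a streamed `datasets` iterator, a Polars frame converted to dicts, or a plain list.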

Because metadata completeness varies—earlier works often lack fields like medium or dimensions—researchers can also investigate methods for handling missing data, archival documentation practices, and the impact of heterogeneous image quality on model performance. Overall, the dataset offers a rich playground for art historians, machine‑learning engineers, and creative technologists alike.
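A natural first step for the missing-data angle is a completeness audit over the metadata fields. A minimal sketch, assuming the field names listed above (the records here are illustrative stand-ins, not real dataset rows):

```python
from collections import Counter

def field_completeness(records: list, fields: list) -> dict:
    """Return, per field, the fraction of records with a non-empty value."""
    counts = Counter()
    for rec in records:
        for f in fields:
            if rec.get(f) not in (None, ""):
                counts[f] += 1
    total = len(records)
    return {f: counts[f] / total for f in fields}

# Stand-in records mimicking sparser metadata on earlier works.
records = [
    {"title": "Early Work", "year": 1976, "medium": None, "dimensions": None},
    {"title": "Mid Career", "year": 1995, "medium": "Oil on linen", "dimensions": None},
    {"title": "Recent", "year": 2021, "medium": "Oil on canvas", "dimensions": "60 x 48 in"},
]
print(field_completeness(records, ["title", "year", "medium", "dimensions"]))
```

Grouping the same statistic by decade would show directly whether incompleteness is concentrated in the earlier works, which in turn informs whether to impute, drop, or model the missing fields.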

Project Ideas

  1. Fine‑tune a LoRA adapter on the images to create a style‑transfer model that reproduces Hafftka's expressionist brushwork.
  2. Build a multimodal RAG system that retrieves paintings based on textual queries using the title, year, and collection metadata.
  3. Create an interactive timeline visualization that maps Hafftka's works by decade, showing stylistic trends and medium changes.
  4. Train a decade‑classification model on the images and year field to predict the creation period of unknown artworks.
  5. Generate synthetic Hafftka‑style paintings with a diffusion model conditioned on metadata such as medium and dimensions.
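For the decade-classification idea (item 4), the training labels can be derived directly from the metadata. A minimal sketch of the label mapping, with the `year` field name assumed from the dataset description:

```python
def year_to_decade_label(year: int) -> str:
    """Map a creation year to a decade label such as '1970s'."""
    return f"{(year // 10) * 10}s"

# Build decade labels from metadata rows; the dicts below are stand-ins
# for dataset rows, which would also carry the painting image itself.
rows = [{"year": 1978}, {"year": 1984}, {"year": 2012}]
labels = [year_to_decade_label(r["year"]) for r in rows]
print(labels)  # → ['1970s', '1980s', '2010s']
```

Pairing each label with the corresponding image yields a supervised set for any standard image classifier; works with a missing `year` would need to be held out or handled per the missing-data discussion above.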