dataset May 29, 2026

MONET: 105M Enriched Image‑Text Pairs for Text‑to‑Image Pre‑training

The MONET (Massive, Open, Non‑redundant and Enriched Text‑to‑image) dataset provides 104.9 million high‑quality image‑text pairs curated from 2.9 billion raw pairs across nine open sources. It combines six real‑world datasets (LAION‑2B‑en, COYO, Common‑Catalog‑CC‑BY, Megalith‑10M, Conceptual‑12M, Diffusion‑Aesthetic‑4K) with three synthetic generators (Z‑Image, FLUX‑2‑klein‑4B, FLUX‑1‑schnell). A multi‑stage pipeline applies safety filtering, domain and resolution constraints, duplicate removal (URL, perceptual hash, SSCD near‑duplicate detection), and re‑captioning with four vision‑language models (Florence‑2, InternVL‑3‑8B, ShareGPT4V‑7B, Gemini‑2.5‑flash‑lite). The final collection includes original alt‑text captions and up to four synthetic captions per image.
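One common way to do embedding-based near-duplicate removal is to drop any image whose copy-detection descriptor is too similar to one already kept. The sketch below illustrates that idea only. The greedy strategy, the 0.9 threshold, and the function names are assumptions for illustration, not the pipeline's documented settings.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def greedy_dedup(embeddings, threshold=0.9):
    """Return indices of items kept after greedy near-duplicate removal.

    An item is dropped when its cosine similarity to any already-kept
    item is at or above `threshold` (a hypothetical value).
    """
    kept = []
    kept_vecs = []
    for idx, vec in enumerate(embeddings):
        unit = normalize(vec)
        is_dup = any(
            sum(a * b for a, b in zip(unit, other)) >= threshold
            for other in kept_vecs
        )
        if not is_dup:
            kept.append(idx)
            kept_vecs.append(unit)
    return kept

# Toy 2-d descriptors; real SSCD descriptors are 512-d.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = greedy_dedup(toy)
```

This pairwise loop is quadratic in the number of images, so at billions of pairs an approximate nearest-neighbour index would be needed in its place.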

Each entry ships with extensive metadata: aesthetic scores (JasperAI and LAION‑improved), NSFW scores from three classifiers, watermark probability, and various hashes for provenance. Structured annotations include YOLO‑v9e object detections (80 COCO categories), YOLO‑v8x ImageNet‑1k classification scores, and MediaPipe face landmarks. Pre‑computed embeddings are provided for CLIP ViT‑B/32 (512‑d), DINOv2 ViT‑G/14 (1536‑d), and SSCD copy‑detection (512‑d). A pre‑encoded SANA‑VAE latent enables direct latent‑diffusion training without loading full‑resolution images. The dataset is released under Apache‑2.0 and is intended for text‑to‑image pre‑training, image‑text retrieval, and zero‑shot image classification.
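Per-image scores of this kind lend themselves to simple threshold filters. The following is a minimal sketch. The field names (`nsfw_scores`, `watermark_prob`, `aesthetic`) and all thresholds are hypothetical placeholders, not the dataset's actual column names or recommended cut-offs.

```python
def passes_filter(record, max_nsfw=0.2, max_watermark=0.5, min_aesthetic=5.0):
    """Keep a record only if all NSFW scores and the watermark probability
    are low enough and the aesthetic score is high enough."""
    # Use the most pessimistic of the NSFW classifiers.
    worst_nsfw = max(record["nsfw_scores"])
    return (
        worst_nsfw <= max_nsfw
        and record["watermark_prob"] <= max_watermark
        and record["aesthetic"] >= min_aesthetic
    )

# Toy records with hypothetical field names.
records = [
    {"id": "a", "nsfw_scores": [0.01, 0.03, 0.02], "watermark_prob": 0.1, "aesthetic": 6.2},
    {"id": "b", "nsfw_scores": [0.01, 0.70, 0.02], "watermark_prob": 0.1, "aesthetic": 6.8},
    {"id": "c", "nsfw_scores": [0.02, 0.01, 0.01], "watermark_prob": 0.9, "aesthetic": 5.5},
]
kept_ids = [r["id"] for r in records if passes_filter(r)]
```

Taking the maximum over the NSFW classifiers is a conservative choice: a single high score is enough to exclude an image.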

The repository offers two configurations: a parquet view with low‑resolution thumbnails and all metadata/embeddings for fast browsing and filtering, and a webdataset view with full‑resolution JPEGs and binary tensors for large‑scale training. The accompanying example code covers streaming loading, source‑based filtering, and multi‑node training with WebDataset. The authors note that a 4B‑parameter latent diffusion model trained solely on MONET achieves competitive GenEval and DPG scores, demonstrating the dataset’s utility for reproducible large‑scale text‑to‑image research.
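Multi-node training over sharded archives usually relies on giving each node a disjoint subset of shards. The sketch below shows one simple way to do that, a round-robin split by rank. The shard file names and the helper `shards_for_rank` are hypothetical, and the repository's own example code may split shards differently.

```python
def shards_for_rank(shard_urls, rank, world_size):
    """Round-robin assignment: rank r takes shards r, r + world_size, ..."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return shard_urls[rank::world_size]

# Hypothetical shard names; the real file layout may differ.
all_shards = [f"shard-{i:05d}.tar" for i in range(8)]
node0 = shards_for_rank(all_shards, rank=0, world_size=4)
node3 = shards_for_rank(all_shards, rank=3, world_size=4)
```

Because the subsets are disjoint and together cover every shard, no sample is read twice within an epoch, provided the shard count is at least the number of nodes.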

Project Ideas

Fine‑tune a latent diffusion model directly on the pre‑encoded SANA‑VAE latents to create a new text‑to‑image generator.
Build an image‑text retrieval engine using the CLIP ViT‑B/32 embeddings and evaluate recall on a held‑out subset.
Create a zero‑shot image classification benchmark by comparing the CLIP ViT‑B/32 image embeddings against text embeddings of novel category names.
Develop a safety‑filter pipeline that combines the three NSFW scores, watermark probability, and aesthetic scores to flag low‑quality or unsafe content.
Train an object detection model that uses the YOLO‑v9e detections as pseudo‑labels and the DINOv2 embeddings as image features for downstream tasks such as scene understanding.
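For the retrieval idea above, recall@k is a natural evaluation metric: the fraction of text queries whose paired image appears among the k most similar images. A minimal sketch, assuming text and image embeddings live in the same space and are paired by index; the toy 3-d vectors stand in for real 512-d CLIP embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(text_embs, image_embs, k):
    """Fraction of text queries whose paired image (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(text_embs):
        ranked = sorted(
            range(len(image_embs)),
            key=lambda j: cosine(query, image_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)

# Toy embeddings; index i of text_embs is paired with index i of image_embs.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.1, 0.3, 0.8]]
r1 = recall_at_k(text_embs, image_embs, k=1)
r2 = recall_at_k(text_embs, image_embs, k=2)
```

On a real held-out subset the full sort per query would be replaced by a batched matrix product and a top-k selection, but the metric is the same.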
