Zero-to-CAD 1M: A Million Synthetic, Executable CAD Programs for AI Generation
Zero-to-CAD 1M, released by Autodesk Research (ADSKAILab), is a synthetic dataset of 999,633 executable CadQuery Python scripts that construct parametric 3‑D models. Each entry includes the source code, rendered multi‑view images, STL and STEP files, and detailed metadata such as operation lists, face counts, and latency measures. The data were generated by an agentic LLM loop that iteratively writes, executes, and validates code inside a CadQuery environment, covering a broad vocabulary of CAD operations (booleans, fillets, lofts, sweeps, patterns, etc.) across 65 part categories.
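The write, execute, validate cycle mentioned above can be illustrated with a generic sketch. This is not the authors' pipeline: `propose` is a hypothetical stand-in for an LLM call, the rule that a valid script binds a variable named `result` is an assumption, and the toy candidates are plain Python rather than CadQuery so the example runs without extra dependencies.

```python
def run_candidate(source):
    """Execute a script in a fresh namespace and report (ok, detail).

    Assumption: a valid script binds its final shape to a variable named `result`.
    Note: exec() is NOT a sandbox; a real pipeline would isolate untrusted code.
    """
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
    except Exception as exc:  # syntax and runtime errors both become feedback
        return False, f"{type(exc).__name__}: {exc}"
    if "result" not in namespace:
        return False, "script did not define `result`"
    return True, namespace["result"]


def generate_until_valid(propose, max_rounds=3):
    """Ask `propose(feedback)` for code until one candidate runs cleanly."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        source = propose(feedback)
        ok, detail = run_candidate(source)
        if ok:
            return source, round_no
        feedback = detail  # pass the error text back to the generator
    return None, max_rounds


def toy_propose(feedback):
    """Hypothetical generator: the first draft is buggy, the retry is fixed."""
    if feedback is None:
        return "result = width * 2"  # fails because `width` is undefined
    return "width = 10\nresult = width * 2"


accepted, rounds = generate_until_valid(toy_propose)
```

In a real setting the validation step would presumably also inspect the resulting geometry (for example, that the solid is non-empty), which this sketch leaves out.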
The dataset is split into 979,633 training samples and 10,000 each for validation and test. It is stored in Parquet format and can be streamed via the `datasets` library to avoid downloading the full million‑row corpus. Pre‑computed DINOv3 visual embeddings and a FAISS IVF‑PQ index are provided, enabling fast nearest‑neighbor search over the entire collection. Example code in the README shows how to load a sample, execute the CadQuery script, and visualize the generated geometry.
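A minimal sketch of the streaming access described above, using the `streaming=True` option of `datasets.load_dataset`. The repository ID and the split name are assumptions and should be checked against the dataset card; `take` and `summarize` are illustrative helpers, not part of any library.

```python
from itertools import islice

# Assumption: the Hub repository ID; verify it on the dataset card.
REPO_ID = "ADSKAILab/Zero-to-CAD-1M"


def stream_split(repo_id=REPO_ID, split="validation"):
    """Return a lazily streamed split (needs `pip install datasets` and network access).

    Assumption: the split is named "validation"; the real name may differ.
    """
    from datasets import load_dataset  # imported lazily so the helpers work without it

    return load_dataset(repo_id, split=split, streaming=True)


def take(rows, n):
    """Materialize only the first n rows of any iterable."""
    return list(islice(rows, n))


def summarize(row, fields=("num_faces",)):
    """Pick a few fields from a row dict, returning None for missing keys."""
    return {name: row.get(name) for name in fields}
```

With network access, `take(stream_split(), 3)` would fetch only the first three rows rather than the full corpus, and `summarize(row)` gives a compact view of each one.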
Intended uses include training generative models that produce parametric CAD code from text or images, benchmarking image‑to‑3D reconstruction pipelines, studying the semantics of CAD construction sequences, and building retrieval systems that match visual queries to editable CAD programs. Because the data are entirely synthetic, they are free of proprietary design constraints but may not reflect real‑world manufacturing distributions.
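The retrieval use case reduces to nearest-neighbor search over embedding vectors. The sketch below swaps in a brute-force cosine search in pure Python for the provided FAISS index, purely to show the idea; the three-dimensional vectors and part names are made-up toy data, not values from the dataset.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, embeddings, k=2):
    """Return the k (key, score) pairs most similar to the query, best first."""
    scored = sorted(
        ((cosine(query, vec), key) for key, vec in embeddings.items()),
        reverse=True,
    )
    return [(key, score) for score, key in scored[:k]]


# Toy stand-ins for image embeddings (hypothetical values).
TOY_EMBEDDINGS = {
    "bracket": [0.9, 0.1, 0.0],
    "flange": [0.1, 0.9, 0.1],
    "gear": [0.0, 0.2, 0.9],
}

hits = top_k([0.8, 0.2, 0.1], TOY_EMBEDDINGS, k=2)
```

Brute-force search compares the query against every vector, which is impractical at a million rows; an approximate index such as IVF-PQ trades a little accuracy for much faster lookups.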
The authors note several limitations: the dataset contains no real‑world CAD files, designs may be incoherent as a whole even when each operation is locally valid, and unit conventions may differ from specific engineering standards. Nevertheless, the scale, diversity of operations, and accompanying embeddings make Zero-to-CAD 1M a valuable resource for research at the intersection of AI and computer‑aided design.
Project Ideas
- Fine‑tune a language model on the CadQuery scripts to create a text‑to‑CAD generator that turns natural language part descriptions into executable parametric code.
- Build an image‑based CAD retrieval web app that queries the provided FAISS DINOv3 index to find visually similar parts from user‑uploaded sketches or photos.
- Train a multi‑label classifier that predicts the set of CAD operations used (extrude, fillet, loft, etc.) from the eight rendered views, leveraging the `cadquery_ops_json` field as ground truth.
- Develop an interactive demo that streams samples, runs the CadQuery code in a sandbox, and displays the resulting 3‑D model alongside its pre‑rendered views for rapid prototyping.
- Create a regression model that estimates geometry complexity (e.g., number of B‑Rep faces) from code length or token count, using the `cadquery_file` field as input and the `num_faces` field as the supervision target.
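As a starting point for the operation-classifier idea, the sketch below derives a multi-hot label vector from a script by listing the method calls in its syntax tree, without executing it. This could serve as a sanity check alongside `cadquery_ops_json`, whose exact schema is not assumed here. `OP_VOCAB` is a hypothetical vocabulary, and the example script is illustrative rather than taken from the dataset.

```python
import ast

# Hypothetical label vocabulary; the dataset's real operation list may differ.
OP_VOCAB = ["box", "circle", "extrude", "fillet", "chamfer",
            "hole", "loft", "sweep", "cut", "union"]


def called_methods(source):
    """Return the names of all methods invoked as `something.name(...)`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            names.add(node.func.attr)
    return names


def multi_hot(ops, vocab=OP_VOCAB):
    """Encode a set of operation names as a 0/1 vector over the vocabulary."""
    return [1 if op in ops else 0 for op in vocab]


# Illustrative CadQuery-style script; it is only parsed, never run.
EXAMPLE_SCRIPT = '''
import cadquery as cq
result = (
    cq.Workplane("XY")
    .box(40, 20, 10)
    .edges("|Z")
    .fillet(2)
    .faces(">Z")
    .workplane()
    .hole(5)
)
'''

ops = called_methods(EXAMPLE_SCRIPT)
labels = multi_hot(ops)
```

Because the script is parsed rather than executed, this works without CadQuery installed, but it also counts selector calls such as `edges` and `faces`; restricting to a vocabulary filters those out.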