April 20, 2026

HY‑World 2.0: Open‑Source Multi‑Modal 3D World Generation & Reconstruction

HY‑World 2.0 is a multi‑modal world model released by Tencent that turns text, images, multi‑view photos, and video into realistic 3D scenes. It is advertised as the first open‑source, state‑of‑the‑art "image‑to‑3D" system, offering both world generation (from text or single‑view prompts) and world reconstruction (from multi‑view images or video). The current release includes WorldMirror 2.0, a Diffusers‑style Python pipeline that ingests a folder of images (or a video) and outputs editable 3DGS point clouds, meshes, depth maps, normal maps, and camera parameters. Upcoming modules such as HY‑Pano‑2, WorldNav, and WorldStereo 2.0 are slated to add full panorama generation and controllable camera‑driven world creation.

Key technical highlights include:

- A Diffusers‑like API (`WorldMirrorPipeline.from_pretrained('tencent/HY-World-2.0')`) for rapid prototyping.

- Support for prior injection (camera extrinsics, intrinsics, depth) to improve reconstruction quality.

- Benchmarks showing WorldMirror 2.0 surpasses prior versions on standard datasets (7‑Scenes, NRGBD, DTU) and WorldStereo 2.0 leads in camera‑control metrics.

- Compatibility with PyTorch 2.4, CUDA 12.4, and optional FlashAttention for performance.

- Gradio demo for interactive web‑based visualization of reconstruction results.
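The Diffusers‑like workflow above can be sketched as follows. The pipeline class and model ID come from the report's own `from_pretrained` example, but the module name, keyword arguments, and return fields are assumptions, so the actual model call is left as a commented placeholder and only the stdlib scaffolding around it is real:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def collect_views(image_dir: str) -> list[str]:
    """Gather and order the multi-view images that would feed the pipeline."""
    return sorted(
        str(p) for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_EXTS
    )

def reconstruct(image_dir: str) -> dict:
    """Outline of a WorldMirror 2.0 run; the pipeline call is a placeholder."""
    views = collect_views(image_dir)
    # Hypothetical invocation, mirroring the API named in this report:
    #   from hy_world import WorldMirrorPipeline   # module name is a guess
    #   pipe = WorldMirrorPipeline.from_pretrained('tencent/HY-World-2.0')
    #   result = pipe(images=views)                # argument name is a guess
    # The report lists these output modalities for the real pipeline:
    return {
        "views": views,
        "outputs": ["3dgs_point_cloud", "mesh", "depth", "normals", "cameras"],
    }
```

A real driver would replace the commented block with the pipeline call and serialize each returned modality; the folder-scanning and ordering step is the part most implementations share.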

The model is positioned as an open‑source alternative to closed‑source 3D generators, enabling researchers and developers to create editable 3D assets that can be imported directly into game engines (Unity, Unreal) or visualization tools.

Project Ideas

  1. Create a web service where users upload a single photo and receive a navigable 3D virtual tour generated by the single‑view (image‑to‑3D) pipeline.
  2. Develop a mobile app that records a short indoor video and uses WorldMirror 2.0 to reconstruct a detailed 3D mesh for AR interior design previews.
  3. Integrate WorldMirror 2.0 output into Unity to build an interactive game level populated with assets generated from textual descriptions.
  4. Build a batch processing pipeline that converts large image collections of historic sites into 3DGS models for cultural heritage preservation.
  5. Design a Gradio‑based UI that allows artists to upload reference images and inject custom camera/depth priors to fine‑tune reconstructed 3D assets.
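Project idea 4 (batch heritage processing) might start from a planning driver like the sketch below. Everything here is stdlib‑only scaffolding under assumed conventions (one folder per site, a `.3dgs` output per site); the per‑site reconstruction call is left as a hypothetical comment since this report does not show the real invocation:

```python
from pathlib import Path

MIN_VIEWS = 2  # assumption: multi-view reconstruction needs at least two images

def plan_batch(sites_root: str, out_root: str) -> dict[str, str]:
    """Map each site folder with enough images to a planned output path.

    Sites with fewer than MIN_VIEWS images are skipped, since multi-view
    reconstruction cannot run on a single photo alone.
    """
    plan = {}
    for site in sorted(Path(sites_root).iterdir()):
        if not site.is_dir():
            continue
        n_images = sum(
            1 for p in site.iterdir()
            if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
        )
        if n_images >= MIN_VIEWS:
            # A real driver would call the WorldMirror 2.0 pipeline here
            # and write the reconstructed 3DGS model to this output path.
            plan[site.name] = str(Path(out_root) / f"{site.name}.3dgs")
    return plan
```

Separating planning from execution like this makes it easy to resume interrupted batches: a run can diff the plan against outputs that already exist on disk before launching any GPU work.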