Dataset · March 18, 2026

BONES-SEED: Massive Multimodal Motion Dataset for Humanoid Robotics

BONES-SEED (Skeletal Everyday Embodiment Dataset) is an open collection of 142,220 annotated human motion captures designed for humanoid robotics research. Each motion is provided in three skeletal representations (SOMA Uniform, SOMA Proportional, and Unitree G1 MuJoCo-compatible CSV), covering roughly 288 hours of animation captured at 120 fps. Every entry includes up to six natural-language descriptions, technical biomechanical notes, short index labels, and detailed temporal segmentation, all stored in a 51-column Parquet metadata file.

The motions span eight top-level packages (e.g., Locomotion, Communication, Dances) and 20 fine-grained categories, with actors ranging from 17 to 71 years old and with diverse body measurements. The formats are ready for direct use with NVIDIA's SOMA toolkit and the Unitree G1 robot, enabling simulation-to-real transfer via MuJoCo-compatible joint trajectories. The dataset is licensed under a custom Bones-SEED license and requires completing gated-access request fields to ensure appropriate use.
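As a rough illustration of working with the G1 joint trajectories, the sketch below parses a CSV of per-frame joint positions and derives a clip duration from the dataset's stated 120 fps rate. The column names and layout (one row per frame, one column per joint) are assumptions for illustration; the report describes the files only as MuJoCo-compatible CSV.

```python
import io

import numpy as np

# Hypothetical excerpt of a G1 joint-trajectory CSV. The real column layout
# is not specified in the report; one row per 120 fps frame and one column
# per joint position is an assumption for illustration.
csv_text = """left_hip_pitch,left_knee,right_hip_pitch,right_knee
0.00,0.10,0.00,0.10
0.02,0.12,-0.01,0.11
0.04,0.15,-0.02,0.13
"""

# Parse into a structured array keyed by the header row.
frames = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)
num_frames = frames.shape[0]
duration_s = num_frames / 120.0  # motions are captured at 120 fps

print(num_frames, duration_s)  # → 3 0.025
```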

Intended applications highlighted by Bones Studio include language‑conditioned whole‑body control, text‑to‑motion generation, motion retrieval via natural language queries, imitation learning from diverse human demonstrations, and activity recognition through temporal segmentation. The rich multimodal annotations make it a valuable resource for training and evaluating models that map language to robot actions or that need large‑scale, high‑fidelity motion data.

The dataset can be accessed through the Hugging Face Hub, Git LFS, or the Hugging Face CLI, and metadata can be loaded directly with pandas. An interactive viewer and associated code are provided to explore motions visually, facilitating rapid prototyping for researchers and developers in robotics, computer vision, and AI.
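Since the metadata is a Parquet file loadable with pandas, a typical first step is filtering it by category or actor attributes. The sketch below uses a tiny stand-in DataFrame because the real schema's 51 column names are not listed in the report; `motion_id`, `category`, `actor_age`, and `description` are assumed names for illustration, and in practice one would start from `pd.read_parquet(...)` on the downloaded metadata file.

```python
import pandas as pd

# Stand-in for the 51-column metadata file; in practice this would be
# pd.read_parquet("<metadata file>") after downloading via the Hub,
# Git LFS, or the Hugging Face CLI. All column names and rows here are
# hypothetical, for illustration only.
meta = pd.DataFrame(
    {
        "motion_id": ["m_0001", "m_0002", "m_0003"],
        "category": ["Locomotion", "Dances", "Locomotion"],
        "actor_age": [24, 31, 58],
        "description": [
            "walks forward at a steady pace",
            "performs a short salsa step",
            "jogs then slows to a stop",
        ],
    }
)

# Example query: locomotion clips performed by actors under 40.
subset = meta[(meta["category"] == "Locomotion") & (meta["actor_age"] < 40)]
print(subset["motion_id"].tolist())  # → ['m_0001']
```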

Project Ideas

  1. Train a language‑conditioned policy for a Unitree G1 robot using the natural language descriptions and G1 joint trajectories.
  2. Build a natural‑language motion search engine that indexes the dataset by its short and natural descriptions for fast retrieval.
  3. Develop a text‑to‑motion generation model that outputs SOMA Uniform BVH files from user‑provided action prompts.
  4. Create an imitation‑learning pipeline that uses the temporal segmentation labels to teach a simulated humanoid to perform multi‑phase tasks.
  5. Design a visual analytics dashboard that visualizes actor biometrics, motion categories, and style variations across the dataset.
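As a minimal sketch of idea 2, the snippet below ranks motions by token overlap between a query and their natural-language descriptions. The index entries are hypothetical stand-ins for the dataset's descriptions, and a real search engine would likely use text embeddings rather than raw token overlap; this only shows the retrieval shape.

```python
def score(query: str, description: str) -> float:
    """Fraction of query tokens that appear in the description."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / len(q) if q else 0.0


def search(query: str, index: dict[str, str], top_k: int = 3) -> list[str]:
    """Return motion IDs ranked by token overlap with the query."""
    ranked = sorted(index, key=lambda mid: score(query, index[mid]), reverse=True)
    return ranked[:top_k]


# Hypothetical entries standing in for the dataset's short/natural descriptions.
index = {
    "m_0007": "person waves with the right hand while standing",
    "m_0042": "slow walk forward with arms swinging",
    "m_0311": "jumps twice then waves both hands",
}

print(search("waves right hand", index, top_k=1))  # → ['m_0007']
```

Swapping the scoring function for cosine similarity over sentence embeddings would upgrade this to semantic retrieval without changing the search loop.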