Dataset · March 25, 2026

OmniAction: A Massive Omni‑modal Dataset for Proactive Robot Manipulation

The OpenMOSS team has released **OmniAction**, a large‑scale multimodal dataset designed for contextual instruction following in robotic manipulation. Hosted on Hugging Face, the dataset contains 141,162 episodes covering 112 manipulation skills and 748 objects. It is enriched with over 5,000 distinct speaker timbres, 2,482 non‑verbal sound events, and 640 environmental background recordings, making it one of the most comprehensive resources for embodied robotics research.

OmniAction is formatted according to the Reinforcement Learning Datasets (RLDS) standard and includes synchronized audio, visual, and environmental sound streams. The episodes are organized into six categories of contextual instructions—sentiment cues, overlapping voices, non‑verbal cues, identity cues, dyadic dialogue, and triadic dialogue—capturing subtle affective signals and complex multi‑party interactions that occur in everyday settings. The dataset’s tags (robotics, any-to-any, audio-to-audio, omni, embodied) and the associated paper (arXiv:2510.23763, accepted to ICLR 2026) highlight its focus on omni‑modal robot intention recognition and proactive assistance.
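The RLDS layout and six instruction categories described above can be sketched as a minimal episode record. Note that every field and category identifier below is an illustrative assumption based on the description in this post, not the dataset's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class InstructionType(Enum):
    # The six contextual-instruction categories described above
    # (enum values are assumed names, not the dataset's labels).
    SENTIMENT_CUES = "sentiment_cues"
    OVERLAPPING_VOICES = "overlapping_voices"
    NONVERBAL_CUES = "nonverbal_cues"
    IDENTITY_CUES = "identity_cues"
    DYADIC_DIALOGUE = "dyadic_dialogue"
    TRIADIC_DIALOGUE = "triadic_dialogue"

@dataclass
class Step:
    """One timestep of an RLDS-style episode (field names are assumptions)."""
    rgb_frame: bytes        # encoded camera frame
    speech_audio: bytes     # synchronized speech waveform chunk
    env_audio: bytes        # environmental / background sound chunk
    action: List[float]     # robot action vector for this step

@dataclass
class Episode:
    """An episode grouping synchronized steps with instruction metadata."""
    instruction_type: InstructionType
    skill: str              # one of the 112 manipulation skills
    target_object: str      # one of the 748 objects
    steps: List[Step] = field(default_factory=list)

ep = Episode(InstructionType.DYADIC_DIALOGUE, "pick_place", "mug")
ep.steps.append(Step(b"", b"", b"", [0.0] * 7))
print(len(ep.steps), ep.instruction_type.value)
```

In a real RLDS dataset the steps would be stored as nested TensorFlow features rather than Python dataclasses; the sketch only mirrors the logical grouping of synchronized streams per episode.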

The dataset underpins the **RoboOmni** framework, a Perceiver‑Thinker‑Talker‑Executor architecture that fuses vision, speech, and environmental sounds to recognize user intent, confirm interactions, and execute actions. Experiments reported in the accompanying paper demonstrate that models trained on OmniAction outperform text‑only and ASR‑based baselines in success rate, inference speed, and proactive assistance, both in simulation and real‑world robot setups.
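The Perceiver‑Thinker‑Talker‑Executor flow can be illustrated as a staged pipeline. The stage interfaces, intent labels, and return values here are assumptions made for the sketch, not RoboOmni's actual implementation:

```python
# Minimal sketch of a Perceiver-Thinker-Talker-Executor pipeline.
# All class interfaces and intent names are illustrative assumptions.

class Perceiver:
    def fuse(self, vision, speech, env_sound):
        # Fuse the three input modalities into one shared context.
        return {"vision": vision, "speech": speech, "env": env_sound}

class Thinker:
    def infer_intent(self, context):
        # Infer a latent user intent from the fused context (stubbed
        # keyword check stands in for a learned model).
        return "hand_over_mug" if "mug" in context["speech"] else "idle"

class Talker:
    def confirm(self, intent):
        # Produce a spoken confirmation before acting.
        return f"Should I {intent.replace('_', ' ')}?"

class Executor:
    def act(self, intent):
        # Map the confirmed intent to a manipulation command.
        return {"skill": "pick_place", "intent": intent}

def run_pipeline(vision, speech, env_sound):
    ctx = Perceiver().fuse(vision, speech, env_sound)
    intent = Thinker().infer_intent(ctx)
    print(Talker().confirm(intent))
    return Executor().act(intent)

result = run_pipeline("frame_0", "could you pass the mug", "kettle_whistle")
```

The key design point the sketch preserves is that confirmation (Talker) sits between intent inference and execution, which is how a proactive assistant can act without an explicit command while still checking with the user.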

Distributed under a CC BY-NC 4.0 license, OmniAction is openly available for academic and research use. Its extensive multimodal recordings and rich instruction types provide a valuable benchmark for developing next‑generation robot manipulation systems that can anticipate and act on human intentions without explicit commands.

Project Ideas

  1. Train a multimodal intention‑recognition model that predicts robot actions from combined speech, environmental sounds, and visual frames using the OmniAction episodes.
  2. Create an interactive simulation environment that replays OmniAction episodes to evaluate robot policy performance under different contextual instruction types.
  3. Develop a data‑visualization dashboard that lets researchers explore the distribution of speakers, sound events, and background scenes across the 141k episodes.
  4. Benchmark existing vision‑language‑action pipelines against OmniAction to measure improvements in proactive assistance and inference speed.
  5. Implement a real‑time robot assistant prototype that consumes live audio‑visual input and uses a fine‑tuned model on OmniAction to generate proactive manipulation commands.