Open-RL: Verifiable STEM Reasoning Dataset for Outcome‑Supervised RL
The **Open-RL** dataset, released by **TuringEnterprises** on March 2, 2026, offers a compact collection (<1K entries) of self‑contained, verifiable STEM reasoning problems spanning physics, mathematics, chemistry, and biology. Each entry contains a `conversation_id`, the problem `domain` and `sub_domain`, a LaTeX‑supported `question`, and a deterministic `answer` expressed as a symbolic or numeric solution. The dataset is stored in JSON format and is licensed under the permissive MIT license.
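To make the schema concrete, here is a minimal illustrative entry with the fields listed above. The field names follow the description; the concrete values (ID, question, answer) are invented for illustration and are not taken from the dataset.

```python
import json

# Illustrative Open-RL-style entry; values are hypothetical examples,
# only the field names come from the dataset description.
entry = {
    "conversation_id": "phys-0001",   # hypothetical identifier
    "domain": "physics",
    "sub_domain": "kinematics",
    "question": r"A ball is dropped from rest. How far does it fall in "
                r"$t = 2\,\mathrm{s}$? (Use $g = 9.8\,\mathrm{m/s^2}$.)",
    "answer": "19.6",                 # deterministic numeric solution
}

# Entries are stored as JSON, so a record round-trips cleanly.
print(json.dumps(entry, indent=2))
```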
Designed explicitly for **question‑answering** tasks, Open‑RL emphasizes multi‑step reasoning, symbolic manipulation, and objective verifiability. Answers are deterministic, enabling automatic grading via exact string matching, symbolic equivalence (e.g., SymPy), or numerical tolerance checks. Data quality is ensured by double review from PhD experts, who verify scientific accuracy and a clear separation between reasoning steps and final outcomes.
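A minimal grader along these lines might chain the cheaper checks first: exact string match, then numerical comparison within a tolerance. This is a sketch, not the dataset's official evaluation harness; a symbolic stage (e.g., checking that `sympy.simplify(pred - ref) == 0`) could be added as a third fallback.

```python
import math

def grade(prediction: str, reference: str, rel_tol: float = 1e-6) -> bool:
    """Grade a model's final answer against the reference answer.

    Stage 1: exact string match after whitespace stripping.
    Stage 2: numerical comparison within a relative tolerance.
    (A symbolic-equivalence stage via SymPy could follow.)
    """
    pred, ref = prediction.strip(), reference.strip()
    if pred == ref:
        return True
    try:
        return math.isclose(float(pred), float(ref), rel_tol=rel_tol)
    except ValueError:
        # Not parseable as numbers and not identical strings.
        return False
```

Exact matching alone would reject `"19.60"` against `"19.6"`; the numeric stage accepts it.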
Open‑RL targets **reinforcement learning (RL) and outcome‑supervised training** of large language models. Because rewards can be computed directly from the correctness of the final answer (e.g., +1 for a match, 0 or –1 otherwise), the dataset supports policy‑gradient methods such as PPO, reward‑model bootstrapping, and curriculum‑style training without relying on subjective preference signals. The README also outlines future extensions involving structured “nudges” to help models learn from otherwise unsolvable problems, positioning the dataset as a testbed for upper‑bound expansion in reasoning models.
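The outcome-based reward described above reduces to a small function of final-answer correctness. The sketch below uses the +1/–1 variant mentioned in the text (0 is the other option for a mismatch); the grading predicate is assumed to be whatever answer checker the pipeline uses.

```python
def outcome_reward(prediction: str, reference: str) -> float:
    """Scalar reward computed directly from final-answer correctness:
    +1 for a match, -1 otherwise. Swap -1 for 0 if a non-negative
    reward scheme is preferred. Exact match stands in here for a
    fuller grader (symbolic or tolerance-based)."""
    return 1.0 if prediction.strip() == reference.strip() else -1.0
```

In a PPO loop this value would be assigned to the terminal token of each sampled completion, with no learned reward model required.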
With its modest size, clear evaluation protocol, and focus on deterministic STEM problems, Open‑RL is well‑suited for researchers looking to benchmark reasoning capabilities, develop RL pipelines, or explore outcome‑based reward modeling for scientific question answering.
Project Ideas
- Fine‑tune a large language model with PPO using Open‑RL as the reward source, assigning +1 for exact answer matches.
- Create an automated evaluation suite that grades model outputs via symbolic equivalence checking with SymPy on the dataset.
- Benchmark a range of open‑source LLMs on Open‑RL to compare their multi‑step STEM reasoning performance.
- Design a curriculum‑learning pipeline that starts with easier sub‑domains (e.g., algebra) and progressively adds harder physics questions.
- Implement a NuRL‑style hint generator that provides abstract nudges for problems where the base model receives zero reward.
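The curriculum‑learning idea above can be sketched as a staged sampler that starts from the easiest sub‑domains and widens the pool each stage. The difficulty ranking below is hypothetical (a real pipeline would derive it from solve rates or expert labels), and entries are assumed to carry the `sub_domain` field from the dataset schema.

```python
import random

# Hypothetical easy-to-hard ordering over sub-domains; unknown
# sub-domains sort last.
DIFFICULTY = {"algebra": 0, "calculus": 1, "kinematics": 2, "electromagnetism": 3}

def curriculum_batches(entries, stages=4):
    """Yield one training pool per stage, easiest sub-domains first.

    Each stage widens the pool with the next slice of harder
    problems, so earlier material stays in rotation.
    """
    ranked = sorted(entries, key=lambda e: DIFFICULTY.get(e["sub_domain"], 99))
    step = max(1, len(ranked) // stages)
    for stage in range(1, stages + 1):
        pool = ranked[: min(stage * step, len(ranked))]  # slice copies the list
        random.shuffle(pool)
        yield pool
```

The final stage always covers the full dataset, so the schedule degrades gracefully if the difficulty map is incomplete.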