dataset May 05, 2026

TaskTrove: 750K Agentic Tasks for RL & SFT Evaluation

TaskTrove is an open‑source collection of more than 750,000 unique agentic tasks gathered from over 100 public sources. Released by the OpenThoughts‑Agent team, the dataset aggregates popular task sources for reinforcement learning (RL) and supervised fine‑tuning (SFT), such as SWE‑Smith, R2EGym, and SWE‑Re‑Bench. The collection is stored as Parquet shards, falls in the 100K‑1M size category, and is licensed under Apache‑2.0.

Each source dataset is kept in its original form inside a subdirectory named `org__name/`, preserving raw files, READMEs, and metadata without transformation. All tasks are valid Harbor task binaries; Harbor is the open‑source evaluation and data‑generation framework used by OpenThoughts‑Agent. Tasks fall into two groups: those with a `verifier` field, meant for RL training and evaluation, and those without a verifier, intended for SFT or external grading. This structure lets researchers plug the tasks directly into Harbor's RL/eval pipelines or use them as instruction sets for language‑model fine‑tuning.
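The two‑group split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats each task as a plain dictionary, assumes that a missing or empty `verifier` value means "no verifier", and uses hypothetical `task_id` and `instruction` fields that may not match the real TaskTrove schema.

```python
from typing import Any


def split_by_verifier(
    tasks: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Partition tasks into (with_verifier, without_verifier)."""
    with_verifier: list[dict[str, Any]] = []
    without_verifier: list[dict[str, Any]] = []
    for task in tasks:
        # Assumption: a missing, None, or empty `verifier` means "no verifier".
        if task.get("verifier"):
            with_verifier.append(task)
        else:
            without_verifier.append(task)
    return with_verifier, without_verifier


# Hypothetical records; real TaskTrove rows may use different field names.
sample_tasks = [
    {"task_id": "a", "instruction": "Fix the failing test", "verifier": "pytest -q"},
    {"task_id": "b", "instruction": "Write a bash loop"},
    {"task_id": "c", "instruction": "Rename a file", "verifier": None},
]

rl_tasks, sft_tasks = split_by_verifier(sample_tasks)
print([t["task_id"] for t in rl_tasks])   # ['a']
print([t["task_id"] for t in sft_tasks])  # ['b', 'c']
```

The first list would feed RL training or evaluation, and the second would serve as SFT instructions or be graded externally.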

TaskTrove serves as the task counterpart to the AgentTrove trace dataset, whose agent trajectories are generated by running models against these task binaries. The README provides concrete commands for extracting individual JSON tasks from the Parquet files and for launching Harbor datagen jobs with vLLM‑served models. By linking task binaries to trace generation, TaskTrove enables large‑scale benchmarking, curriculum creation, and synthetic trace production for emerging LLMs.
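The README's own extraction commands are not reproduced here. As a rough, standard‑library‑only sketch of the idea, the snippet below writes already‑loaded rows to one JSON file per task. The `dump_tasks` helper, the `task_id` key, and the sample rows are all hypothetical, and the sketch assumes every row value is JSON‑serializable and that identifiers are safe to use as file names.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def dump_tasks(
    rows: list[dict[str, Any]], out_dir: Path, id_key: str = "task_id"
) -> list[Path]:
    """Write each row to <out_dir>/<id>.json and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for index, row in enumerate(rows):
        # Fall back to the row index when the (hypothetical) id key is absent.
        name = str(row.get(id_key, index))
        path = out_dir / f"{name}.json"
        path.write_text(json.dumps(row, indent=2, sort_keys=True), encoding="utf-8")
        written.append(path)
    return written


# Stand-in rows. With pyarrow installed, rows could instead be loaded with
# pyarrow.parquet.read_table(shard_path).to_pylist() (not executed here).
rows = [
    {"task_id": "demo-1", "instruction": "List files"},
    {"instruction": "No id here"},
]

with tempfile.TemporaryDirectory() as tmp:
    paths = dump_tasks(rows, Path(tmp) / "tasks")
    names = sorted(p.name for p in paths)
    reloaded = json.loads(paths[0].read_text(encoding="utf-8"))

print(names)  # ['1.json', 'demo-1.json']
```

For real shards, the README's documented commands remain the authoritative route; this only shows the shape of a row‑to‑file export.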

Project Ideas

  1. Fine‑tune a code‑generation model on the non‑verifier tasks (e.g., bash_textbook_tasks) to improve scripting abilities.
  2. Run a reinforcement‑learning loop using the verifier‑enabled tasks (e.g., swe_rebench_patched) to train an agent with Harbor's RL pipeline.
  3. Benchmark multiple LLMs on SWE‑Smith tasks by extracting them and measuring verifier scores for each model's output.
  4. Create a curriculum learning schedule that gradually introduces harder verification tasks from TaskTrove to a base model.
  5. Generate a new trace dataset by extracting tasks and running a custom vLLM model through Harbor, then publish the traces for community use.
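Ideas 3 and 4 can share one bookkeeping step: aggregate verifier outcomes per model for benchmarking, then reuse per‑task pass rates as a difficulty proxy for a curriculum. The sketch below assumes outcomes are already available as hypothetical `(model, task_id, passed)` tuples; it does not call Harbor, and using pass rate as difficulty is an illustrative choice rather than anything the dataset prescribes.

```python
from collections import defaultdict


def _mean(flags: list[bool]) -> float:
    return sum(flags) / len(flags)


def pass_rates(
    results: list[tuple[str, str, bool]],
) -> tuple[dict[str, float], dict[str, float]]:
    """Return (per-model pass rate, per-task pass rate)."""
    by_model: dict[str, list[bool]] = defaultdict(list)
    by_task: dict[str, list[bool]] = defaultdict(list)
    for model, task_id, passed in results:
        by_model[model].append(passed)
        by_task[task_id].append(passed)
    return (
        {m: _mean(f) for m, f in by_model.items()},
        {t: _mean(f) for t, f in by_task.items()},
    )


def curriculum(task_rates: dict[str, float], n_stages: int) -> list[list[str]]:
    """Order tasks from highest to lowest pass rate and cut them into stages."""
    if not task_rates or n_stages < 1:
        return []
    # Ties are broken by task id so the ordering is deterministic.
    ordered = sorted(task_rates, key=lambda t: (-task_rates[t], t))
    size = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical verifier outcomes for two models on three tasks.
results = [
    ("model-a", "t1", True), ("model-a", "t2", False), ("model-a", "t3", False),
    ("model-b", "t1", True), ("model-b", "t2", True), ("model-b", "t3", False),
]

model_rates, task_rates = pass_rates(results)
stages = curriculum(task_rates, n_stages=2)
print(stages)  # [['t1', 't2'], ['t3']]
```

Here `model_rates` gives the benchmark comparison, while `stages` lists tasks from easiest to hardest for staged training.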