Classhopper Set

Fine-Tuning GPT OSS 120B for Coding Tasks Using the HUD Reinforcement Learning Platform

Reinforcement Learning · Code Generation · HUD Platform · GRPO

We used HUD's RL platform to fine-tune GPT OSS 120B on 100 real-world bug-fixing tasks built from a production codebase. Training with GRPO raised Best@10 by 13 percentage points (73.9% → 86.9%) and cut the average number of steps per task from 26.2 to 22.2.

1. Environment Creation

We started with Classhopper, one of our older production apps. It's an Airbnb-style service for discovering and booking classes near you. The codebase was pre-AI era: functional, deployed, but messy. Perfect for realistic coding challenges.

We merged frontend and backend into a single monorepo, verified builds and tests passed, and confirmed it was stable enough for automated evaluation.

We then connected it using HUD's coding environment template. The template handles the heavy lifting of environment creation — it ships with a Dockerfile, a grading harness, and a task runner, so you don't need to build any of that infrastructure yourself. Out of the box, it gives the agent two built-in tools: a bash tool for running shell commands, and an editor tool for viewing, creating, and editing files. These are the only tools the agent gets: no web access, no special APIs, just a terminal and a file editor — the same primitives a human developer would use.

To connect our codebase, we forked the template and set the REPO_URL build argument in the Dockerfile to point at our Classhopper monorepo. The template clones this repo into the container at build time and wires it into the task runner automatically.
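
As a rough sketch of that build step, the REPO_URL argument could be supplied through Docker's standard --build-arg flag. The helper name, image tag, and repository URL below are hypothetical placeholders, and HUD's own tooling may wrap this step differently.

```python
from typing import List


def docker_build_command(image_tag: str, repo_url: str, context_dir: str = ".") -> List[str]:
    """Build the argv for `docker build` with REPO_URL passed as a build argument.

    REPO_URL is the build argument named in the template's Dockerfile;
    image_tag and repo_url are illustrative values chosen by the caller.
    """
    return [
        "docker", "build",
        "--build-arg", f"REPO_URL={repo_url}",
        "-t", image_tag,
        context_dir,
    ]


if __name__ == "__main__":
    # Placeholder URL: replace with the location of your own monorepo.
    cmd = docker_build_command("classhopper-env", "https://example.com/your-org/classhopper.git")
    print(" ".join(cmd))
```

Returning the command as a list instead of a shell string keeps it safe to hand to subprocess.run without shell quoting issues.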

2. Task Creation

Branch Structure

HUD uses a three-branch structure per task ("scenario"), based on their coding template:

  • Baseline: The buggy code the agent starts with.
  • Test: Tests that check the fix and catch regressions.
  • Golden: The correct working code. We mostly kept the original source, cleaning it up where needed.
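
One way to picture a scenario is as a record of three branch names. The naming convention here (task id plus a suffix) is an assumption for illustration, not necessarily what the HUD template uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One task, described by the three branches it is built from."""
    task_id: str
    baseline_branch: str  # buggy code the agent starts with
    test_branch: str      # tests that check the fix and catch regressions
    golden_branch: str    # correct working code


def make_scenario(task_id: str) -> Scenario:
    # Hypothetical convention: "<task_id>_baseline", "<task_id>_test", "<task_id>_golden".
    return Scenario(
        task_id=task_id,
        baseline_branch=f"{task_id}_baseline",
        test_branch=f"{task_id}_test",
        golden_branch=f"{task_id}_golden",
    )


if __name__ == "__main__":
    print(make_scenario("course-visibility-toggle"))
```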

Bug Design

We built 25 initial bugs across frontend-only, backend-only, and cross-stack categories at easy, medium, and hard difficulties, then injected them into the codebase.

Example bug prompt

"You will be working on a task for project. The repository has already been cloned in /home/ubuntu/project.

Use the tools provided to complete the following task:

Fix the course visibility toggle bug in the Classhopper backend.

The "make all courses visible" endpoint for instructors does the opposite of what it should. After calling PUT /instructors/{id}/courses/visible, all courses become hidden instead of visible.

You MUST edit the relevant file(s) to fix the bug. Do not just describe the fix."

Test Design & Reward Signal

Each bug got tests for the fix itself plus adjacent areas to catch regressions. Reward is binary: pass all tests or fail the task. No partial credit. This keeps the training signal unambiguous, since a patch that fixes the bug but breaks a neighboring test earns the same reward as no patch at all.
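
The reward rule is simple enough to sketch in a few lines. The function name and the shape of the test-results mapping are assumptions for illustration; the grading harness in the template produces its own result format.

```python
from typing import Mapping


def binary_reward(test_results: Mapping[str, bool]) -> float:
    """Return 1.0 only if every test passed, otherwise 0.0.

    test_results maps a test name to whether it passed. An empty mapping
    is treated as a failure so a misconfigured task cannot earn reward.
    """
    if not test_results:
        return 0.0
    return 1.0 if all(test_results.values()) else 0.0


if __name__ == "__main__":
    print(binary_reward({"test_visible_endpoint": True, "test_hidden_endpoint": True}))
    print(binary_reward({"test_visible_endpoint": True, "test_hidden_endpoint": False}))
```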

3. Task Validation

Early mistake: we jumped straight to running agents without verifying that the HUD build config was correct. Tasks that look fine locally can silently break when the config doesn't apply the tests properly.

The validation command fixed this:

    uv run imagectl4.py <img-name> -v --ids <task-1> <task-2>

It checks that everything builds, that tests apply to the right branches, that the tests fail on the baseline branch (the bug exists), and that they pass on the golden branch (the fix works). After running validation across all tasks, we could trust that any agent failure was a real performance issue, not a config bug.
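
The four checks can be summarized as a small predicate. This is a sketch of the logic only; the class and field names are hypothetical and do not reflect how imagectl4.py is implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidationResult:
    """Hypothetical outcome of validating one task."""
    task_id: str
    builds: bool                # image builds successfully
    tests_applied: bool         # tests landed on the intended branches
    baseline_tests_pass: bool   # should be False: the bug must be detectable
    golden_tests_pass: bool     # should be True: the reference fix must work


def problems(result: ValidationResult) -> List[str]:
    """List every reason the task is not safe to hand to an agent."""
    found = []
    if not result.builds:
        found.append("build failed")
    if not result.tests_applied:
        found.append("tests were not applied to the right branches")
    if result.baseline_tests_pass:
        found.append("tests pass on baseline, so the bug is not detected")
    if not result.golden_tests_pass:
        found.append("tests fail on golden, so the reference fix is rejected")
    return found


def is_valid(result: ValidationResult) -> bool:
    return not problems(result)


if __name__ == "__main__":
    bad = ValidationResult("task-1", True, True, True, True)
    print(bad.task_id, problems(bad))
```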

4. Model Training

Initial Evaluation

With 25 validated tasks, we created a taskset on HUD and ran it as a batch against the base GPT OSS 120B. After fixing a few config issues, we got a solid distribution of success rates: some tasks the model solved easily, some it struggled with, some it couldn't crack.

Scaling to 100 Tasks

We talked to the HUD team about what makes a good training set. Based on their guidance, we built 75 more tasks, keeping diversity across stack types and difficulties while targeting a similar success distribution.

Final dataset: 100 validated tasks.

Difficulty   Frontend Only   Backend Only   Cross-Stack   Total
Easy         13              14             6             33
Medium       12              16             8             36
Hard         10              14             7             31
Total        35              44             21            100
Table 1: Task distribution by stack type and difficulty

Training Run

We ran GRPO training on a HUD fork of GPT OSS 120B, investing ~10 hours and 600 credits over 20 training steps. The model's policy showed a clear shift, with pass rates climbing steadily throughout the run. While you might notice performance dips at checkpoints #5 and #11, these weren't regressions — they were simply the result of a more difficult distribution of tasks in those specific evaluations. Overall, the trajectory remained strong.
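
For readers unfamiliar with GRPO, its core idea as commonly described is to score each rollout relative to the other rollouts sampled for the same task. The sketch below assumes that standard formulation (reward minus the group mean, divided by the group standard deviation); the function name and epsilon value are illustrative, and this is not a description of HUD's internal implementation.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one task's rollouts against their own group.

    rewards holds one reward per rollout of the same task, for example the
    binary pass/fail reward. eps avoids division by zero when every
    rollout received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Mixed outcomes: passing rollouts get positive advantages, failing ones negative.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Uniform outcomes: every advantage is zero, so the task contributes no gradient.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Under this formulation, a task that every rollout solves or every rollout fails yields all-zero advantages. That is one reason a training set with a spread of success rates tends to be more useful than one dominated by trivial or impossible tasks.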

5. Results

Evaluation Setup

We benchmarked this newly trained model on 50 unseen tasks: 25 new Classhopper tasks and 25 from ScheduleHero, a completely separate app. The out-of-domain eval on ScheduleHero was key to confirming the model gained real coding skill, not just Classhopper memorization.

Performance

Consistent improvement across every metric:

Metric                         Base GPT OSS 120B   Trained Model   Improvement
Average Pass Rate              53.9%               60.7%           +6.8%
Best@3                         68.2%               77.9%           +9.7%
Best@5                         70.8%               82.9%           +12.1%
Best@10                        73.9%               86.9%           +13.0%
Pass@1                         80.0%               88.0%           +8.0%
ScheduleHero (out-of-domain)   14%                 22%             +8.0%
Avg Steps                      26.2                22.2            -4.0 steps
Table 2: Benchmark results on classhopper-benchv1 (25 held-out tasks)
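
The Improvement column reports absolute differences in percentage points, not relative change. A short sketch of that arithmetic, using only the values from Table 2 (the helper names are illustrative):

```python
# (base, trained) pass rates in percent, copied from Table 2.
RESULTS = {
    "Average Pass Rate": (53.9, 60.7),
    "Best@3": (68.2, 77.9),
    "Best@5": (70.8, 82.9),
    "Best@10": (73.9, 86.9),
    "Pass@1": (80.0, 88.0),
    "ScheduleHero (out-of-domain)": (14.0, 22.0),
}


def percentage_point_gain(base: float, trained: float) -> float:
    """Absolute difference between two percentages, as shown in the table."""
    return round(trained - base, 1)


def relative_gain(base: float, trained: float) -> float:
    """Change expressed as a percentage of the base value."""
    return round(100.0 * (trained - base) / base, 1)


if __name__ == "__main__":
    for metric, (base, trained) in RESULTS.items():
        pp = percentage_point_gain(base, trained)
        rel = relative_gain(base, trained)
        print(f"{metric}: +{pp} points ({rel}% relative)")
```

The distinction matters most for the out-of-domain row, where the same 8-point gain starts from a much lower base than Pass@1.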

Key Takeaways

  • +6.8% average pass rate (53.9% → 60.7%). More tasks solved per run.
  • +12.1% Best@5 (70.8% → 82.9%). More tasks solved when the model gets multiple attempts.
  • +8.0% Pass@1 (80.0% → 88.0%). Better first-attempt reliability.
  • 4 fewer steps on average (26.2 → 22.2). Not just more accurate, but more efficient.
  • +8.0% on ScheduleHero (14% → 22%), an unseen codebase with significantly harder tasks.