Classhopper Set
Fine-Tuning GPT OSS 120B for Coding Tasks Using the HUD Reinforcement Learning Platform
We used HUD's RL platform to fine-tune GPT OSS 120B on 100 real-world bug-fixing tasks built from a production codebase. Training with GRPO raised Best@10 by 13 percentage points (73.9% → 86.9%) and cut the average number of steps per task from 26.2 to 22.2.
1. Environment Creation
We started with Classhopper, one of our older production apps. It's an Airbnb-style service for discovering and booking classes near you. The codebase was pre-AI era: functional, deployed, but messy. Perfect for realistic coding challenges.
We merged frontend and backend into a single monorepo, verified builds and tests passed, and confirmed it was stable enough for automated evaluation.
We then connected it using HUD's coding environment template. The template handles the heavy lifting of environment creation — it ships with a Dockerfile, a grading harness, and a task runner, so you don't need to build any of that infrastructure yourself. Out of the box, it gives the agent two built-in tools: a bash tool for running shell commands, and an editor tool for viewing, creating, and editing files. These are the only tools the agent gets: no web access, no special APIs, just a terminal and a file editor — the same primitives a human developer would use.
To connect our codebase, we forked the template and set the REPO_URL build argument in the Dockerfile to point at our Classhopper monorepo. The template clones this repo into the container at build time and wires it into the task runner automatically.
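As a rough illustration of the build step, the sketch below assembles a `docker build` command that supplies REPO_URL as a build argument. It assumes the template's Dockerfile declares `ARG REPO_URL`, so overriding it on the command line has the same effect as editing the default in the Dockerfile. The repository URL and image tag are placeholders, not values from our setup.

```python
import shlex


def docker_build_command(repo_url: str, image_tag: str, context_dir: str = ".") -> list[str]:
    """Assemble a `docker build` invocation that passes REPO_URL as a build argument."""
    return [
        "docker", "build",
        "--build-arg", f"REPO_URL={repo_url}",  # assumes the Dockerfile declares ARG REPO_URL
        "-t", image_tag,
        context_dir,
    ]


cmd = docker_build_command(
    repo_url="https://example.com/your-org/classhopper-monorepo.git",  # placeholder URL
    image_tag="classhopper-env:dev",  # hypothetical tag
)

# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without Docker installed.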
2. Task Creation
Branch Structure
HUD uses a three-branch structure per task ("scenario"), based on their coding template:
- Baseline: The buggy code the agent starts with.
- Test: Tests that check the fix and catch regressions.
- Golden: The correct working code. We mostly kept the original source, cleaning it up where needed.
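One way to keep the three branches of a scenario together is a small record type, sketched below. The `Scenario` class and the `<task-id>_baseline` / `_test` / `_golden` naming scheme are hypothetical conventions chosen for illustration; the names HUD's template actually expects may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """The three git branches that define one task (hypothetical structure)."""
    task_id: str
    baseline_branch: str  # buggy code the agent starts from
    test_branch: str      # tests that check the fix and catch regressions
    golden_branch: str    # correct working code


def make_scenario(task_id: str) -> Scenario:
    """Derive branch names from a task id using an assumed naming convention."""
    return Scenario(
        task_id=task_id,
        baseline_branch=f"{task_id}_baseline",
        test_branch=f"{task_id}_test",
        golden_branch=f"{task_id}_golden",
    )


scenario = make_scenario("course_visibility_toggle")
print(scenario)
```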
Bug Design
We built 25 initial bugs across frontend-only, backend-only, and cross-stack categories at easy, medium, and hard difficulties. We then injected these bugs into the codebase and wrote a prompt for each one. An example prompt:
"You will be working on a task for project. The repository has already been cloned in /home/ubuntu/project.
Use the tools provided to complete the following task:
Fix the course visibility toggle bug in the Classhopper backend.
The "make all courses visible" endpoint for instructors does the opposite of what it should. After calling PUT /instructors/{id}/courses/visible, all courses become hidden instead of visible.
You MUST edit the relevant file(s) to fix the bug. Do not just describe the fix."
Test Design & Reward Signal
Each bug got tests for the fix itself plus adjacent areas to catch regressions. Reward is binary: pass all tests or fail the task. No partial credit. We chose a binary reward because it leaves no room for partial fixes: the model earns credit only when the bug is fully fixed and nothing nearby breaks.
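The reward rule is simple enough to write down directly. The function below is a minimal sketch, not the grading harness itself: `binary_reward` and the shape of `test_results` (test name mapped to pass/fail) are assumptions made for illustration.

```python
def binary_reward(test_results: dict[str, bool]) -> float:
    """Return 1.0 only if every test passed, otherwise 0.0.

    An empty result set is treated as a failure so that a run where the
    tests never executed cannot be rewarded by accident.
    """
    if not test_results:
        return 0.0
    return 1.0 if all(test_results.values()) else 0.0


# Fix works, but a regression test in an adjacent area fails: no reward.
partial_fix = {"test_courses_become_visible": True, "test_hidden_courses_unchanged": False}
full_fix = {"test_courses_become_visible": True, "test_hidden_courses_unchanged": True}

print(binary_reward(partial_fix))
print(binary_reward(full_fix))
```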
3. Task Validation
Early mistake: we jumped straight to running agents without verifying the HUD build config was correct. Tasks that look fine locally can silently break when config doesn't apply tests properly.
Running uv run imagectl4.py <img-name> -v --ids <task-1> <task-2> fixed this. It checks that everything builds, that the tests apply to the right branches, that the tests fail on the baseline (the bug exists), and that they pass on the golden branch (the fix works). After running validation across all tasks, we could trust that any agent failure was a real performance issue, not a config bug.
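The invariant being validated can be sketched in a few lines. This is not how imagectl4.py is implemented; `TaskCheck` and `validation_errors` are hypothetical names that only restate the checks described above.

```python
from dataclasses import dataclass


@dataclass
class TaskCheck:
    """Outcome of building one task and running its tests on both branches."""
    task_id: str
    builds: bool
    baseline_tests_pass: bool  # expected False: the bug must be detectable
    golden_tests_pass: bool    # expected True: the reference fix must work


def validation_errors(check: TaskCheck) -> list[str]:
    """List every reason a task is not safe to use for evaluation or training."""
    errors = []
    if not check.builds:
        errors.append("image does not build")
    if check.baseline_tests_pass:
        errors.append("tests pass on baseline, so they do not detect the bug")
    if not check.golden_tests_pass:
        errors.append("tests fail on golden, so the task is unsolvable as configured")
    return errors


good = TaskCheck("task-1", builds=True, baseline_tests_pass=False, golden_tests_pass=True)
broken = TaskCheck("task-2", builds=True, baseline_tests_pass=True, golden_tests_pass=True)

for check in (good, broken):
    print(check.task_id, validation_errors(check) or "ok")
```

A task where the tests pass on the baseline is the dangerous case: every agent run would be scored as a success even if the agent changed nothing.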
4. Model Training
Initial Evaluation
With 25 validated tasks, we created a taskset on HUD and batch ran it against the base GPT OSS 120B. After fixing a few config issues, we got a solid distribution of success rates: some tasks the model solved easily, some it struggled with, some it couldn't crack.
Scaling to 100 Tasks
We talked to the HUD team about what makes a good training set. Based on their guidance, we built 75 more tasks, keeping diversity across stack types and difficulties while targeting a similar success distribution.
Final dataset: 100 validated tasks.
| Difficulty | Frontend Only | Backend Only | Cross-Stack | Total |
|---|---|---|---|---|
| Easy | 13 | 14 | 6 | 33 |
| Medium | 12 | 16 | 8 | 36 |
| Hard | 10 | 14 | 7 | 31 |
| Total | 35 | 44 | 21 | 100 |
Training Run
We ran GRPO training on a HUD fork of GPT OSS 120B, investing ~10 hours and 600 credits over 20 training steps. The model's policy showed a clear shift, with pass rates climbing throughout the run. The performance dips at checkpoints #5 and #11 weren't regressions; those evaluations happened to draw a more difficult distribution of tasks. Overall, the trajectory remained strong.
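For readers new to GRPO, the core idea is that each rollout is scored relative to the other rollouts for the same task. The sketch below shows the commonly described group-relative advantage (reward minus the group mean, divided by the group standard deviation). It is an illustration of the general technique, not HUD's training code, and the `eps` value and use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])   # task solved in some rollouts
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # task always solved
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # task never solved

print(mixed)
print(all_pass)
print(all_fail)
```

With binary rewards, a task the model always solves or never solves produces all-zero advantages and contributes no gradient under this formulation. That is one reason a spread of success rates across tasks matters when building the training set.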
5. Results
Evaluation Setup
We benchmarked this newly trained model on 50 unseen tasks: 25 new Classhopper tasks and 25 from ScheduleHero, a completely separate app. The out-of-domain eval on ScheduleHero was key to confirming the model gained real coding skill, not just Classhopper memorization.
Performance
Consistent improvement across every metric:
| Metric | Base GPT OSS 120B | Trained Model | Improvement |
|---|---|---|---|
| Average Pass Rate | 53.9% | 60.7% | +6.8 pp |
| Best@3 | 68.2% | 77.9% | +9.7 pp |
| Best@5 | 70.8% | 82.9% | +12.1 pp |
| Best@10 | 73.9% | 86.9% | +13.0 pp |
| Pass@1 | 80.0% | 88.0% | +8.0 pp |
| ScheduleHero (out-of-domain) | 14% | 22% | +8.0 pp |
| Avg Steps | 26.2 | 22.2 | -4.0 steps |
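The post does not spell out how each metric is computed, so the sketch below shows one common reading rather than the exact evaluation code: average pass rate as the fraction of all runs that pass, and Best@k as the fraction of tasks solved in at least one of the first k runs. The function names and the toy `runs` data are made up for illustration.

```python
def average_pass_rate(runs: dict[str, list[bool]]) -> float:
    """Fraction of all individual runs that passed, pooled across tasks."""
    total = sum(len(results) for results in runs.values())
    passed = sum(sum(results) for results in runs.values())
    return passed / total


def best_at_k(runs: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k runs (assumed definition)."""
    solved = sum(1 for results in runs.values() if any(results[:k]))
    return solved / len(runs)


# Toy data: task id -> pass/fail for each of three runs.
runs = {
    "task-a": [True, False, False],
    "task-b": [False, False, True],
    "task-c": [False, False, False],
    "task-d": [True, True, True],
}

print(average_pass_rate(runs))
print(best_at_k(runs, 1))
print(best_at_k(runs, 3))
```

Under this reading Best@k can only stay flat or rise as k grows, which matches the Best@3, Best@5, and Best@10 rows in the table.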
Key Takeaways
- +6.8 points average pass rate (53.9% → 60.7%). More tasks solved per run.
- +12.1 points Best@5 (70.8% → 82.9%). Way more tasks solvable given multiple attempts.
- +8.0 points Pass@1 (80.0% → 88.0%). Better first-attempt reliability.
- 4 fewer steps on average (26.2 → 22.2). Not just more accurate, but more efficient.
- +8.0 points on ScheduleHero (14% → 22%), an unseen codebase with significantly harder tasks.
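The gains above are absolute differences between two percentages. For readers who prefer relative change, the short calculation below derives both from the figures in the results table. The helper name is ours, and the relative figures are derived here rather than reported results.

```python
def absolute_and_relative_change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change, relative change in percent), each rounded to one decimal."""
    absolute = after - before
    relative = absolute / before * 100.0
    return round(absolute, 1), round(relative, 1)


# (before, after) pairs copied from the results table.
reported = {
    "Average Pass Rate": (53.9, 60.7),
    "Best@10": (73.9, 86.9),
    "ScheduleHero": (14.0, 22.0),
    "Avg Steps": (26.2, 22.2),
}

for name, (before, after) in reported.items():
    abs_change, rel_change = absolute_and_relative_change(before, after)
    print(f"{name}: {abs_change:+.1f} absolute, {rel_change:+.1f}% relative")
```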
