DPO regressed our model. Here's what happened.
We trained DPO on top of a strong 87% A-rate SFT model and the result dropped to 78%. Full postmortem with logs, not marketing.
Direct Preference Optimization is supposed to make your model better at following human preferences. For us, it made it worse — dropping a strong SFT baseline from 87% A-rate to 78%. We rolled the DPO version back. Here's what went wrong.
Short version
Our DPO preference pairs weren't diverse enough. We generated them by having the model produce multiple candidate answers and auto-labeling which one was "chosen" and which "rejected". The auto-labeler was itself an LLM with systematic biases — which we then trained into the model.
What we did
- Trained SFT v6 on 2,500 curated records. Result: 87% A-rate on our public benchmark. Promoted to sovereign-v2:latest.
- Generated 260 DPO preference pairs by having the model produce 4 candidates per question, then scoring them with an auto-grader.
- Trained DPO on top of the v6 SFT weights (LoRA rank 16, lr 5e-7, beta 0.1, max_length 512).
- Re-ran the benchmark.
Result: 78% A-rate on the DPO model. Five categories that were A on v6 dropped to B or C.
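Concretely, pair construction looked roughly like this. A minimal sketch: the question, candidates, and the stand-in `grade=len` scorer are illustrative (our real scorer was an LLM auto-grader), but the best-vs-worst pairing logic is the shape of what we did:

```python
def build_pair(question, candidates, grade):
    """Pick (chosen, rejected) as the best/worst of the graded candidates."""
    scored = sorted(candidates, key=grade, reverse=True)
    return {"prompt": question, "chosen": scored[0], "rejected": scored[-1]}

pair = build_pair(
    "What is the capital of France?",
    ["Paris.", "Paris, obviously.", "It's Lyon.", "Paris is the capital."],
    grade=len,  # toy scorer standing in for the LLM auto-grader
)
# Even this toy scorer shows the failure mode: a length-biased grader
# marks the longest answer "chosen" and the terse correct one "rejected".
```

Note that with a biased scorer, a short correct answer can end up as the "rejected" half of the pair, which is exactly the kind of signal we trained in.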
Why (we think)
The preference pairs had low contrast
Most pairs were "model output vs. model output with a minor stylistic change", not "correct answer vs. wrong answer". DPO treats "slightly less structured" as a meaningful preference signal and overfits to it.
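To see why low contrast hurts, here's the per-pair DPO loss with our beta=0.1. The log-probabilities are made-up numbers for illustration, but the math shows the problem: at the start of training (policy equals reference), a pair where chosen and rejected differ by a stylistic hair produces exactly the same loss as a pair where rejected is badly wrong:

```python
import math

def dpo_loss(beta, pi_chosen, pi_rejected, ref_chosen, ref_rejected):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Arguments are summed log-probabilities of the chosen/rejected responses
    under the policy and the frozen reference model.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Low-contrast pair: chosen and rejected are nearly equally likely.
low_contrast = dpo_loss(0.1, pi_chosen=-20.0, pi_rejected=-20.5,
                        ref_chosen=-20.0, ref_rejected=-20.5)
# High-contrast pair: rejected is a genuinely bad answer.
high_contrast = dpo_loss(0.1, pi_chosen=-20.0, pi_rejected=-45.0,
                         ref_chosen=-20.0, ref_rejected=-45.0)
# Both start at -log sigmoid(0) = log 2: the loss doesn't know the
# quality gap, only the label, so it pushes equally hard on both pairs.
```

That's the mechanism behind "overfits to it": the loss widens the margin on every labeled pair with the same initial force, whether the label encodes a real error or a stylistic nit.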
Our auto-grader was the model itself
We used a grader that was a closely-related LLM. It had correlated errors with the model we were training. So DPO was effectively teaching the model "agree with your sibling's biases."
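One way to catch this before training, sketched here with made-up gold labels: score a small gold-labeled set with both models and measure how often the grader excuses the trainee's wrong answers. A high overlap means the grader shares the trainee's blind spots and its preference labels can't be trusted on exactly the questions that matter:

```python
def error_overlap(gold, model_answers, grader_verdicts):
    """Fraction of the model's wrong answers that the grader still marked correct."""
    wrong = [i for i, (g, m) in enumerate(zip(gold, model_answers)) if g != m]
    if not wrong:
        return 0.0
    return sum(grader_verdicts[i] for i in wrong) / len(wrong)

# Illustrative data: the model gets two answers wrong; the grader
# (a sibling LLM) waves one of those errors through.
gold   = ["Paris", "1789", "O(n log n)", "TCP"]
model  = ["Paris", "1792", "O(n)",       "TCP"]
grader = [True,    True,   False,        True]  # grader's "correct?" verdicts
overlap = error_overlap(gold, model, grader)    # 0.5
```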
max_length gotcha
First attempts with max_length=1024 crashed the V100 with a CUDA illegal-memory-access error, an OOM on the second epoch despite apparent headroom on the first. We dropped to 512 and it ran, but 512 truncated some long-form answers mid-thought, making some "rejected" answers artificially worse for a reason unrelated to quality.
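The check we should have run before training, sketched with a whitespace-split token proxy (the real version should count tokens with the model's own tokenizer): flag every pair where either side would be cut off at max_length, so truncation can't masquerade as a quality difference.

```python
MAX_LENGTH = 512

def approx_tokens(text):
    # Crude proxy; substitute the training tokenizer's count in practice.
    return len(text.split())

def truncated_pairs(pairs, budget=MAX_LENGTH):
    """Return pairs where prompt + either response exceeds the length budget."""
    return [p for p in pairs
            if approx_tokens(p["prompt"] + " " + p["chosen"]) > budget
            or approx_tokens(p["prompt"] + " " + p["rejected"]) > budget]

# Toy data: the rejected side is 600 "tokens" and would be cut mid-thought.
pairs = [{"prompt": "q", "chosen": "ok " * 10, "rejected": "long " * 600}]
flagged = truncated_pairs(pairs)
```

Flagged pairs can be dropped, shortened, or re-generated rather than silently truncated by the trainer.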
What we're changing
For v7:
- Higher-contrast pairs: we're now generating preferences by deliberately constructing wrong answers (factual errors inserted by a separate model) vs correct answers, not by taking two slightly different variations.
- Human-validated labels on a 10% sample: before training DPO, we manually label 10% of pairs to check whether the auto-labeler agrees with us. If agreement < 80%, we don't train.
- Regression tests: we benchmark the DPO model on the same categories before committing the merge. If any category regresses by more than 1 grade, we don't promote.
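The human-validation and regression gates above can be sketched like this. The `GRADES` mapping, category names, and labeling callables are assumptions for illustration, not our actual harness:

```python
import random

GRADES = {"A": 3, "B": 2, "C": 1, "D": 0}

def agreement_gate(pairs, human_label, auto_label,
                   sample_frac=0.10, threshold=0.80):
    """Gate 1: human/auto-label agreement on a random 10% sample must hit 80%."""
    sample = random.sample(pairs, max(1, int(len(pairs) * sample_frac)))
    agree = sum(human_label(p) == auto_label(p) for p in sample)
    return agree / len(sample) >= threshold

def regression_gate(before, after, max_drop=1):
    """Gate 2: no benchmark category may drop more than one letter grade."""
    return all(GRADES[before[c]] - GRADES[after[c]] <= max_drop for c in before)

# Toy grades: "history" fell two grades (A -> C), so the merge is blocked.
ok = regression_gate({"law": "A", "history": "A"},
                     {"law": "B", "history": "C"})
```

Both gates are cheap to run before a merge, and either one failing means the DPO checkpoint is not promoted.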
The lesson
DPO is not a "free improvement" over a strong SFT model. It's a specific technique for teaching preferences, and if your preference signal is weak or biased, you'll learn the weakness or bias. Benchmarks are the only thing that tells you whether your fine-tuning worked.
We're publishing this postmortem because it's more useful than another "we trained a new SOTA model!" post. Real training runs regress. Real teams roll them back.