Sovereign AI
2026-04-19 · 2 min read

The A-rate benchmark: how we grade crypto LLMs

Our 45-question crypto benchmark is graded on an A–F scale by two reviewers plus a rubric matcher. Here's exactly how, and why it's harder to game than multiple-choice.

Most public LLM benchmarks are multiple-choice quizzes. MMLU, HellaSwag, ARC — they're useful for comparing raw capability, but they don't tell you whether a model can write a coherent paragraph explaining MEV to a non-expert. Ours does.

The setup

  • 45 questions across 8 categories: DeFi mechanics, MEV/PBS/FCFS, tokenomics, L2 design, Solana-specific, smart-money / wallet clustering, regime detection, security.
  • Each question has a rubric: a list of facts that must appear, common misconceptions that must not, and required caveats (e.g. "must mention that slashing is a real risk in LRTs").
  • Grading scale: A (complete + correct), B (correct with minor gaps), C (partially correct), D (mostly wrong), F (wrong).
  • Two human graders plus one automated rubric matcher. Any disagreement among the three triggers a third human reviewer.
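The rubric-matcher pass can be sketched as a simple checklist over required facts, forbidden misconceptions, and required caveats. This is an illustrative sketch, not our production grader; the names `Rubric` and `rubric_grade` and the letter thresholds are assumptions for the example:

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    required_facts: list[str]      # must appear in the answer
    misconceptions: list[str]      # must NOT appear
    required_caveats: list[str]    # e.g. "slashing is a real risk in LRTs"

def rubric_grade(answer: str, rubric: Rubric) -> str:
    """Map rubric matches to a letter grade; human graders can override."""
    text = answer.lower()
    # Stating a known misconception is disqualifying regardless of coverage.
    if any(m.lower() in text for m in rubric.misconceptions):
        return "F"
    facts_hit = sum(f.lower() in text for f in rubric.required_facts)
    caveats_hit = sum(c.lower() in text for c in rubric.required_caveats)
    if facts_hit == len(rubric.required_facts):
        # Complete and correct is an A; missing a caveat drops it to a B.
        return "A" if caveats_hit == len(rubric.required_caveats) else "B"
    if facts_hit >= len(rubric.required_facts) / 2:
        return "C"
    return "D" if facts_hit > 0 else "F"
```

In practice the matcher works on semantic matches rather than substrings, but the grade logic follows this shape: misconceptions are fatal, missing caveats cap you at B.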

Why A–F, not 0–1

Numbers-only grading forces reviewers to collapse nuance. A response that gets the mechanism right but omits the risk disclaimer is a B in our system; a numeric system would score it around 0.87, which looks fine but is actually a meaningful miss. A–F captures the kind of failure, not just its size.

How we prevent gaming

  • The eval set is small (45 questions) but the rubrics are strict. A model fine-tuned on our exact questions would also need to match our rubric style without being explicitly trained on it, which is hard to do without us noticing.
  • Reviewers rotate. No reviewer sees the same question twice in a row.
  • We publish the eval set — anyone can reproduce it.
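The rotation rule above (no reviewer sees the same question twice in a row) can be satisfied by shifting the reviewer pool by one position each grading round. A minimal sketch, assuming a round-based schedule; `assign_reviewers` is a hypothetical name, not our actual scheduler:

```python
def assign_reviewers(questions: list[str], reviewers: list[str],
                     rounds: int) -> dict[int, dict[str, str]]:
    """Return {round: {question: reviewer}}, offsetting the pool each round.

    With more than one reviewer, question i gets reviewers[(i + r) % n] in
    round r, so consecutive rounds never repeat a (question, reviewer) pair.
    """
    return {
        r: {q: reviewers[(i + r) % len(reviewers)]
            for i, q in enumerate(questions)}
        for r in range(rounds)
    }
```

Any derangement-style rotation works; the modular shift is just the simplest one that guarantees the no-repeat property.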

Current scores

| Model | A-rate | A+B |
| --- | --- | --- |
| Sovereign v2 v6-SFT (14B) | 87% | 98% |
| GPT-4o (closed) | 68% | 91% |
| Claude 3.5 Haiku | 64% | 88% |
| Llama 3 70B | 59% | 82% |
| Qwen3-14B (no fine-tune) | 47% | 69% |
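The two columns are straightforward fractions over the 45 per-question letter grades. A minimal sketch of how they're computed; the sample grade list below is made up for illustration, not an actual model's transcript:

```python
from collections import Counter

def score_summary(grades: list[str]) -> tuple[float, float]:
    """Return (A-rate, A+B rate) as fractions of all graded questions."""
    counts = Counter(grades)
    n = len(grades)
    return counts["A"] / n, (counts["A"] + counts["B"]) / n

# 45 hypothetical grades: 39 A's, 5 B's, 1 C.
a, ab = score_summary(["A"] * 39 + ["B"] * 5 + ["C"])
print(f"A-rate {a:.0%}, A+B {ab:.0%}")  # prints "A-rate 87%, A+B 98%"
```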

Where we fail

Our model's weakest category is smart-money / wallet clustering, where it grades out at a C. Our training data underrepresents this category; we're fixing it with a curriculum-weighted v7.

Why publish this

Because if we don't, no one can argue with our numbers — and if no one can argue, the numbers are worthless. Run the benchmark yourself. Argue with the rubrics. File PRs to add questions.