Sovereign AI
2026-04-19 · 2 min read

The A-rate benchmark: how we grade crypto LLMs

Our 45-question crypto benchmark is graded on an A–F scale by two reviewers plus a rubric matcher. Here's exactly how, and why it's harder to game than multiple-choice.

Most public LLM benchmarks are multiple-choice quizzes. MMLU, HellaSwag, ARC — they're useful for comparing raw capability, but they don't tell you whether a model can write a coherent paragraph explaining MEV to a non-expert. Ours does.

The setup

  • 45 questions across 8 categories: DeFi mechanics, MEV/PBS/FCFS, tokenomics, L2 design, Solana-specific, smart-money / wallet clustering, regime detection, security.
  • Each question has a rubric: a list of facts that must appear, common misconceptions that must not, and required caveats (e.g. "must mention that slashing is a real risk in LRTs").
  • Grading scale: A (complete + correct), B (correct with minor gaps), C (partially correct), D (mostly wrong), F (wrong).
  • Two human graders plus one automated rubric matcher. Any disagreement among the three triggers a third human reviewer.
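The rubric-matcher pass can be sketched as a simple checklist over required facts, forbidden misconceptions, and required caveats. This is an illustrative sketch, not our production grader; the names `Rubric` and `rubric_grade` and the letter thresholds are assumptions for the example:

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    required_facts: list[str]      # must appear in the answer
    misconceptions: list[str]      # must NOT appear
    required_caveats: list[str]    # e.g. "slashing is a real risk in LRTs"

def rubric_grade(answer: str, rubric: Rubric) -> str:
    """Map rubric matches to a letter grade; human graders can override."""
    text = answer.lower()
    # Stating a known misconception is disqualifying regardless of coverage.
    if any(m.lower() in text for m in rubric.misconceptions):
        return "F"
    facts_hit = sum(f.lower() in text for f in rubric.required_facts)
    caveats_hit = sum(c.lower() in text for c in rubric.required_caveats)
    if facts_hit == len(rubric.required_facts):
        # Complete and correct is an A; missing a caveat drops it to a B.
        return "A" if caveats_hit == len(rubric.required_caveats) else "B"
    if facts_hit >= len(rubric.required_facts) / 2:
        return "C"
    return "D" if facts_hit > 0 else "F"
```

In practice the matcher works on semantic matches rather than substrings, but the grade logic follows this shape: misconceptions are fatal, missing caveats cap you at B.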

Why A–F, not 0–1

Numbers-only grading forces reviewers to collapse nuance. A response that gets the mechanism right but omits the risk disclaimer is a B in our system; a numeric system would score it around 0.87, which looks fine but is actually a meaningful miss. A–F captures the kind of failure, not just its size.

How we prevent gaming

  • The eval set is small (45 questions) but the rubrics are strict. A model fine-tuned on our exact questions would also need to match our rubric style without being explicitly trained on it, which is hard to do without us noticing.
  • Reviewers rotate. No reviewer sees the same question twice in a row.
  • We publish the eval set — anyone can reproduce it.
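The rotation rule above (no reviewer sees the same question twice in a row) can be satisfied by shifting the reviewer pool by one position each grading round. A minimal sketch, assuming a round-based schedule; `assign_reviewers` is a hypothetical name, not our actual scheduler:

```python
def assign_reviewers(questions: list[str], reviewers: list[str],
                     rounds: int) -> dict[int, dict[str, str]]:
    """Return {round: {question: reviewer}}, offsetting the pool each round.

    With more than one reviewer, question i gets reviewers[(i + r) % n] in
    round r, so consecutive rounds never repeat a (question, reviewer) pair.
    """
    return {
        r: {q: reviewers[(i + r) % len(reviewers)]
            for i, q in enumerate(questions)}
        for r in range(rounds)
    }
```

Any derangement-style rotation works; the modular shift is just the simplest one that guarantees the no-repeat property.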

Current scores

| Model | A-rate | A+B |
| --- | --- | --- |
| Sovereign v2 v6-SFT (14B) | 87% | 98% |
| GPT-4o (closed) | 68% | 91% |
| Claude 3.5 Haiku | 64% | 88% |
| Llama 3 70B | 59% | 82% |
| Qwen3-14B (no fine-tune) | 47% | 69% |
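The two columns are straightforward fractions over the 45 per-question letter grades. A minimal sketch of how they're computed; the sample grade list below is made up for illustration, not an actual model's transcript:

```python
from collections import Counter

def score_summary(grades: list[str]) -> tuple[float, float]:
    """Return (A-rate, A+B rate) as fractions of all graded questions."""
    counts = Counter(grades)
    n = len(grades)
    return counts["A"] / n, (counts["A"] + counts["B"]) / n

# 45 hypothetical grades: 39 A's, 5 B's, 1 C.
a, ab = score_summary(["A"] * 39 + ["B"] * 5 + ["C"])
print(f"A-rate {a:.0%}, A+B {ab:.0%}")  # prints "A-rate 87%, A+B 98%"
```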

Where we fail

Our model's weakest category is smart-money / wallet clustering, where it grades out at a C. Our training data underrepresents this category; we're fixing it with a curriculum-weighted v7.

Why publish this

Because if we don't, no one can argue with our numbers — and if no one can argue, the numbers are worthless. Run the benchmark yourself. Argue with the rubrics. File PRs to add questions.