Reproducible. Skeptical. Public.
Our 45-question crypto benchmark, graded A–F by two independent graders plus an automated rubric. Same questions, same grading, every model. Run it yourself.
Headline results
| Model | A-rate | A+B combined | Access |
|---|---|---|---|
| Sovereign v2 (14B SFT) | 87% | 98% | open |
| GPT-4o | 68% | 91% | closed |
| Claude 3.5 Haiku | 64% | 88% | closed |
| Llama 3 70B | 59% | 82% | open |
| Qwen3-14B (base, no fine-tune) | 47% | 69% | open |
Numbers are updated each time we train a new version. Last updated: v6-SFT promoted 2026-04-16. See the research post for the full methodology and failure analysis.
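The headline columns are simple fractions over the 45 graded answers. A minimal sketch of the arithmetic (the grade counts below are illustrative, not the actual per-question grades):

```python
from collections import Counter

def grade_rates(grades):
    """Return (A-rate, A+B combined) as fractions of all graded answers."""
    counts = Counter(grades)
    n = len(grades)
    a_rate = counts["A"] / n
    ab_rate = (counts["A"] + counts["B"]) / n
    return a_rate, ab_rate

# Illustrative distribution over 45 questions: 39 A, 5 B, 1 C
grades = ["A"] * 39 + ["B"] * 5 + ["C"]
a, ab = grade_rates(grades)
print(f"A-rate: {a:.0%}, A+B: {ab:.0%}")  # A-rate: 87%, A+B: 98%
```

Reported percentages are rounded to whole numbers, so an 87% A-rate corresponds to 39 of 45 answers graded A.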
By category
| Category | Sovereign v2 | Qwen3-14B base |
|---|---|---|
| DeFi mechanics | A | B |
| MEV / PBS / FCFS | A | C |
| Tokenomics | A | B |
| L2 / rollup design | A | B |
| Solana-specific | A | C |
| Smart money / wallet clustering | C | D |
| Regime detection | B | C |
| Security / attack vectors | B | B |
Weak spots we're actively training against: smart-money / wallet clustering and regime detection.
Methodology
The benchmark is a fixed set of 45 crypto questions spanning the categories above. Each question has a rubric-graded expected answer (key facts, common misconceptions, required caveats).
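A rubric entry might look like the following. This is a made-up example for illustration only; the field names and contents are assumptions, not the repo's actual schema:

```python
# Hypothetical rubric entry -- illustrative, not from the real eval set.
rubric_entry = {
    "category": "L2 / rollup design",
    "question": "What does a fraud proof guarantee in an optimistic rollup?",
    "key_facts": [
        "state transitions can be challenged during a dispute window",
        "a single honest verifier is sufficient for safety",
    ],
    "common_misconceptions": [
        "fraud proofs make withdrawals instant",
    ],
    "required_caveats": [
        "liveness still depends on data availability",
    ],
}
```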
Each model's responses are graded A (complete + correct), B (correct with minor gaps), C (partially correct), D (mostly wrong), or F (wrong). Grading is done by two independent graders plus an automated rubric-matching step; disagreements are resolved by a third grader.
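The disagreement rule can be sketched as follows. This is a hypothetical illustration of the resolution step described above, not the repo's actual grading code, and the function name is an assumption:

```python
def resolve_grade(human_1: str, human_2: str, tiebreak: str) -> str:
    """Resolve two independent human grades.

    If the two graders agree, their shared grade stands; otherwise the
    third grader's call (`tiebreak`) decides.
    """
    if human_1 == human_2:
        return human_1
    return tiebreak

# Agreement: the shared grade stands.
print(resolve_grade("A", "A", "B"))  # A
# Disagreement: the third grader decides.
print(resolve_grade("A", "B", "A"))  # A
```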
The full eval set, grading rubric, and reproducibility script are in our public GitHub repo. No hidden test set.
Reproduce
git clone https://github.com/sovereignai/eval cd eval pip install -r requirements.txt # Run against our API python run_benchmark.py --model sovereign-v2 --api-base https://api.sovereignai.xyz # Or any OpenAI-compatible endpoint python run_benchmark.py --model gpt-4o --api-base https://api.openai.com/v1