Sovereign AI
Public benchmark

Reproducible. Skeptical. Public.

A 45-question crypto benchmark, graded on an A–F scale by independent graders plus an automated rubric. Same questions, same grading, every model. Run it yourself.

Headline results

Model                          | A-rate | A+B combined | Access
Sovereign v2 (14B SFT)         | 87%    | 98%          | open
GPT-4o                         | 68%    | 91%          | closed
Claude 3.5 Haiku               | 64%    | 88%          | closed
Llama 3 70B                    | 59%    | 82%          | open
Qwen3-14B (base, no fine-tune) | 47%    | 69%          | open

Numbers are updated each time we train a new version. Last updated: v6-SFT promoted 2026-04-16. See the research post for the full methodology and failure analysis.

By category

Category                        | Sovereign v2 | Qwen3-14B base
DeFi mechanics                  | A            | B
MEV / PBS / FCFS                | A            | C
Tokenomics                      | A            | B
L2 / rollup design              | A            | B
Solana-specific                 | A            | C
Smart money / wallet clustering | C            | D
Regime detection                | B            | C
Security / attack vectors       | B            | B

Weak spots we're actively training against: smart-money / wallet clustering and regime detection.

Methodology

The benchmark is a fixed set of 45 crypto questions spanning the categories above. Each question has a rubric-graded expected answer (key facts, common misconceptions, required caveats).

Each model's responses are graded A (complete + correct), B (correct with minor gaps), C (partially correct), D (mostly wrong), or F (wrong). Grading is done by two independent graders plus an automated rubric-matching step; disagreements are resolved by a third grader.

The full eval set, grading rubric, and reproducibility script are in our public GitHub repo. No hidden test set.

Reproduce

git clone https://github.com/sovereignai/eval
cd eval
pip install -r requirements.txt

# Run against our API
python run_benchmark.py --model sovereign-v2 --api-base https://api.sovereignai.xyz

# Or any OpenAI-compatible endpoint
python run_benchmark.py --model gpt-4o --api-base https://api.openai.com/v1
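The headline A-rate and A+B combined numbers in the table above are simple proportions over the 45 final grades. A minimal sketch, assuming you have the per-question letter grades as a list (the benchmark script's actual output format may differ):

```python
def headline_rates(grades: list[str]) -> tuple[int, int]:
    """Compute (A-rate %, A+B combined %) from a list of letter grades."""
    n = len(grades)
    a_rate = sum(g == "A" for g in grades) / n
    ab_rate = sum(g in ("A", "B") for g in grades) / n
    return round(100 * a_rate), round(100 * ab_rate)
```

For example, grades of ["A", "A", "B", "C"] yield an A-rate of 50% and an A+B rate of 75%.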