LIVE BENCHMARKS

AI model benchmarks — 35+ frontier models, scored independently

Intelligence Index, GPQA Diamond, MATH-500, LiveCodeBench, MMLU-Pro, AIME 2025 — refreshed as labs publish. Strip out vendor charts; see the head-to-head numbers.


Every major AI lab publishes a self-congratulatory benchmark chart on the day it releases a model. Council AI's benchmarks page aggregates third-party-verified scores across the suites that actually matter in 2026: Intelligence Index (composite), GPQA Diamond (PhD-level science), MATH-500 (competition math), LiveCodeBench (recent unseen competitive programming), MMLU-Pro (expert-domain reasoning), and AIME 2025 (recent unseen math). We surface the same numbers vendors quietly cite in their own docs, side-by-side, with the date each was last refreshed. For most production decisions, the right move isn't picking the model at the top of any one column; it's running 2-3 of the leading models in a council and letting the moderator synthesize their answers.

What each benchmark measures

Intelligence Index: composite score from Artificial Analysis combining reasoning, knowledge, and coding suites. Best single-number proxy for general capability.
GPQA Diamond: graduate-level science Q&A; the "hard" subset, on which even domain experts score only about 65% or a little higher.
MATH-500: a 500-problem subset of the MATH competition dataset; tests multi-step reasoning rather than pattern matching.
LiveCodeBench: competitive programming problems tagged with release dates, so a model can be scored only on problems released after its training cutoff. Resistant to contamination.
MMLU-Pro: expert-domain reasoning across 14 disciplines; the harder successor to MMLU.
AIME 2025: American Invitational Mathematics Examination 2025; recent enough to be unseen by most training corpora.
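
The contamination argument behind LiveCodeBench and AIME 2025 comes down to a date comparison: a problem counts as unseen only if it was released after the model's training cutoff. The sketch below illustrates that filter. The problem IDs, dates, and the `unseen_problems` helper are hypothetical placeholders, not real benchmark data or any benchmark's actual API.

```python
from datetime import date

# Hypothetical problem records: (problem_id, release_date).
# These are illustrative placeholders, not real benchmark data.
PROBLEMS = [
    ("p1", date(2024, 3, 1)),
    ("p2", date(2024, 9, 15)),
    ("p3", date(2025, 2, 10)),
]


def unseen_problems(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [pid for pid, released in problems if released > training_cutoff]


# A model with an assumed mid-2024 cutoff is scored only on the later problems.
print(unseen_problems(PROBLEMS, date(2024, 6, 30)))
```

A score computed on the filtered set is harder to inflate by memorization, which is why the date of each problem matters as much as the score itself.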

How to read the live numbers

A model that wins Intelligence Index but loses LiveCodeBench is likely strong at general reasoning but weaker at producing working code for problems it has not seen before. A model that wins MATH-500 but loses MMLU-Pro is strong at math-shaped problems and weaker on cross-domain knowledge. No model wins every suite; pick by the suite closest to your job, or run a council and trust the moderator's synthesis.

Why not trust vendor benchmarks

Every lab publishes the suites where it wins, and labs often train extensively to win specific benchmarks. The way to spot this is to look for "live" benchmarks, ones whose problems were released after the model's training cutoff (LiveCodeBench, AIME 2025), and to cross-reference multiple suites. A model that dominates only one suite is suspect; a model in the top quartile of three suites is far more likely to be genuinely strong.
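
The cross-referencing rule of thumb can be sketched as a simple count: for each model, in how many suites does it land in the top quartile? The model names and scores below are made-up placeholders, and `top_quartile_counts` is a hypothetical helper written for illustration, not how this page computes anything.

```python
import math

# Hypothetical scores (model -> suite -> score); placeholders, not real results.
SCORES = {
    "model_a": {"gpqa": 80, "livecodebench": 70, "mmlu_pro": 82, "math_500": 90},
    "model_b": {"gpqa": 60, "livecodebench": 40, "mmlu_pro": 65, "math_500": 85},
    "model_c": {"gpqa": 55, "livecodebench": 65, "mmlu_pro": 60, "math_500": 99},
    "model_d": {"gpqa": 50, "livecodebench": 35, "mmlu_pro": 58, "math_500": 80},
}


def top_quartile_counts(scores):
    """Count, per model, the suites in which it ranks in the top quartile."""
    counts = {model: 0 for model in scores}
    suites = {suite for per_model in scores.values() for suite in per_model}
    for suite in suites:
        # Rank only the models that actually have a score for this suite.
        ranked = sorted(
            (m for m in scores if suite in scores[m]),
            key=lambda m: scores[m][suite],
            reverse=True,
        )
        cutoff = max(1, math.ceil(len(ranked) / 4))  # size of the top quartile
        for model in ranked[:cutoff]:
            counts[model] += 1
    return counts


print(top_quartile_counts(SCORES))
```

In this toy data, model_a places in the top quartile of three suites, while model_c leads only MATH-500, which is the single-suite pattern worth a second look.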

Frequently asked questions

How often are benchmarks updated?

We refresh scores from Artificial Analysis, LiveCodeBench, and lab-published reports weekly. Each row shows a 'last updated' date so you can spot stale numbers.
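
If you export or copy the table, the 'last updated' date makes stale rows easy to flag programmatically. The sketch below assumes a hypothetical row layout of (model, suite, last_updated) and an arbitrary 30-day threshold; neither reflects an actual export format or refresh policy.

```python
from datetime import date, timedelta

# Hypothetical rows: (model, suite, last_updated). The layout is an assumption.
ROWS = [
    ("model_a", "gpqa", date(2026, 1, 5)),
    ("model_b", "gpqa", date(2025, 10, 1)),
]


def stale_rows(rows, today, max_age_days=30):
    """Return rows whose last-updated date is more than max_age_days old."""
    limit = today - timedelta(days=max_age_days)
    return [row for row in rows if row[2] < limit]


# With an assumed "today" of 2026-01-20, only the October row is flagged.
for model, suite, updated in stale_rows(ROWS, today=date(2026, 1, 20)):
    print(f"{model} / {suite}: last updated {updated.isoformat()}")
```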

Can I run my own benchmark on these models?

Yes. Sign up free, select the models, and run your own prompts. Free users get up to 3 models per chat; Pro and Ultra get up to 10, enough to benchmark models from all the frontier labs in one go.

Why is one lab missing from a benchmark?

If a row is blank, the lab hasn't released a score and no third-party benchmark has evaluated the model yet. We don't fabricate numbers.