MGMT 590 · LLM Evaluation Track · April 2026 · Interactive Playground

Two votes can flip the throne.

On Chatbot Arena, a noise budget of 0.003% — just two battles out of 57,477 — is enough to unseat the top-ranked LLM (Huang et al. 2025). That's the share of votes that could plausibly be off in any leaderboard: ambiguous prompts, rater disagreement, selective reporting. Bootstrap confidence intervals don't catch it. The playground below runs on real arena-human-preference-140k votes — verify the fragility yourself, then watch our framework fix it.

• 0.003% · Votes that flip #1 on Arena
• 6,000× · Arena vs MT-Bench fragility gap
• 2M+ · Battles audited (Singh et al.)
• 3 · Components in our framework
Scroll. Below, you control four levers — dataset, estimator, drop rule, and α — and watch the leaderboard refit live.
Primer

Bradley–Terry, in 30 seconds

Most leaderboards (Chatbot Arena, RewardBench, MT-Bench) compress millions of pairwise battles into one number per model. Here is how that compression works.

How a vote becomes a score
01
Prompt
User submits a question to two anonymous models, A and B.
02
Two responses
Both responses appear side-by-side, models hidden.
03
Vote
User picks: A wins, B wins, or tie.
04
BT fit
All votes go into one MLE. Each model gets a single score.
The model

Each model i gets a latent strength β_i. The probability that i beats j is logistic in the gap.

P(i ≻ j) = e^{β_i} / (e^{β_i} + e^{β_j}) = σ(β_i − β_j)

Estimate β̂ by maximum likelihood over millions of votes. This is equivalent to logistic regression on a one-hot design and converges in seconds.
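
To make that concrete, here is a minimal BT fit as it might look in TypeScript (a sketch with hypothetical names, not the page's lib/ code; ties are ignored and every vote gets weight 1):

```ts
// Minimal Bradley–Terry fit. Hypothetical Vote shape and names; ties skipped for brevity.
type Vote = { a: number; b: number; winner: 0 | 1 }; // winner: 0 = model a, 1 = model b

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

// P(a beats b) under strengths beta: logistic in the gap.
const pWin = (beta: number[], a: number, b: number): number => sigmoid(beta[a] - beta[b]);

// Gradient-ascent MLE, equivalent to logistic regression on a one-hot design.
function fitBT(votes: Vote[], nModels: number, lr = 0.5, iters = 300): number[] {
  const beta = new Array(nModels).fill(0);
  for (let t = 0; t < iters; t++) {
    const grad = new Array(nModels).fill(0);
    for (const { a, b, winner } of votes) {
      const resid = (winner === 0 ? 1 : 0) - pWin(beta, a, b); // observed minus predicted
      grad[a] += resid;
      grad[b] -= resid;
    }
    for (let i = 0; i < nModels; i++) beta[i] += (lr / votes.length) * grad[i];
  }
  const mean = beta.reduce((s, x) => s + x, 0) / nModels; // strengths identified only up to a shift
  return beta.map(x => x - mean);
}
```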

Wide gaps → robust ranks
Narrow gaps → fragile ranks

Below in the playground, the Dataset preset lever switches between real Chatbot Arena votes (narrow gaps, real raters) and a synthetic MT-Bench foil (wide gaps, no raters).

Two failure modes

Statistical fragility meets systemic bias

Arena-style leaderboards face two independent reasons to doubt the number on screen. Both happen at the same time. Both compound.

Fragility · Huang et al. 2025

Removing 0.003% of votes flips the #1 LLM

Two preferences out of 57,477. AMIP (approximate maximum influence perturbation) catches it. Bootstrap CIs miss it because they assume IID resampling, not adversarial removal.

Chatbot Arena · 0.003%
MT-Bench · 18.1%
Source: arXiv:2508.11847
Governance · Singh et al. 2025

The Leaderboard Illusion

An audit of ~2M battles, 243 models, 42 providers. The system itself is biased upstream of any statistical fragility — and the two compound.

• 27 Meta variants tested before LLaMA-4 launch · Providers can test multiple private variants and only release the best.
• 205 models silently removed · Removal without notification breaks the consistency of the comparison set.
• 112% relative gain on ArenaHard from extra Arena data · Providers with more Arena data tune harder to the benchmark, not to general tasks.
• ~40% of all votes go to two providers · Google ~19.2% + OpenAI ~20.4%, while 83 open-weight models share ~29.7%.
Source: arXiv:2504.20879
These are not independent problems. Selective reporting introduces extreme data points; those are precisely the votes AMIP flags as high-leverage. Treat α as your noise budget — the share of votes you'd consider unreliable for any reason: voter disagreement, prompt drift, rater inconsistency, bot-vote filtering, or selective reporting. In the playground below, you'll do the experiment yourself: pick a regime, pick how that noise is distributed, then drag α and watch the rank flip.
The killer demo

Drop a few votes. Watch the leaderboard shuffle.

Start with real Arena votes, then drop a tiny fraction and watch rankings change instantly.

What does “removing votes” actually mean?
It's a sensitivity test. We're asking: if a small fraction of votes were noisy, missing, or biased, would the rank still hold?

Treat α as a noise budget: drop that fraction of votes and see if the ranking still holds.

Voter disagreement · Prompt drift · Rater inconsistency · Bot-vote filtering · Selective reporting · Sampling variance
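
As a sketch of what the slider does under the hood (illustrative TypeScript reusing the fitBT helper from the primer sketch, not the app's lib/amip.ts), one sensitivity check drops the first α fraction of votes under some ordering, refits, and asks whether the leader survived:

```ts
// Does the #1 model survive dropping an alpha fraction of votes?
// `order` lists vote indices from most to least suspect (uniform-random or targeted).
const argmax = (xs: number[]): number => xs.indexOf(Math.max(...xs));

function top1Holds(votes: Vote[], nModels: number, order: number[], alpha: number): boolean {
  const top1 = argmax(fitBT(votes, nModels));      // baseline leader
  const nDrop = Math.floor(alpha * votes.length);  // e.g. alpha = 0.00003 is 0.003%
  const dropped = new Set(order.slice(0, nDrop));
  const kept = votes.filter((_, idx) => !dropped.has(idx));
  return argmax(fitBT(kept, nModels)) === top1;    // refit and compare leaders
}
```
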
Your four levers · every change refits BT live
01
Dataset preset
Which leaderboard regime are we simulating?

Choose real Arena votes or a synthetic foil for contrast.

Currently: Arena · real votes. Real Chatbot Arena votes: gemini-2.5-pro, chatgpt-4o-latest, o3, gemini-2.5-flash and 8 more, with the actual matchup distribution from the dataset. The regime where Arena-style fragility lives.
02
Estimator
How is the Bradley–Terry model fit?

Standard fit vs a defense that down-weights high-leverage votes.

Currently: Vanilla BT. Standard BT-MLE: every vote has weight 1. This is what published Arena, MT-Bench, and RewardBench leaderboards do today.
03
Drop rule
How is that noise distributed across votes?

Targeted worst-case drops vs uniform random drops.

Currently: AMIP worst-case. Models worst-case noise: the votes with the highest computed influence on the rank-1 vs rank-2 gap are the ones that get perturbed. Maps to selective reporting, adversarial filtering, or any process that disproportionately affects high-leverage battles.
04
Drop fraction · α
How much noise do we allow?

Increase α to drop more votes and see when ranks start to break.

α = 0.000% · 0 / 3,000 votes
Currently: Nothing dropped. Move the slider to start.
Try this · four scenarios that tell the whole story
Live readout
The leaderboard you would see right now, given your four levers.
0 of 3,000 votes dropped
Influence histogram
Votes ranked by leverage; amber marks dropped votes.
low influence → high influence
Top-pair α_flip: 1.73% · Top-1 now: gemini-2.5-pro
Robustness-Aware Leaderboard
Live BT refit · AMIP-ranked drop · α = 0.000%
Rank · Model · Provider · Elo [CI] · α_flip · Status
1 (held) · gemini-2.5-pro · Google DeepMind · 1117 [1065, 1169] · 1.73% · ROBUST
2 (held) · chatgpt-4o-latest · OpenAI · 1053 [999, 1107] · 0.033% · FRAGILE
3 (held) · o3 · OpenAI · 1052 [999, 1105] · 0.333% · MODERATE
4 (held) · gemini-2.5-flash · Google DeepMind · 1040 [989, 1091] · 0.933% · MODERATE
5 (held) · qwen3-235b-a22b-no-thinking · Alibaba · 1010 [958, 1061] · 0.333% · MODERATE
6 (held) · gemma-3-27b-it · Google · 997 [943, 1051] · 0.033% · FRAGILE
7 (held) · mistral-medium-2505 · Mistral AI · 997 [946, 1048] · 0.800% · MODERATE
8 (held) · claude-opus-4 · Anthropic · 971 [937, 1005] · 0.167% · MODERATE
9 (held) · claude-sonnet-4 · Anthropic · 966 [915, 1017] · 0.300% · MODERATE
10 (held) · command-a-03-2025 · Cohere · 954 [901, 1008] · 0.700% · MODERATE
11 (held) · claude-3-7-sonnet-20250219-thinking-32k · Anthropic · 925 [872, 977] · 0.133% · MODERATE
12 (held) · claude-3-7-sonnet · Anthropic · 919 [863, 974] · >50% · ROBUST
12 models · 3,000 real votes (of the 15,925 between these models) · seed 1729 · Real Chatbot Arena votes between the 12 most-active models on lmarena-ai/arena-human-preference-140k. β / CI are fit on all 15,925 votes between them; the in-browser slider runs on a deterministic 3,000-vote uniform subsample so AMIP stays responsive.
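
For reference, a deterministic seeded subsample like that can be drawn with a small PRNG (a sketch; mulberry32 and the partial Fisher–Yates shuffle are illustrative choices, not necessarily what the app ships):

```ts
// Small seeded PRNG (mulberry32): the same seed always yields the same subsample.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Deterministic uniform subsample of k items via a partial Fisher–Yates shuffle.
function subsample<T>(items: T[], k: number, seed = 1729): T[] {
  const rand = mulberry32(seed);
  const idx = [...items.keys()];
  for (let i = 0; i < Math.min(k, idx.length); i++) {
    const j = i + Math.floor(rand() * (idx.length - i));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, k).map(i => items[i]);
}
```
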
Our proposal

Robustness-Aware Leaderboards

Three components, one closed loop. Each is a drop-in extension of the existing Chatbot Arena pipeline. None requires new human infrastructure.

Diagnostic01

Robustness intervals

Apply AMIP to each adjacent pair on the leaderboard. Report α_flip alongside every BT score. O(N) cost on top of one BT fit. Drop-in column.
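
For intuition, a simplified version of that computation might look like this (a sketch only: real AMIP, following Broderick et al. 2020, applies an inverse-Hessian correction, whereas this proxy just scores each vote's gradient pull on the rank-1 vs rank-2 gap and scans a grid of drop fractions, reusing the fitBT helper from the primer):

```ts
// Score each vote's first-order pull on the beta gap between two models
// (a crude proxy for AMIP influence; the inverse-Hessian correction is omitted).
function influenceOnGap(votes: Vote[], beta: number[], hi: number, lo: number): number[] {
  return votes.map(({ a, b, winner }) => {
    const resid = (winner === 0 ? 1 : 0) - pWin(beta, a, b); // grad wrt beta[a]; -resid wrt beta[b]
    let score = 0;
    if (a === hi) score += resid; else if (a === lo) score -= resid;
    if (b === hi) score -= resid; else if (b === lo) score += resid;
    return score; // > 0 means the vote currently props up the hi-over-lo gap
  });
}

// alpha_flip for the top pair: smallest tested drop fraction that flips (or closes) the gap.
function alphaFlip(votes: Vote[], nModels: number,
                   grid = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]): number {
  const beta = fitBT(votes, nModels);
  const [top1, top2] = [...beta.keys()].sort((i, j) => beta[j] - beta[i]);
  const scores = influenceOnGap(votes, beta, top1, top2);
  const order = [...votes.keys()].sort((i, j) => scores[j] - scores[i]); // most supportive first
  for (const alpha of grid) {
    const keptIdx = new Set(order.slice(Math.floor(alpha * votes.length)));
    const refit = fitBT(votes.filter((_, i) => keptIdx.has(i)), nModels);
    if (refit[top2] >= refit[top1]) return alpha;
  }
  return Infinity; // robust beyond the largest tested budget
}
```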

Model · Elo [CI] · α_flip · Status
GPT-5 · 1412 [1405, 1419] · 0.0031% · FRAGILE
Claude-4.6 · 1407 [1399, 1414] · 0.41% · MODERATE
Gemini-3 · 1398 [1389, 1407] · 2.8% · ROBUST
Extends Broderick et al. 2020; Huang et al. 2025
Prescriptive02

Influence-gain sampling

Steer the matchup selector toward fragile pairs. Generalizes Fisher-information sampling. Closes the loop: audit → action.

Top-pair α_flip vs new votes spent
Extends Chiang et al. 2024; Frick et al. 2025
Robust estimation03

Influence-capped BT

Cap the top 0.1% of votes by influence magnitude. Or use a Huberized BT loss. Reduces fragility before it is reported.
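
One way the cap could be wired, reusing the earlier fitBT and influenceOnGap sketches (illustrative; a softer variant would down-weight rather than drop, or Huberize the loss):

```ts
// Influence-capped BT: zero-weight the top capFrac of votes by influence magnitude
// on the current top pair, then refit. capFrac = 0.001 is the 0.1% in the text.
function fitCappedBT(votes: Vote[], nModels: number, capFrac = 0.001): number[] {
  const beta = fitBT(votes, nModels);
  const [top1, top2] = [...beta.keys()].sort((i, j) => beta[j] - beta[i]);
  const magnitude = influenceOnGap(votes, beta, top1, top2).map(Math.abs);
  const k = Math.max(1, Math.round(capFrac * votes.length));
  const capped = new Set(
    [...votes.keys()].sort((i, j) => magnitude[j] - magnitude[i]).slice(0, k),
  );
  return fitBT(votes.filter((_, i) => !capped.has(i)), nModels);
}
```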

Per-vote influence distribution · kept vs capped (top 0.1%)
Builds on Hunter 2004; Huber-style M-estimators
Putting it together

Three components, one self-correcting pipeline

Each component is a drop-in extension of the existing Arena pipeline — none requires new human infrastructure. The leverage comes from how they compose: the audit (C1) reads the robust fit (C3), the sampler (C2) acts on the audit, and the next refresh picks up new votes that already reflect the previous loop.

One leaderboard refresh ≈ one loop
Step 01
Fit, robustly
Influence-capped BT
Output: β̂ that zero-weights the top 0.1% of votes by leverage
Step 02
Diagnose fragility
Robustness intervals · AMIP
Output: α_flip column on every adjacent pair
Step 03
Spend new votes
Influence-gain sampling
Output: next-batch matchup queue, concentrated where α_flip is lowest

New battles from step 03 feed step 01's next refit. What every reader sees alongside Elo is the α_flip column from step 02 — and it tightens automatically as step 03 retires fragile pairs. Below, the loop is put on trial: each step ships with a falsifiable claim (C1, C2, C3) that has to pass live on this page.
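
In code terms, one refresh might read roughly like this hypothetical glue over the earlier sketches (the page's actual pipeline lives in lib/amip.ts, lib/cap.ts, and lib/sampler.ts):

```ts
// One leaderboard refresh ~ one loop: robust fit, fragility audit, then spend new votes.
function refresh(votes: Vote[], nModels: number): { beta: number[]; alphaTopPair: number } {
  const beta = fitCappedBT(votes, nModels);       // Step 01: influence-capped fit
  const alphaTopPair = alphaFlip(votes, nModels); // Step 02: alpha_flip on the rank-1 vs rank-2 pair
  // Step 03: queue the next matchup batch with weight concentrated where alpha_flip is lowest,
  // collect those outcomes, append them to `votes`, and run refresh() again on the next cycle.
  return { beta, alphaTopPair };
}
```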

Stress-testing the loop

Three claims, three live verdicts

The loop above only matters if the claims behind it hold. Each component (C1 audit, C2 sampler, C3 fit) maps 1:1 to a testable claim with the same number. The full validation runs offline on the entire lmarena-ai/arena-human-preference-140k dataset; the same code, on a 1.5k-vote real-arena subsample bundled in this app, executes here in your browser the moment you scroll.

Live execution · this device
real 12-model Chatbot Arena · 1,500 votes · deterministic seed
Verdicts: Pass / Partial pass / Needs more evidence · Same code path as the playground above — lib/amip.ts, lib/cap.ts, lib/sampler.ts.
Live · Claim C1 · Calibration

α_flip is well-calibrated. Pairs the audit flags as fragile actually flip easier than randomly-perturbed pairs.

Live test: Compute α_flip on the top 5 adjacent pairs under AMIP and under uniform-random drop. AMIP should win on every pair.

Component: Robustness intervals

Computing α_flip across 5 pairs · AMIP and random ordering
Full test · 140k offline

Train / held-out split on 140k · predict α_flip on train · verify on held-out.

Live · Claim C2 · Sampling efficiency

Influence-gain sampling reaches a target α_flip with fewer new votes than information-gain or uniform sampling.

Live test: From the same starting BT fit, each sampler queues 30 matchups its way; we draw each outcome as Bernoulli at σ(β_a−β_b), append, warm-restart BT, and recompute α_flip on the top pair. Repeat 6× for a 180-vote budget — the line that climbs fastest wins.

Component: Influence-gain sampling

What each sampler prefers
  • Uniform — every pair equally likely. Today's matchmaking default and the implicit assumption behind bootstrap CIs.
  • Info-gain — close-call, under-sampled pairs. Weight ∝ σ(z)(1−σ(z)) ÷ √battles. Classical active learning.
  • Influence-gain — same, narrow-gap-weighted, with a 5× boost on the rank-1 vs rank-2 pair we're hardening. Our proposal: spend votes where AMIP says they'll move α_flip the most (see the sketch below).
Simulating 3 samplers × 180-vote budget · refitting BT every 30 votes
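
A sketch of the sampler weights and one simulation round (illustrative; the Pair shape and helper names are assumptions, while the σ(z)(1−σ(z)) / √battles weight, the 5× top-pair boost, and the Bernoulli draw at the current fit mirror the description above):

```ts
type Pair = { i: number; j: number; battles: number };

// Per-pair sampling weights for the three samplers described above.
function samplerWeight(pair: Pair, beta: number[], top1: number, top2: number,
                       kind: "uniform" | "info" | "influence"): number {
  if (kind === "uniform") return 1;                                  // today's matchmaking default
  const p = sigmoid(beta[pair.i] - beta[pair.j]);
  const info = (p * (1 - p)) / Math.sqrt(Math.max(1, pair.battles)); // close calls, under-sampled
  if (kind === "info") return info;
  const isTopPair = (pair.i === top1 && pair.j === top2) || (pair.i === top2 && pair.j === top1);
  return info * (isTopPair ? 5 : 1);                                 // influence-gain: 5x boost on the top pair
}

// One 30-vote round: draw synthetic outcomes at the current fit, append, re-audit the top pair.
function simulateRound(votes: Vote[], nModels: number, queue: Array<[number, number]>): number {
  const beta = fitBT(votes, nModels);
  const synthetic: Vote[] = queue.map(([a, b]) => ({
    a, b,
    winner: Math.random() < sigmoid(beta[a] - beta[b]) ? 0 : 1,      // Bernoulli at sigma(beta_a - beta_b)
  }));
  return alphaFlip([...votes, ...synthetic], nModels);               // should climb fastest for influence-gain
}
```
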
Full test · 140k offline

Simulate three samplers on 140k · measure votes-to-target curves on the slowest-tightening pairs.

Live · Claim C3 · Robust aggregation

Influence-capped BT yields a fit with higher α_flip than vanilla BT, at no cost in held-out predictive log-likelihood.

Live test: Cap the top 0.1% of votes by aggregate AMIP influence, refit BT, recompute α_flip on the top 5 adjacent pairs.

Component: Influence-capped BT

Refitting BT under both estimators · recomputing α_flip on 5 pairs
Full test · 140k offline

Refit both estimators on 140k · compare log-likelihood and α_flip distributions head-to-head.

Why a 1,500-vote subsample? The bundled dataset is 1,500 real Chatbot Arena votes from arena-human-preference-140k between the 12 most-active models — fewer than the full 15.9k between them, but more than enough to keep the same algorithmic verdict while each card refits BT in well under a second on a laptop. The full-data runs are described in each card's footer and ship in the project repo.

The team

Built for MGMT 590 — LLM Alignment & Evaluation

Purdue University · presented April 2026.
Rygel Ginete
Vikhyat Yashvanth Koppal
Lichen Mao
The interactive playground and the three live evaluation cards run on REAL Chatbot Arena votes from lmarena-ai/arena-human-preference-140k. We precompute a Bradley–Terry fit on the 15.9k votes between the top-12 most-active models, bundle a deterministic 3,000-vote subsample for the slider (β / CI from the full fit, not the subsample), and ship a 20-model Elo snapshot for the side-panel. The MT-Bench preset is intentionally synthetic — kept as a wide-gap foil to the real arena. No PII; no runtime network calls.
Papers we lean on
Robustness-Aware Leaderboards · April 2026 · Purdue University