Two votes can flip the throne.
On Chatbot Arena, a noise budget of 0.003% — just two battles out of 57,477 — is enough to unseat the top-ranked LLM (Huang et al. 2025). That's the share of votes that could plausibly be off in any leaderboard: ambiguous prompts, rater disagreement, selective reporting. Bootstrap confidence intervals don't catch it. The playground below runs on real arena-human-preference-140k votes — verify the fragility yourself, then watch our framework fix it.

Bradley–Terry, in 30 seconds
Most leaderboards (Chatbot Arena, RewardBench, MT-Bench) compress millions of pairwise battles into one number per model. Here is how that compression works.
Each model i gets a latent strength β_i. The probability that i beats j is logistic in the gap: P(i beats j) = σ(β_i − β_j), where σ(z) = 1/(1 + e^(−z)).
Estimate the β's by maximum likelihood over millions of votes — equivalent to logistic regression on a one-hot design. It converges in seconds.
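The model and fit above can be sketched in a few lines of TypeScript. This is a toy gradient-ascent estimator under simple assumptions, not the Arena's production code; `fitBT` and its hyperparameters are illustrative:

```typescript
// Minimal Bradley–Terry fit by gradient ascent — an illustrative sketch.
// `battles` lists (winner, loser) index pairs; returns one strength per model.
type Battle = { winner: number; loser: number };

const sigma = (z: number): number => 1 / (1 + Math.exp(-z));

function fitBT(nModels: number, battles: Battle[], steps = 2000, lr = 0.1): number[] {
  const beta = new Array<number>(nModels).fill(0);
  for (let s = 0; s < steps; s++) {
    const grad = new Array<number>(nModels).fill(0);
    for (const { winner, loser } of battles) {
      // Gradient of log σ(β_w − β_l): push the winner up, the loser down.
      const p = sigma(beta[winner] - beta[loser]);
      grad[winner] += 1 - p;
      grad[loser] -= 1 - p;
    }
    for (let i = 0; i < nModels; i++) beta[i] += (lr / battles.length) * grad[i];
    // Identifiability: only gaps matter, so pin the mean strength at 0.
    const mean = beta.reduce((a, b) => a + b, 0) / nModels;
    for (let i = 0; i < nModels; i++) beta[i] -= mean;
  }
  return beta;
}
```

With two wins for model 0 and one for model 1, the fitted gap lands near ln 2 — exactly the point where σ(gap) matches the empirical 2/3 win rate.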
Below in the playground, the Dataset preset lever switches between real Chatbot Arena votes (narrow gaps, real raters) and a synthetic MT-Bench foil (wide gaps, no raters).
Statistical fragility meets systemic bias
Arena-style leaderboards face two independent reasons to doubt the number on screen. Both happen at the same time. Both compound.
Removing 0.003% of votes flips the #1 LLM
Two preferences out of 57,477. AMIP (Approximate Maximum Influence Perturbation) catches it. Bootstrap CIs miss it because they assume IID resampling, not adversarial removal.
The Leaderboard Illusion
An audit of ~2M battles, 243 models, 42 providers. The system itself is biased upstream of any statistical fragility — and the two compound.
Drop a few votes. Watch the leaderboard shuffle.
Start with real Arena votes, then drop a tiny fraction and watch rankings change instantly.
Treat α as a noise budget: drop that fraction of votes and see if the ranking still holds.
Choose real Arena votes or a synthetic foil for contrast.
Standard fit vs a defense that down-weights high-leverage votes.
Targeted worst-case drops vs uniform random drops.
Increase α to drop more votes and see when ranks start to break.
Robustness-Aware Leaderboards
Three components, one closed loop. Each is a drop-in extension of the existing Chatbot Arena pipeline. None requires new human infrastructure.
Robustness intervals
Apply AMIP to each adjacent pair on the leaderboard. Report α_flip alongside every BT score. O(N) cost on top of one BT fit. Drop-in column.
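A stripped-down sketch of the α_flip computation: this version treats the gap g = β_a − β_b of one adjacent pair as the only free parameter and approximates each vote's influence with a one-step Newton correction (score over Fisher information). The real `lib/amip.ts` is assumed to handle the full multi-model Hessian; `alphaFlip` and its approximations are illustrative:

```typescript
// α_flip for one adjacent pair — simplified one-parameter AMIP sketch.
// `gap` is the fitted β_a − β_b; `outcomes[n]` is 1 if a won vote n, 0 if b won.
const sigma = (z: number): number => 1 / (1 + Math.exp(-z));

function alphaFlip(gap: number, outcomes: number[]): number {
  const p = sigma(gap);
  const info = outcomes.length * p * (1 - p); // Fisher information of the gap
  // Score of vote n w.r.t. g is (y_n − p); removing it shifts the MLE by
  // roughly −score/info. Sort so the most gap-shrinking removals come first.
  const shifts = outcomes
    .map((y) => -(y - p) / info)
    .sort((a, b) => a - b);
  let g = gap;
  let removed = 0;
  for (const d of shifts) {
    if (g <= 0) break;      // rank already flipped
    g += d;
    removed++;
  }
  // Fraction of votes that must be dropped to flip the pair (1 = never flips
  // under this linear approximation).
  return g <= 0 ? removed / outcomes.length : 1;
}
```

On a 55–45 split with a 0.1 gap, this reports a flip budget of a few percent — the same order as the two-in-57,477 headline once gaps tighten.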
Influence-gain sampling
Steer the matchup selector toward fragile pairs. Generalizes Fisher-information sampling. Closes the loop: audit → action.
Influence-capped BT
Cap the top 0.1% of votes by influence magnitude. Or use a Huberized BT loss. Reduces fragility before it is reported.
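The capping step can be sketched as a weighting pass before the refit. Everything here is an assumption for illustration: `capWeights` is a hypothetical helper, and the influence proxy it expects (e.g. per-vote score magnitude at the current fit) stands in for the full AMIP influence that `lib/cap.ts` would compute:

```typescript
// Influence-capped BT, step one: turn per-vote influence magnitudes into
// weights that clip the top `capFrac` of votes down to the cap threshold.
// A weighted BT refit (multiply each vote's gradient by its weight) follows.
function capWeights(influences: number[], capFrac = 0.001): number[] {
  const sorted = [...influences].sort((a, b) => b - a);
  const k = Math.floor(influences.length * capFrac);
  // Threshold = the (k+1)-th largest influence; the k votes above it get capped.
  const threshold = sorted[Math.min(k, sorted.length - 1)];
  // Scale each over-threshold vote's weight so its effective influence equals
  // the threshold; everyone else keeps full weight. (Assumes threshold > 0.)
  return influences.map((v) => (v > threshold ? threshold / v : 1));
}
```

The alternative mentioned above — a Huberized BT loss — achieves the same end inside the loss itself, by flattening the gradient of extreme votes instead of reweighting them afterward.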
Three components, one self-correcting pipeline
Each component is a drop-in extension of the existing Arena pipeline — none requires new human infrastructure. The leverage comes from how they compose: the audit (C1) reads the robust fit (C3), the sampler (C2) acts on the audit, and the next refresh picks up new votes that already reflect the previous loop.
New battles from step 03 feed step 01's next refit. What every reader sees alongside Elo is the α_flip column from step 02 — and it tightens automatically as step 03 retires fragile pairs. Below, the loop is put on trial: each step ships with a falsifiable claim (C1, C2, C3) that has to pass live on this page.
Three claims, three live verdicts
The loop above only matters if the claims behind it hold. Each component (C1 audit, C2 sampler, C3 fit) maps 1:1 to a testable claim with the same number. The full validation runs offline on the entire lmarena-ai/arena-human-preference-140k dataset; the same code, on a 1.5k-vote real-arena subsample bundled in this app, executes here in your browser the moment you scroll.
Code: lib/amip.ts, lib/cap.ts, lib/sampler.ts. Claim: α_flip is well-calibrated — pairs the audit flags as fragile actually flip more easily than randomly-perturbed pairs.
Live test: Compute α_flip on the top 5 adjacent pairs under AMIP and under uniform-random drop. AMIP should win on every pair.
Component: Robustness intervals
Train / held-out split on 140k · predict α_flip on train · verify on held-out.
Influence-gain sampling reaches a target α_flip with fewer new votes than information-gain or uniform sampling.
Live test: From the same starting BT fit, each sampler queues 30 matchups its way; we draw each outcome as Bernoulli at σ(β_a−β_b), append, warm-restart BT, and recompute α_flip on the top pair. Repeat 6× for a 180-vote budget — the line that climbs fastest wins.
Component: Influence-gain sampling
- Uniform — every pair equally likely. Today's matchmaking default and the implicit assumption behind bootstrap CIs.
- Info-gain — close-call, under-sampled pairs. Weight ∝ σ(z)(1−σ(z))/√battles. Classical active learning.
- Influence-gain — same, but narrow-gap-weighted, with a 5× boost on the rank-1 vs rank-2 pair we're hardening. Our proposal: spend votes where AMIP says they'll move α_flip the most.
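The three weightings above can be written down directly. The info-gain formula is from the text; the exact narrow-gap factor in influence-gain is an assumption (an exponential decay in |z| is one plausible choice — `lib/sampler.ts` may weight differently), and the function names are illustrative:

```typescript
// Matchup-selection weights for the three samplers. `z` is the current gap
// β_a − β_b for a candidate pair, `battles` its existing vote count, and
// `isTopPair` flags the rank-1 vs rank-2 matchup being hardened.
const sigma = (z: number): number => 1 / (1 + Math.exp(-z));

// Uniform: every pair equally likely (today's matchmaking default).
const uniformWeight = (): number => 1;

// Info-gain: close calls with few battles. Weight ∝ σ(z)(1−σ(z))/√battles.
function infoGainWeight(z: number, battles: number): number {
  const p = sigma(z);
  return (p * (1 - p)) / Math.sqrt(Math.max(1, battles));
}

// Influence-gain: info-gain, narrow-gap-weighted (assumed form: e^(−|z|)),
// with the 5× boost on the top pair from the text.
function influenceGainWeight(z: number, battles: number, isTopPair: boolean): number {
  const narrowGapFactor = Math.exp(-Math.abs(z)); // assumption, not from the source
  return infoGainWeight(z, battles) * narrowGapFactor * (isTopPair ? 5 : 1);
}
```

In practice the selector would normalize these weights over all candidate pairs and sample the next matchup from the resulting distribution.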
Simulate three samplers on 140k · measure votes-to-target curves on the slowest-tightening pairs.
Influence-capped BT yields a fit with higher α_flip than vanilla BT, at no cost in held-out predictive log-likelihood.
Live test: Cap the top 0.1% of votes by aggregate AMIP influence, refit BT, recompute α_flip on the top 5 adjacent pairs.
Component: Influence-capped BT
Refit both estimators on 140k · compare log-likelihood and α_flip distributions head-to-head.
Why a 1,500-vote subsample? The bundled dataset is 1,500 real Chatbot Arena votes from arena-human-preference-140k between the 12 most-active models — fewer than the full 15.9k between them, but more than enough to keep the same algorithmic verdict while each card refits BT in well under a second on a laptop. The full-data runs are described in each card's footer and ship in the project repo.
Built for MGMT 590 — LLM Alignment & Evaluation
Data: lmarena-ai/arena-human-preference-140k. We precompute a Bradley–Terry fit on the 15.9k votes between the top-12 most-active models, bundle a deterministic 3,000-vote subsample for the slider (β / CI from the full fit, not the subsample), and ship a 20-model Elo snapshot for the side-panel. The MT-Bench preset is intentionally synthetic — kept as a wide-gap foil to the real arena. No PII; no runtime network calls.