MGMT 590 · LLM Evaluation Track · April 2026 · Interactive Playground

Two votes can flip the throne.

On Chatbot Arena, a noise budget of 0.003% — just two battles out of 57,477 — is enough to unseat the top-ranked LLM (Huang et al. 2025). That's the share of votes that could plausibly be off in any leaderboard: ambiguous prompts, rater disagreement, selective reporting. Bootstrap confidence intervals don't catch it. The playground below runs on real arena-human-preference-140k votes — verify the fragility yourself, then watch our framework fix it.

• 0.003% · Votes that flip #1 on Arena
• 6,000× · Arena vs MT-Bench fragility gap
• 2M+ · Battles audited (Singh et al.)
• 3 · Components in our framework
Scroll. Below, you control four levers — dataset, estimator, drop rule, and α — and watch the leaderboard refit live.
Primer

Bradley–Terry, in 30 seconds

Most leaderboards (Chatbot Arena, RewardBench, MT-Bench) compress millions of pairwise battles into one number per model. Here is how that compression works.

How a vote becomes a score
01
Prompt
User submits a question to two anonymous models, A and B.
02
Two responses
Both responses appear side-by-side, models hidden.
03
Vote
User picks: A wins, B wins, or tie.
04
BT fit
All votes go into one MLE. Each model gets a single score.
The model

Each model i gets a latent strength β_i. The probability that i beats j is logistic in the gap.

P(i ≻ j) = e^{β_i} / (e^{β_i} + e^{β_j}) = σ(β_i − β_j)

Estimate β̂ by maximum likelihood over millions of votes. This is equivalent to logistic regression on a one-hot design and converges in seconds.
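
To make that concrete, here is a minimal BT fit as it might look in TypeScript (a sketch with hypothetical names, not the page's lib/ code; ties are ignored and every vote gets weight 1):

```ts
// Minimal Bradley–Terry fit. Hypothetical Vote shape and names; ties skipped for brevity.
type Vote = { a: number; b: number; winner: 0 | 1 }; // winner: 0 = model a, 1 = model b

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

// P(a beats b) under strengths beta: logistic in the gap.
const pWin = (beta: number[], a: number, b: number): number => sigmoid(beta[a] - beta[b]);

// Gradient-ascent MLE, equivalent to logistic regression on a one-hot design.
function fitBT(votes: Vote[], nModels: number, lr = 0.5, iters = 300): number[] {
  const beta = new Array(nModels).fill(0);
  for (let t = 0; t < iters; t++) {
    const grad = new Array(nModels).fill(0);
    for (const { a, b, winner } of votes) {
      const resid = (winner === 0 ? 1 : 0) - pWin(beta, a, b); // observed minus predicted
      grad[a] += resid;
      grad[b] -= resid;
    }
    for (let i = 0; i < nModels; i++) beta[i] += (lr / votes.length) * grad[i];
  }
  const mean = beta.reduce((s, x) => s + x, 0) / nModels; // strengths identified only up to a shift
  return beta.map(x => x - mean);
}
```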

Wide gaps → robust ranks
Narrow gaps → fragile ranks

Below in the playground, the Dataset preset lever switches between real Chatbot Arena votes (narrow gaps, real raters) and a synthetic MT-Bench foil (wide gaps, no raters).

Two failure modes

Statistical fragility meets systemic bias

Arena-style leaderboards face two independent reasons to doubt the number on screen. Both happen at the same time. Both compound.

Fragility · Huang et al. 2025

Removing 0.003% of votes flips the #1 LLM

Two preferences out of 57,477. AMIP (approximate maximum influence perturbation) catches it. Bootstrap CIs miss it because they assume IID resampling, not adversarial removal.

Chatbot Arena · 0.003%
MT-Bench · 18.1%
Source: arXiv:2508.11847
Governance · Singh et al. 2025

The Leaderboard Illusion

An audit of ~2M battles, 243 models, 42 providers. The system itself is biased upstream of any statistical fragility — and the two compound.

• 27 Meta variants tested before LLaMA-4 launch · Providers can test multiple private variants and only release the best.
• 205 models silently removed · Removal without notification breaks the consistency of the comparison set.
• 112% relative gain on ArenaHard from extra Arena data · Providers with more Arena data tune harder to the benchmark, not to general tasks.
• ~40% of all votes go to two providers · Google ~19.2% + OpenAI ~20.4%, while 83 open-weight models share ~29.7%.
Source: arXiv:2504.20879
These are not independent problems. Selective reporting introduces extreme data points; those are precisely the votes AMIP flags as high-leverage. Treat α as your noise budget — the share of votes you'd consider unreliable for any reason: voter disagreement, prompt drift, rater inconsistency, bot-vote filtering, or selective reporting. In the playground below, you'll do the experiment yourself: pick a regime, pick how that noise is distributed, then drag α and watch the rank flip.
The killer demo

Drop a few votes. Watch the leaderboard shuffle.

Start with real Arena votes, then drop a tiny fraction and watch rankings change instantly.

What does “removing votes” actually mean?
It's a sensitivity test. We're asking: if a small fraction of votes were noisy, missing, or biased, would the rank still hold?

Treat α as a noise budget: drop that fraction of votes and see if the ranking still holds.

Voter disagreement · Prompt drift · Rater inconsistency · Bot-vote filtering · Selective reporting · Sampling variance
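
As a sketch of what the slider does under the hood (illustrative TypeScript reusing the fitBT helper from the primer sketch, not the app's lib/amip.ts), one sensitivity check drops the first α fraction of votes under some ordering, refits, and asks whether the leader survived:

```ts
// Does the #1 model survive dropping an alpha fraction of votes?
// `order` lists vote indices from most to least suspect (uniform-random or targeted).
const argmax = (xs: number[]): number => xs.indexOf(Math.max(...xs));

function top1Holds(votes: Vote[], nModels: number, order: number[], alpha: number): boolean {
  const top1 = argmax(fitBT(votes, nModels));      // baseline leader
  const nDrop = Math.floor(alpha * votes.length);  // e.g. alpha = 0.00003 is 0.003%
  const dropped = new Set(order.slice(0, nDrop));
  const kept = votes.filter((_, idx) => !dropped.has(idx));
  return argmax(fitBT(kept, nModels)) === top1;    // refit and compare leaders
}
```
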
Your four levers · every change refits BT live
01
Dataset preset
Which leaderboard regime are we simulating?

Choose real Arena votes or a synthetic foil for contrast.

Currently: Arena · real votes. Real Chatbot Arena votes: gemini-2.5-pro, chatgpt-4o-latest, o3, gemini-2.5-flash and 8 more, with the actual matchup distribution from the dataset. The regime where Arena-style fragility lives.
02
Estimator
How is the Bradley–Terry model fit?

Standard fit vs a defense that down-weights high-leverage votes.

Currently: Vanilla BT. Standard BT-MLE: every vote has weight 1. This is what published Arena, MT-Bench, and RewardBench leaderboards do today.
03
Drop rule
How is that noise distributed across votes?

Targeted worst-case drops vs uniform random drops.

Currently: AMIP worst-case. Models worst-case noise: the votes with the highest computed influence on the rank-1 vs rank-2 gap are the ones that get perturbed. Maps to selective reporting, adversarial filtering, or any process that disproportionately affects high-leverage battles.
04
Drop fraction · α
How much noise do we allow?

Increase α to drop more votes and see when ranks start to break.

α = 0.000% · 0 / 3,000 votes
Currently: Nothing dropped. Move the slider to start.
Try this · four scenarios that tell the whole story
Live readout
The leaderboard you would see right now, given your four levers.
0 of 3,000 votes dropped
Influence histogram
Votes ranked by leverage; amber marks dropped votes.
low influence → high influence
Top-pair α_flip: 1.73% · Top-1 now: gemini-2.5-pro
Robustness-Aware Leaderboard
Live BT refit · AMIP-ranked drop · α = 0.000%
Rank · Model · Provider · Elo [CI] · α_flip · Status
1 (held) · gemini-2.5-pro · Google DeepMind · 1117 [1065, 1169] · 1.73% · ROBUST
2 (held) · chatgpt-4o-latest · OpenAI · 1053 [999, 1107] · 0.033% · FRAGILE
3 (held) · o3 · OpenAI · 1052 [999, 1105] · 0.333% · MODERATE
4 (held) · gemini-2.5-flash · Google DeepMind · 1040 [989, 1091] · 0.933% · MODERATE
5 (held) · qwen3-235b-a22b-no-thinking · Alibaba · 1010 [958, 1061] · 0.333% · MODERATE
6 (held) · gemma-3-27b-it · Google · 997 [943, 1051] · 0.033% · FRAGILE
7 (held) · mistral-medium-2505 · Mistral AI · 997 [946, 1048] · 0.800% · MODERATE
8 (held) · claude-opus-4 · Anthropic · 971 [937, 1005] · 0.167% · MODERATE
9 (held) · claude-sonnet-4 · Anthropic · 966 [915, 1017] · 0.300% · MODERATE
10 (held) · command-a-03-2025 · Cohere · 954 [901, 1008] · 0.700% · MODERATE
11 (held) · claude-3-7-sonnet-20250219-thinking-32k · Anthropic · 925 [872, 977] · 0.133% · MODERATE
12 (held) · claude-3-7-sonnet · Anthropic · 919 [863, 974] · >50% · ROBUST
12 models · 3,000 real votes (of the 15,925 between these models) · seed 1729 · Real Chatbot Arena votes between the 12 most-active models on lmarena-ai/arena-human-preference-140k. β / CI are fit on all 15,925 votes between them; the in-browser slider runs on a deterministic 3,000-vote uniform subsample so AMIP stays responsive.
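
For reference, a deterministic seeded subsample like that can be drawn with a small PRNG (a sketch; mulberry32 and the partial Fisher–Yates shuffle are illustrative choices, not necessarily what the app ships):

```ts
// Small seeded PRNG (mulberry32): the same seed always yields the same subsample.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Deterministic uniform subsample of k items via a partial Fisher–Yates shuffle.
function subsample<T>(items: T[], k: number, seed = 1729): T[] {
  const rand = mulberry32(seed);
  const idx = [...items.keys()];
  for (let i = 0; i < Math.min(k, idx.length); i++) {
    const j = i + Math.floor(rand() * (idx.length - i));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, k).map(i => items[i]);
}
```
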
Our proposal

Robustness-Aware Leaderboards

Three components, one closed loop. Each is a drop-in extension of the existing Chatbot Arena pipeline. None requires new human infrastructure.

Diagnostic01

Robustness intervals

Apply AMIP to each adjacent pair on the leaderboard. Report α_flip alongside every BT score. O(N) cost on top of one BT fit. Drop-in column.
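
For intuition, a simplified version of that computation might look like this (a sketch only: real AMIP, following Broderick et al. 2020, applies an inverse-Hessian correction, whereas this proxy just scores each vote's gradient pull on the rank-1 vs rank-2 gap and scans a grid of drop fractions, reusing the fitBT helper from the primer):

```ts
// Score each vote's first-order pull on the beta gap between two models
// (a crude proxy for AMIP influence; the inverse-Hessian correction is omitted).
function influenceOnGap(votes: Vote[], beta: number[], hi: number, lo: number): number[] {
  return votes.map(({ a, b, winner }) => {
    const resid = (winner === 0 ? 1 : 0) - pWin(beta, a, b); // grad wrt beta[a]; -resid wrt beta[b]
    let score = 0;
    if (a === hi) score += resid; else if (a === lo) score -= resid;
    if (b === hi) score -= resid; else if (b === lo) score += resid;
    return score; // > 0 means the vote currently props up the hi-over-lo gap
  });
}

// alpha_flip for the top pair: smallest tested drop fraction that flips (or closes) the gap.
function alphaFlip(votes: Vote[], nModels: number,
                   grid = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]): number {
  const beta = fitBT(votes, nModels);
  const [top1, top2] = [...beta.keys()].sort((i, j) => beta[j] - beta[i]);
  const scores = influenceOnGap(votes, beta, top1, top2);
  const order = [...votes.keys()].sort((i, j) => scores[j] - scores[i]); // most supportive first
  for (const alpha of grid) {
    const keptIdx = new Set(order.slice(Math.floor(alpha * votes.length)));
    const refit = fitBT(votes.filter((_, i) => keptIdx.has(i)), nModels);
    if (refit[top2] >= refit[top1]) return alpha;
  }
  return Infinity; // robust beyond the largest tested budget
}
```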

Model · Elo [CI] · α_flip · Status
GPT-5 · 1412 [1405, 1419] · 0.0031% · FRAGILE
Claude-4.6 · 1407 [1399, 1414] · 0.41% · MODERATE
Gemini-3 · 1398 [1389, 1407] · 2.8% · ROBUST
Extends Broderick et al. 2020; Huang et al. 2025
Prescriptive02

Influence-gain sampling

Steer the matchup selector toward fragile pairs. Generalizes Fisher-information sampling. Closes the loop: audit → action.

Top-pair α_flip vs new votes spent
Extends Chiang et al. 2024; Frick et al. 2025
Robust estimation03

Influence-capped BT

Cap the top 0.1% of votes by influence magnitude. Or use a Huberized BT loss. Reduces fragility before it is reported.
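
One way the cap could be wired, reusing the earlier fitBT and influenceOnGap sketches (illustrative; a softer variant would down-weight rather than drop, or Huberize the loss):

```ts
// Influence-capped BT: zero-weight the top capFrac of votes by influence magnitude
// on the current top pair, then refit. capFrac = 0.001 is the 0.1% in the text.
function fitCappedBT(votes: Vote[], nModels: number, capFrac = 0.001): number[] {
  const beta = fitBT(votes, nModels);
  const [top1, top2] = [...beta.keys()].sort((i, j) => beta[j] - beta[i]);
  const magnitude = influenceOnGap(votes, beta, top1, top2).map(Math.abs);
  const k = Math.max(1, Math.round(capFrac * votes.length));
  const capped = new Set(
    [...votes.keys()].sort((i, j) => magnitude[j] - magnitude[i]).slice(0, k),
  );
  return fitBT(votes.filter((_, i) => !capped.has(i)), nModels);
}
```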

Per-vote influence distribution · kept vs capped (top 0.1%)
Builds on Hunter 2004; Huber-style M-estimators
Putting it together

Three components, one self-correcting pipeline

Each component is a drop-in extension of the existing Arena pipeline — none requires new human infrastructure. The leverage comes from how they compose: the audit (C1) reads the robust fit (C3), the sampler (C2) acts on the audit, and the next refresh picks up new votes that already reflect the previous loop.

One leaderboard refresh ≈ one loop
Step 01
Fit, robustly
Influence-capped BT
Output: β̂ that zero-weights the top 0.1% of votes by leverage
Step 02
Diagnose fragility
Robustness intervals · AMIP
Output: α_flip column on every adjacent pair
Step 03
Spend new votes
Influence-gain sampling
Output: next-batch matchup queue, concentrated where α_flip is lowest

New battles from step 03 feed step 01's next refit. What every reader sees alongside Elo is the α_flip column from step 02 — and it tightens automatically as step 03 retires fragile pairs. Below, the loop is put on trial: each step ships with a falsifiable claim (C1, C2, C3) that has to pass live on this page.
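
In code terms, one refresh might read roughly like this hypothetical glue over the earlier sketches (the page's actual pipeline lives in lib/amip.ts, lib/cap.ts, and lib/sampler.ts):

```ts
// One leaderboard refresh ~ one loop: robust fit, fragility audit, then spend new votes.
function refresh(votes: Vote[], nModels: number): { beta: number[]; alphaTopPair: number } {
  const beta = fitCappedBT(votes, nModels);       // Step 01: influence-capped fit
  const alphaTopPair = alphaFlip(votes, nModels); // Step 02: alpha_flip on the rank-1 vs rank-2 pair
  // Step 03: queue the next matchup batch with weight concentrated where alpha_flip is lowest,
  // collect those outcomes, append them to `votes`, and run refresh() again on the next cycle.
  return { beta, alphaTopPair };
}
```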

Stress-testing the loop

Three claims, three live verdicts

The loop above only matters if the claims behind it hold. Each component (C1 audit, C2 sampler, C3 fit) maps 1:1 to a testable claim with the same number. The full validation runs offline on the entire lmarena-ai/arena-human-preference-140k dataset; the same code, on a 1.5k-vote real-arena subsample bundled in this app, executes here in your browser the moment you scroll.

Live execution · this device
real 12-model Chatbot Arena · 1,500 votes · deterministic seed
Verdicts: Pass / Partial pass / Needs more evidence · Same code path as the playground above — lib/amip.ts, lib/cap.ts, lib/sampler.ts.
Live · Claim C1 · Calibration

α_flip is well-calibrated. Pairs the audit flags as fragile actually flip easier than randomly-perturbed pairs.

Live test: Compute α_flip on the top 5 adjacent pairs under AMIP and under uniform-random drop. AMIP should win on every pair.

Component: Robustness intervals

Computing α_flip across 5 pairs · AMIP and random ordering
Full test · 140k offline

Train / held-out split on 140k · predict α_flip on train · verify on held-out.

Live · Claim C2 · Sampling efficiency

Influence-gain sampling reaches a target α_flip with fewer new votes than information-gain or uniform sampling.

Live test: From the same starting BT fit, each sampler queues 30 matchups its way; we draw each outcome as Bernoulli at σ(β_a−β_b), append, warm-restart BT, and recompute α_flip on the top pair. Repeat 6× for a 180-vote budget — the line that climbs fastest wins.

Component: Influence-gain sampling

What each sampler prefers
  • Uniform — every pair equally likely. Today's matchmaking default and the implicit assumption behind bootstrap CIs.
  • Info-gain — close-call, under-sampled pairs. Weight ∝ σ(z)(1−σ(z)) ÷ √battles. Classical active learning.
  • Influence-gain — same, narrow-gap-weighted, with a 5× boost on the rank-1 vs rank-2 pair we're hardening. Our proposal: spend votes where AMIP says they'll move α_flip the most (see the sketch below).
Simulating 3 samplers × 180-vote budget · refitting BT every 30 votes
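
A sketch of the sampler weights and one simulation round (illustrative; the Pair shape and helper names are assumptions, while the σ(z)(1−σ(z)) / √battles weight, the 5× top-pair boost, and the Bernoulli draw at the current fit mirror the description above):

```ts
type Pair = { i: number; j: number; battles: number };

// Per-pair sampling weights for the three samplers described above.
function samplerWeight(pair: Pair, beta: number[], top1: number, top2: number,
                       kind: "uniform" | "info" | "influence"): number {
  if (kind === "uniform") return 1;                                  // today's matchmaking default
  const p = sigmoid(beta[pair.i] - beta[pair.j]);
  const info = (p * (1 - p)) / Math.sqrt(Math.max(1, pair.battles)); // close calls, under-sampled
  if (kind === "info") return info;
  const isTopPair = (pair.i === top1 && pair.j === top2) || (pair.i === top2 && pair.j === top1);
  return info * (isTopPair ? 5 : 1);                                 // influence-gain: 5x boost on the top pair
}

// One 30-vote round: draw synthetic outcomes at the current fit, append, re-audit the top pair.
function simulateRound(votes: Vote[], nModels: number, queue: Array<[number, number]>): number {
  const beta = fitBT(votes, nModels);
  const synthetic: Vote[] = queue.map(([a, b]) => ({
    a, b,
    winner: Math.random() < sigmoid(beta[a] - beta[b]) ? 0 : 1,      // Bernoulli at sigma(beta_a - beta_b)
  }));
  return alphaFlip([...votes, ...synthetic], nModels);               // should climb fastest for influence-gain
}
```
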
Full test · 140k offline

Simulate three samplers on 140k · measure votes-to-target curves on the slowest-tightening pairs.

Live · Claim C3 · Robust aggregation

Influence-capped BT yields a fit with higher α_flip than vanilla BT, at no cost in held-out predictive log-likelihood.

Live test: Cap the top 0.1% of votes by aggregate AMIP influence, refit BT, recompute α_flip on the top 5 adjacent pairs.

Component: Influence-capped BT

Refitting BT under both estimators · recomputing α_flip on 5 pairs
Full test · 140k offline

Refit both estimators on 140k · compare log-likelihood and α_flip distributions head-to-head.

Why a 1,500-vote subsample? The bundled dataset is 1,500 real Chatbot Arena votes from arena-human-preference-140k between the 12 most-active models — fewer than the full 15.9k between them, but more than enough to keep the same algorithmic verdict while each card refits BT in well under a second on a laptop. The full-data runs are described in each card's footer and ship in the project repo.

The team

Built for MGMT 590 — LLM Alignment & Evaluation

Purdue University · presented April 2026.
Rygel Ginete
Vikhyat Yashvanth Koppal
Lichen Mao
The interactive playground and the three live evaluation cards run on REAL Chatbot Arena votes from lmarena-ai/arena-human-preference-140k. We precompute a Bradley–Terry fit on the 15.9k votes between the top-12 most-active models, bundle a deterministic 3,000-vote subsample for the slider (β / CI from the full fit, not the subsample), and ship a 20-model Elo snapshot for the side-panel. The MT-Bench preset is intentionally synthetic — kept as a wide-gap foil to the real arena. No PII; no runtime network calls.
Papers we lean on
Robustness-Aware Leaderboards · April 2026 · Purdue University