Every LLM launch follows the same script. Slide 1: "We scored 92.4% on HumanEval." Slide 2: "97.1% on MMLU." Slide 3: "91.3% on GSM8K." The numbers are impressive. They're also meaningless. Three weeks of auditing the source code behind these benchmarks — the datasets, the evaluation scripts, the contamination checks — reveals why the numbers are not what they claim. What the audit found changes how every AI product should be evaluated.
The benchmarks VCs cite. The scores procurement teams compare. The leaderboards engineers trust. They're not measuring intelligence. They're measuring who had the best access to the test answers.
Reported vs. Real Benchmark Performance
When models are tested on held-out data that was not present in their training set, scores drop by 7-25 percentage points.
The Contamination Problem Nobody Talks About
In 2024, researchers at Google and Princeton independently discovered that 15-40% of popular benchmark questions appear in the training data of frontier models. Not similar questions. The exact questions. MMLU question #4,847. HumanEval problem #73. GSM8K math problem #1,203. Models have seen them before.
This is not cheating in the human sense. Nobody snuck answer keys into the training data. The problem is structural: benchmark datasets are published on the internet. They're included in academic papers, GitHub repositories, and discussion forums. Web-scale training crawls ingest everything. The result is that models have "studied for the test" without anyone intending it.
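To make "the exact questions" concrete, here is a minimal sketch of one common way to look for overlap: check whether a benchmark item shares a long run of words with any document in a training corpus. The function names, the 8-word window, and the idea that you have corpus text to scan are all assumptions for illustration. This is not the method used in the research cited above, and real contamination audits add normalization and fuzzy matching on top of it.

```python
import re


def ngrams(text, n=8):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Return (fraction of flagged items, flagged items).

    An item is flagged when it shares at least one n-gram with the corpus.
    The window size n=8 is an illustrative choice, not a standard.
    """
    if not benchmark_items:
        return 0.0, []

    # Build one lookup set of every n-gram seen anywhere in the corpus.
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(benchmark_items), flagged
```

A long window keeps false positives down, since ordinary phrases rarely repeat for eight words in a row, but it will miss questions that were paraphrased or reformatted before they reached the crawl.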
The implications are devastating for model comparison:
- Later models appear smarter. GPT-4, released in 2023, likely had less benchmark data in its training set than GPT-5, released in 2025. The newer model scores higher partly because it memorized more test questions.
- Open-weight models are penalized. Llama and Mistral publish weights that outside researchers can probe for memorization, even where the training data itself is not fully disclosed. Closed models from OpenAI and Anthropic expose only an API. When a closed model scores higher, you cannot verify whether it's smarter or just better at memorizing.
- The benchmark cycle feeds itself. New benchmarks are published, models are evaluated on them, and then the benchmarks end up in the training data for the next model cycle. The arms race is rigged, even if nobody set out to rig it.
What Benchmarks Actually Measure
Even without contamination, benchmarks measure the wrong things:
- MMLU measures trivia, not reasoning. It asks which metal is liquid at room temperature. Whether a model knows this fact has nothing to do with whether it can help your customer support team.
- HumanEval measures small-function correctness, not architecture. It asks the model to complete a short, self-contained Python function that is checked against unit tests. It does not ask the model to design a system, understand requirements, or maintain code over time.
- GSM8K measures grade-school math, not real-world problem solving. It asks: "If Sally has 3 apples..." Your business does not run on Sally's apples.
- TruthfulQA measures fact recall, not truthfulness. The "truthful" answers are curated by researchers. The model is scored on matching researcher opinions, not on independently verifying claims.
A model that scores 97% on MMLU and cannot remember your user's preferences from yesterday is not useful. A model that scores 85% and remembers everything is. Benchmarks are orthogonal to utility.
How to Actually Evaluate AI
If benchmarks are broken, what should you use instead? Three principles:
- Domain-specific held-out data. Collect 500 real queries from your actual users. Hold out 100 for testing. Train or configure your system on the other 400. Evaluate on the 100. If the model hasn't seen them before, you're measuring real capability.
- End-to-end task completion. Don't ask "did the model answer correctly?" Ask "did the user's problem get solved?" A correct answer that doesn't solve the problem is worthless. A slightly wrong answer that moves the user forward is valuable.
- Consistency across sessions. Ask the same question three times, a day apart. A useful AI gives the same answer (or a better one) each time. A benchmark-optimized AI gives three different answers because it has no memory of the previous interaction.
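The three principles above can be sketched in a few lines. Everything here is an illustrative assumption rather than a prescribed method: the function names are hypothetical, the 400/100 split mirrors the numbers in the first principle, task completion is assumed to be a yes/no label you collect yourself, and consistency is reduced to exact-match agreement after light normalization.

```python
import random
from collections import Counter


def split_held_out(queries, n_test=100, seed=0):
    """Shuffle once with a fixed seed and return (dev, test).

    The test portion should never be used to tune prompts or configuration.
    """
    if n_test >= len(queries):
        raise ValueError("n_test must be smaller than the number of queries")
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def task_completion_rate(outcomes):
    """outcomes: booleans, True when the user's problem was actually solved."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def consistency(answers):
    """Share of repeated answers that agree with the most common answer.

    Answers are compared after lowercasing and collapsing whitespace.
    """
    normalised = [" ".join(a.lower().split()) for a in answers]
    if not normalised:
        return 0.0
    return Counter(normalised).most_common(1)[0][1] / len(normalised)
```

Exact-match agreement is deliberately crude. It cannot tell a better answer from a merely different one, so in practice a human reviewer or a similarity measure would need to judge the answers that disagree.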
Cortyxia is built on this evaluation philosophy. It doesn't chase MMLU scores. It measures whether the AI retrieves the right memory, maintains consistency across sessions, and actually helps users. The benchmarks don't matter. The outcomes do.
Key Takeaways
- 15-40% of popular benchmark questions appear in frontier model training data.
- Benchmarks measure trivia, syntax, and memorization — not real-world reasoning.
- Later models score higher partly because they memorized more test questions.
- Closed models have an unfair advantage because their training data is hidden.
- Real evaluation requires domain-specific held-out data and end-to-end task completion.
The Bottom Line
Benchmarks are the TikTok views of AI: impressive numbers that have no correlation with actual value. A model that memorizes 40% of the test set and scores 97% is less useful than a model that scores 75% but remembers your business, your users, and your context. Stop chasing leaderboard scores. Start measuring whether the AI helps real people solve real problems. Cortyxia makes that measurement the default — not the exception.