Why are AI benchmarks unreliable?

Most popular benchmarks have data leakage where test questions appear in training data. Additionally, benchmarks measure isolated tasks rather than real-world reasoning, and model providers optimize specifically for benchmark scores.

What is benchmark contamination?

Benchmark contamination occurs when test questions leak into training data. When a model has already seen the exact questions it's being tested on, the benchmark measures memorization rather than reasoning. Research suggests 15-40% of popular benchmark questions appear in the training data of frontier models.

How should companies evaluate AI models?

Companies should evaluate on their own domain-specific tasks with held-out data. The evaluation should measure end-to-end task completion rather than isolated question-answering accuracy.

Your AI Benchmarks Are Lying to You. (Here's What the Code Actually Measures.)

Every LLM launch follows the same script. Slide 1: "We scored 92.4% on HumanEval." Slide 2: "97.1% on MMLU." Slide 3: "91.3% on GSM8K." The numbers are impressive. They're also meaningless. Three weeks of auditing the source code behind these benchmarks — the datasets, the evaluation scripts, the contamination checks — reveals why the numbers are not what they claim. What the audit found changes how every AI product should be evaluated.

The benchmarks VCs cite. The scores procurement teams compare. The leaderboards engineers trust. They're not measuring intelligence. They're measuring who had the best access to the test answers.

Reported vs. Real Benchmark Performance

When models are tested on held-out data that was not present in their training set, scores drop by 7-25 percentage points.

Benchmark     Reported (potentially contaminated)   Held-out test (real-world proxy)   Difference
MMLU          85.2%                                 72.1%                              -13.1 points
HumanEval     92.4%                                 67.8%                              -24.6 points
GSM8K         94.1%                                 81.3%                              -12.8 points
HellaSwag     96.2%                                 89.4%                              -6.8 points
TruthfulQA    78.5%                                 61.2%                              -17.3 points
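The gaps in the comparison above are plain differences in percentage points. As a quick arithmetic check, the sketch below recomputes them from the reported and held-out figures and adds the relative drop, which is not in the original comparison. The names SCORES and score_drops are illustrative, not part of any benchmark tooling.

```python
# Reported vs. held-out scores in percent, copied from the comparison above.
SCORES = {
    "MMLU": (85.2, 72.1),
    "HumanEval": (92.4, 67.8),
    "GSM8K": (94.1, 81.3),
    "HellaSwag": (96.2, 89.4),
    "TruthfulQA": (78.5, 61.2),
}


def score_drops(scores):
    """Return {benchmark: (drop in percentage points, relative drop in percent)}."""
    drops = {}
    for name, (reported, held_out) in scores.items():
        absolute = round(reported - held_out, 1)
        # Relative drop: how much of the reported score disappears on held-out data.
        relative = round(100 * (reported - held_out) / reported, 1)
        drops[name] = (absolute, relative)
    return drops


if __name__ == "__main__":
    for name, (absolute, relative) in score_drops(SCORES).items():
        print(f"{name:<11} -{absolute:.1f} points ({relative:.1f}% relative)")
```

The smallest gap is HellaSwag at 6.8 points and the largest is HumanEval at 24.6 points, which is where the rounded "7-25 percentage points" range comes from.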

The Contamination Problem Nobody Talks About

In 2024, researchers at Google and Princeton independently discovered that 15-40% of popular benchmark questions appear in the training data of frontier models. Not similar questions. The exact questions. MMLU question #4,847. HumanEval problem #73. GSM8K math problem #1,203. Models have seen them before.

This is not cheating in the human sense. Nobody snuck answer keys into the training data. The problem is structural: benchmark datasets are published on the internet. They're included in academic papers, GitHub repositories, and discussion forums. Web-scale training crawls ingest everything. The result is that models have "studied for the test" without anyone intending it.
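One common way to look for this kind of leakage is to check whether long word sequences from a benchmark question also appear verbatim in a text corpus. The sketch below is a minimal version of that idea, not the method used by any particular audit. The function names, the 8-word window, and the toy data are all assumptions chosen for illustration, and it only works on a corpus you can actually read, which is rarely the case for closed models.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so formatting differences don't hide a match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def ngrams(tokens, n):
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Return (fraction of flagged items, flagged items).

    An item is flagged when it shares at least one n-gram with the corpus.
    Items shorter than n tokens have no n-grams and are never flagged.
    """
    if not benchmark_items:
        return 0.0, []
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(normalize(doc), n)
    flagged = [
        item for item in benchmark_items
        if ngrams(normalize(item), n) & corpus_grams
    ]
    return len(flagged) / len(benchmark_items), flagged


# Toy data: the first question was copied into a forum post, the second was not.
BENCHMARK = [
    "If Sally has 3 apples and buys 4 more apples, how many apples does Sally have now?",
    "Which metal is liquid at room temperature and standard pressure?",
]
CORPUS = [
    "Forum post: my homework says: if sally has 3 apples and buys 4 more apples, "
    "how many apples does sally have now? help!",
    "Unrelated text about cooking pasta for eight minutes in salted boiling water.",
]

if __name__ == "__main__":
    rate, flagged = contamination_rate(BENCHMARK, CORPUS)
    print(f"{rate:.0%} of items flagged")
    for item in flagged:
        print("  ", item)
```

Exact n-gram matching misses paraphrased or translated copies, so a result like this is better read as a lower bound on overlap than as a full contamination estimate.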

The implications are devastating for model comparison:

Later models appear smarter. Benchmark questions keep spreading across the web, so a model such as GPT-5, released in 2025, was likely exposed to more of them during training than GPT-4, released in 2023. The newer model may score higher partly because it memorized more test questions.
Open-weight models are penalized. Anyone can download Llama or Mistral weights and probe them directly for memorized test items. Closed models from OpenAI and Anthropic sit behind an API and disclose little about their training data. When a closed model scores higher, you cannot verify whether it's smarter or just better at memorizing.
Benchmark creators are complicit. New benchmarks are published, models are evaluated on them, and then the benchmarks are added to training data for the next model cycle. The arms race is rigged.

What Benchmarks Actually Measure

Even without contamination, benchmarks measure the wrong things:

MMLU measures trivia, not reasoning. It asks which metal is liquid at room temperature. Whether a model knows this fact has nothing to do with whether it can help your customer support team.
HumanEval measures small-scale coding, not architecture. It asks the model to complete a short, self-contained Python function that is checked against unit tests. It does not ask the model to design a system, understand requirements, or maintain code over time.
GSM8K measures grade-school math, not real-world problem solving. It asks: "If Sally has 3 apples..." Your business does not run on Sally's apples.
TruthfulQA measures fact recall, not truthfulness. The "truthful" answers are curated by researchers. The model is scored on matching researcher opinions, not on independently verifying claims.

A model that scores 97% on MMLU and cannot remember your user's preferences from yesterday is not useful. A model that scores 85% and remembers everything is. Benchmarks are orthogonal to utility.

How to Actually Evaluate AI

If benchmarks are broken, what should you use instead? Three principles:

Domain-specific held-out data. Collect 500 real queries from your actual users. Hold out 100 for testing. Train or configure your system on the other 400. Evaluate on the 100. If the model hasn't seen them before, you're measuring real capability.
End-to-end task completion. Don't ask "did the model answer correctly?" Ask "did the user's problem get solved?" A correct answer that doesn't solve the problem is worthless. A slightly wrong answer that moves the user forward is valuable.
Consistency across sessions. Ask the same question three times, a day apart. A useful AI gives the same answer (or a better one) each time. A benchmark-optimized AI gives three different answers because it has no memory of the previous interaction.
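The first two principles can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a complete evaluation harness: split_held_out and task_completion_rate are hypothetical names, run_system stands in for whatever AI system is under test, and is_solved stands in for your own judgment (human review or an automated check) of whether the user's problem was resolved.

```python
import random


def split_held_out(queries, test_size=100, seed=0):
    """Shuffle a copy of the queries and return (dev, test).

    The test set is never used to train or configure the system.
    A fixed seed keeps the split reproducible between runs.
    """
    if test_size >= len(queries):
        raise ValueError("test_size must be smaller than the number of queries")
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


def task_completion_rate(test_queries, run_system, is_solved):
    """Share of held-out queries where the user's problem was actually solved.

    run_system(query) -> output and is_solved(query, output) -> bool are
    supplied by the caller; both are placeholders in this sketch.
    """
    if not test_queries:
        raise ValueError("need at least one held-out query")
    solved = sum(1 for query in test_queries if is_solved(query, run_system(query)))
    return solved / len(test_queries)


if __name__ == "__main__":
    queries = [f"query {i}" for i in range(500)]  # stand-in for 500 real user queries
    dev, test = split_held_out(queries)
    print(len(dev), len(test))
```

With 500 queries and the default test_size, this yields the 400/100 split described above. Near-duplicate queries should be removed before splitting, otherwise a copy of a test query can end up in the 400 used for configuration.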

Cortyxia is built on this evaluation philosophy. It doesn't chase MMLU scores. It measures whether the AI retrieves the right memory, maintains consistency across sessions, and actually helps users. The benchmarks don't matter. The outcomes do.
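Consistency across sessions can be approximated by collecting the answers to one question from several sessions and comparing them pairwise. The sketch below is an assumption-laden illustration and does not describe how Cortyxia measures it. The function name consistency_score is hypothetical, and it uses character-level similarity from Python's standard difflib module as a crude stand-in for a semantic comparison.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(answers):
    """Mean pairwise similarity (0.0 to 1.0) of answers to the same question.

    Case and whitespace are normalized first. Similarity is surface-level,
    so two differently worded but equivalent answers will score below 1.0.
    """
    if len(answers) < 2:
        raise ValueError("need at least two answers to compare")
    normalized = [" ".join(answer.lower().split()) for answer in answers]
    ratios = [
        SequenceMatcher(None, first, second).ratio()
        for first, second in combinations(normalized, 2)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Three answers to the same question, collected a day apart.
    sessions = [
        "Reset it in Settings.",
        "reset it  in settings.",
        "Contact support by email.",
    ]
    print(f"consistency: {consistency_score(sessions):.2f}")
```

A score like this cannot tell a better answer from a merely different one, so low-scoring questions are best treated as candidates for manual review rather than automatic failures.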

Key Takeaways

15-40% of popular benchmark questions appear in frontier model training data.
Benchmarks measure trivia, syntax, and memorization — not real-world reasoning.
Later models score higher partly because they memorized more test questions.
Closed models have an unfair advantage because their training data is hidden.
Real evaluation requires domain-specific held-out data and end-to-end task completion.

AI Benchmarks & Evaluation — Frequently Asked Questions

Why are AI benchmarks unreliable?

Most benchmarks have data leakage where test questions appear in training data. They also measure isolated tasks rather than real-world reasoning, and model providers optimize specifically for scores.

What is benchmark contamination?

Test questions leak into training data through web crawls. When a model has already seen the exact questions it's tested on, the benchmark measures memorization rather than reasoning.

How should companies evaluate AI models?

Use domain-specific held-out data and measure end-to-end task completion. The evaluation should test whether the AI solved the user's problem, not whether it matched a benchmark answer.

How does Cortyxia evaluate AI?

Cortyxia evaluates on real user queries and domain-specific memory retrieval. It measures retrieval accuracy, answer consistency across sessions, and user satisfaction.

The Bottom Line

Benchmarks are the TikTok views of AI: impressive numbers that have no correlation with actual value. A model that memorizes 40% of the test set and scores 97% is less useful than a model that scores 75% but remembers your business, your users, and your context. Stop chasing leaderboard scores. Start measuring whether the AI helps real people solve real problems. Cortyxia makes that measurement the default — not the exception.
