
ZendoBench
Testing scientific agency in AI scientists
An interactive benchmark where LLM-based agents infer hidden rules over koans, publish hypotheses, and survive adversarial peer review.
A benchmark for scientific minds
Explore how well AI agents think like scientists—forming hypotheses, running experiments, and updating when proven wrong.
The Big Idea
Imagine a game where someone has a secret rule—like "every structure must have at least one red piece." You don't know the rule. Your job? Build little structures, guess the rule, and see if you're right.
Zendo is that game. We put AI scientists in the same spot: can they discover the rule by experimenting, guessing, and learning from being wrong?
Koans = Little Experiments
A koan is a small arrangement of colored pieces (pyramids, blocks, wedges). Each one either follows or violates the secret rule. Try revealing a few below—can you spot the pattern?
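To make that concrete, here is a minimal sketch of how a koan and a hidden rule might be represented. The Piece and Koan names and the example rule are illustrative only, not the benchmark's actual schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class Piece:
    color: str   # e.g. "red", "blue", "yellow"
    shape: str   # e.g. "pyramid", "block", "wedge"

# A koan is just an arrangement of pieces; the secret rule is a predicate over it.
Koan = tuple[Piece, ...]

def secret_rule(koan: Koan) -> bool:
    # Example rule: every structure must have at least one red piece.
    return any(p.color == "red" for p in koan)

# An agent's experiment: build a koan, then observe whether it follows the rule.
experiment = (Piece("blue", "block"), Piece("red", "pyramid"))
print(secret_rule(experiment))  # True: it contains a red piece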
Guess, Then Get Checked
When the AI thinks it knows the rule, it writes it down. Then comes the cool part: peer review. A "Master" tries to disprove the guess by building a counterexample—a structure that breaks the AI's rule but fits the real one.
Good scientists revise when they're wrong. We measure whether AI does that too—or just keeps adding excuses.
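Under the hood, that review step can be thought of as a search for a structure the guess gets wrong. Here is a hedged sketch reusing the Piece and secret_rule definitions from the sketch above; the function name and the brute-force search are illustrative, not the benchmark's actual Master implementation.

from itertools import product

def find_counterexample(hypothesis, true_rule, colors, shapes, max_pieces=3):
    # Brute-force search over small koans for a structure that breaks the
    # hypothesised rule while still satisfying the real one.
    palette = [Piece(c, s) for c, s in product(colors, shapes)]
    for n in range(1, max_pieces + 1):
        for koan in product(palette, repeat=n):
            if true_rule(koan) and not hypothesis(koan):
                return koan  # counterexample found: the hypothesis is refuted
    return None  # hypothesis survives review, at least up to this search depth

# The agent guesses "at least one pyramid"; the real rule is "at least one red piece".
guess = lambda koan: any(p.shape == "pyramid" for p in koan)
print(find_counterexample(guess, secret_rule, ["red", "blue"], ["pyramid", "block"]))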
Now try a harder one
Real benchmark rules are more complex—they can involve counting, comparing quantities, or tracking multiple attributes at once. Can you figure out this rule from four koans?
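For a sense of what "more complex" means, here are a few hypothetical rules of that flavour, written against the Piece sketch above. They are illustrations only, not the rule hidden in this puzzle.

def rule_exactly_two_pyramids(koan):
    # Counting: the structure contains exactly two pyramids.
    return sum(p.shape == "pyramid" for p in koan) == 2

def rule_more_red_than_blue(koan):
    # Comparing quantities: more red pieces than blue pieces.
    return sum(p.color == "red" for p in koan) > sum(p.color == "blue" for p in koan)

def rule_red_and_no_wedges(koan):
    # Multiple attributes at once: at least one red piece and no wedges.
    return any(p.color == "red" for p in koan) and all(p.shape != "wedge" for p in koan)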
What We're Testing
Scientific agency: Can AI design good experiments? State clear hypotheses? Handle being wrong gracefully? This benchmark probes those skills in a controlled, measurable way.
Can you beat the AI?
Put your own scientific reasoning to the test. Discover the hidden rule faster than our best agents—no coding required.
Every good benchmark needs a baseline. We're building a human baseline for ZendoBench—if you're interested in participating, contact us.
Coming Soon
An in-browser version of the Zendo game—build koans, form hypotheses, and challenge the master, all without leaving this page.
Leaderboard & metrics
See how different AI models stack up. Compare sample efficiency, calibration, and scientific reasoning across benchmark runs.
Top performers
Ranked by solve rate & efficiency
Key metrics
Calibration, ad-hocness, refutation handling
Run analysis
Episode-level breakdowns & trends
Visualizations coming soon
Charts, leaderboard tables, and per-model metric breakdowns will appear here once the dataset is published.
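As a rough illustration of what those metrics could look like, here is a hedged sketch. The formulas below (solve rate, mean koans-to-solve as sample efficiency, Brier score for calibration) are plausible stand-ins chosen for exposition, not the benchmark's official definitions.

def solve_rate(episodes):
    # Fraction of episodes where the agent's final stated rule matched the hidden rule.
    return sum(ep["solved"] for ep in episodes) / len(episodes)

def mean_koans_to_solve(episodes):
    # Average number of experiments used in solved episodes (lower is more sample-efficient).
    used = [ep["num_koans"] for ep in episodes if ep["solved"]]
    return sum(used) / len(used) if used else float("inf")

def brier_score(predictions):
    # Calibration: mean squared gap between stated confidence and the 0/1 outcome.
    return sum((conf - outcome) ** 2 for conf, outcome in predictions) / len(predictions)

episodes = [{"solved": True, "num_koans": 7}, {"solved": False, "num_koans": 20}]
print(solve_rate(episodes), mean_koans_to_solve(episodes))  # 0.5 7.0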
Leaderboard · In development
Explore the benchmark runs
Browse recorded episodes, replay agent reasoning step-by-step, and compare how different models approached the same hidden rule.