
ZendoBench
Testing scientific agency in AI scientists
An interactive benchmark where LLM-based agents infer hidden rules over koans, publish hypotheses, and survive adversarial peer review.
A benchmark for scientific minds
Explore how well AI agents think like scientists—forming hypotheses, running experiments, and updating when proven wrong.
The Big Idea
Imagine a game where someone has a secret rule—like "every structure must have at least one red piece." You don't know the rule. Your job? Build little structures, guess the rule, and see if you're right.
Zendo is that game. We put AI scientists in the same spot: can they discover the rule by experimenting, guessing, and learning from being wrong?
Koans = Little Experiments
A koan is a small arrangement of colored pieces (pyramids, blocks, wedges). Each one either follows or violates the secret rule. Try revealing a few below—can you spot the pattern?
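To make that concrete, here is a minimal sketch of how a koan and a hidden rule might be represented. The Piece and Koan names and the example rule are illustrative only, not the benchmark's actual schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class Piece:
    color: str   # e.g. "red", "blue", "yellow"
    shape: str   # e.g. "pyramid", "block", "wedge"

# A koan is just an arrangement of pieces; the secret rule is a predicate over it.
Koan = tuple[Piece, ...]

def secret_rule(koan: Koan) -> bool:
    # Example rule: every structure must have at least one red piece.
    return any(p.color == "red" for p in koan)

# An agent's experiment: build a koan, then observe whether it follows the rule.
experiment = (Piece("blue", "block"), Piece("red", "pyramid"))
print(secret_rule(experiment))  # True: it contains a red piece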
Guess, Then Get Checked
When the AI thinks it knows the rule, it writes it down. Then comes the cool part: peer review. A "Master" tries to disprove the guess by building a counterexample—a structure that breaks the AI's rule but fits the real one.
Good scientists revise when they're wrong. We measure whether AI does that too—or just keeps adding excuses.
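Under the hood, that review step can be thought of as a search for a structure the guess gets wrong. Here is a hedged sketch reusing the Piece and secret_rule definitions from the sketch above; the function name and the brute-force search are illustrative, not the benchmark's actual Master implementation.

from itertools import product

def find_counterexample(hypothesis, true_rule, colors, shapes, max_pieces=3):
    # Brute-force search over small koans for a structure that breaks the
    # hypothesised rule while still satisfying the real one.
    palette = [Piece(c, s) for c, s in product(colors, shapes)]
    for n in range(1, max_pieces + 1):
        for koan in product(palette, repeat=n):
            if true_rule(koan) and not hypothesis(koan):
                return koan  # counterexample found: the hypothesis is refuted
    return None  # hypothesis survives review, at least up to this search depth

# The agent guesses "at least one pyramid"; the real rule is "at least one red piece".
guess = lambda koan: any(p.shape == "pyramid" for p in koan)
print(find_counterexample(guess, secret_rule, ["red", "blue"], ["pyramid", "block"]))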
Now try a harder one
Real benchmark rules are more complex—they can involve counting, comparing quantities, or tracking multiple attributes at once. Can you figure out this rule from four koans?
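For a sense of what "more complex" means, here are a few hypothetical rules of that flavour, written against the Piece sketch above. They are illustrations only, not the rule hidden in this puzzle.

def rule_exactly_two_pyramids(koan):
    # Counting: the structure contains exactly two pyramids.
    return sum(p.shape == "pyramid" for p in koan) == 2

def rule_more_red_than_blue(koan):
    # Comparing quantities: more red pieces than blue pieces.
    return sum(p.color == "red" for p in koan) > sum(p.color == "blue" for p in koan)

def rule_red_and_no_wedges(koan):
    # Multiple attributes at once: at least one red piece and no wedges.
    return any(p.color == "red" for p in koan) and all(p.shape != "wedge" for p in koan)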
What We're Testing
Scientific agency: Can AI design good experiments? State clear hypotheses? Handle being wrong gracefully? This benchmark probes those skills in a controlled, measurable way.
Can you beat the AI?
Put your own scientific reasoning to the test. Discover the hidden rule faster than our best agents—no coding required.
Every good benchmark needs a baseline. We're building a human baseline for ZendoBench—if you're interested in participating, contact us.
Coming Soon
An in-browser version of the Zendo game—build koans, form hypotheses, and challenge the master, all without leaving this page.
Leaderboard & metrics
See how different AI models stack up. Compare sample efficiency, calibration, and scientific reasoning across benchmark runs.
Top performers
Ranked by solve rate & efficiency
Key metrics
Calibration, ad-hocness, refutation handling
Run analysis
Episode-level breakdowns & trends
Visualizations coming soon
Charts, leaderboard tables, and per-model metric breakdowns will appear here once the dataset is published.
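As a rough illustration of what those metrics could look like, here is a hedged sketch. The formulas below (solve rate, mean koans-to-solve as sample efficiency, Brier score for calibration) are plausible stand-ins chosen for exposition, not the benchmark's official definitions.

def solve_rate(episodes):
    # Fraction of episodes where the agent's final stated rule matched the hidden rule.
    return sum(ep["solved"] for ep in episodes) / len(episodes)

def mean_koans_to_solve(episodes):
    # Average number of experiments used in solved episodes (lower is more sample-efficient).
    used = [ep["num_koans"] for ep in episodes if ep["solved"]]
    return sum(used) / len(used) if used else float("inf")

def brier_score(predictions):
    # Calibration: mean squared gap between stated confidence and the 0/1 outcome.
    return sum((conf - outcome) ** 2 for conf, outcome in predictions) / len(predictions)

episodes = [{"solved": True, "num_koans": 7}, {"solved": False, "num_koans": 20}]
print(solve_rate(episodes), mean_koans_to_solve(episodes))  # 0.5 7.0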
Leaderboard · In development
Explore the benchmark runs
Browse recorded episodes, replay agent reasoning step-by-step, and compare how different models approached the same hidden rule.