
ZendoBench

Testing scientific agency in AI scientists

An interactive benchmark where LLM-based agents infer hidden rules over koans, publish hypotheses, and survive adversarial peer review.

What is Zendo(Bench)?

A benchmark for scientific minds

Explore how well AI agents think like scientists—forming hypotheses, running experiments, and updating when proven wrong.

The Big Idea

Imagine a game where someone has a secret rule—like "every structure must have at least one red piece." You don't know the rule. Your job? Build little structures, guess the rule, and see if you're right.

Zendo is that game. We put AI scientists in the same spot: can they discover the rule by experimenting, guessing, and learning from being wrong?

Koans = Little Experiments

A koan is a small arrangement of colored pieces (pyramids, blocks, wedges). Each one either follows or violates the secret rule. Try revealing a few below—can you spot the pattern?

1 red pyramid, 1 blue block · Click to reveal
2 red pieces · Click to reveal
1 red wedge, 1 green block · Click to reveal
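For readers who think in code, here is a minimal sketch of how a koan and a secret rule might be represented. The `Piece` and `Koan` types and the example rule are illustrative assumptions, not ZendoBench's actual data model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Piece:
    color: str   # e.g. "red", "blue", "green"
    shape: str   # e.g. "pyramid", "block", "wedge"

# A koan is simply a collection of pieces.
Koan = tuple[Piece, ...]

def has_red_piece(koan: Koan) -> bool:
    """Example secret rule: the structure contains at least one red piece."""
    return any(p.color == "red" for p in koan)

# The three koans above (shapes in the second koan chosen arbitrarily):
koan_a = (Piece("red", "pyramid"), Piece("red", "block"))
koan_b = (Piece("red", "pyramid"), Piece("blue", "block"))
koan_c = (Piece("red", "wedge"), Piece("green", "block"))
```

A rule is just a predicate over koans, so checking a koan against a hypothesis is a single function call.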

Guess, Then Get Checked

When the AI thinks it knows the rule, it writes it down. Then comes the cool part: peer review. A "Master" tries to disprove the guess by building a counterexample—a structure that breaks the AI's rule but fits the real one.

Good scientists revise when they're wrong. We measure whether AI does that too—or just keeps adding excuses.
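The peer-review step can be sketched as a brute-force search for a disagreement between the agent's stated rule and the true one. The function names and the two example rules below are hypothetical, shown only to make the protocol concrete:

```python
from itertools import product

COLORS = ("red", "blue", "green")
SHAPES = ("pyramid", "block", "wedge")

def all_structures(max_pieces=2):
    """Enumerate small structures as tuples of (color, shape) pieces."""
    pieces = list(product(COLORS, SHAPES))
    for n in range(1, max_pieces + 1):
        yield from product(pieces, repeat=n)

def find_counterexample(agent_rule, true_rule, max_pieces=2):
    """Return a structure the two rules classify differently, or None."""
    for s in all_structures(max_pieces):
        if agent_rule(s) != true_rule(s):
            return s
    return None

# True rule: at least one red piece.
# Agent's over-specific guess: at least one red *pyramid*.
true_rule = lambda s: any(c == "red" for c, _ in s)
agent_rule = lambda s: any(c == "red" and sh == "pyramid" for c, sh in s)

cx = find_counterexample(agent_rule, true_rule)
# cx is a structure such as a lone red block: it fits the real rule
# but breaks the agent's guess, refuting the hypothesis.
```

In the real benchmark the Master need not enumerate exhaustively; any single structure where the two predicates disagree is enough to refute the hypothesis.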

Now try a harder one

Real benchmark rules are more complex—they can involve counting, comparing quantities, or tracking multiple attributes at once. Can you figure out this rule from four koans?

2 pyramids, 1 block · Click to reveal
1 pyramid, 2 blocks · Click to reveal
1 pyramid, 1 wedge, 1 block · Click to reveal
1 pyramid, 3 blocks · Click to reveal
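As a sketch of what these richer rule families look like, here are two hypothetical predicates involving counting and quantity comparison. These are illustrative examples only, not the hidden rule behind the four koans above:

```python
from collections import Counter

def shape_counts(koan):
    """Count shapes in a koan given as a tuple of (color, shape) pieces."""
    return Counter(shape for _, shape in koan)

def more_pyramids_than_blocks(koan):
    """A quantity-comparison rule."""
    counts = shape_counts(koan)
    return counts["pyramid"] > counts["block"]

def exactly_two_colors(koan):
    """A multi-attribute counting rule."""
    return len({color for color, _ in koan}) == 2

koan = (("red", "pyramid"), ("red", "pyramid"), ("blue", "block"))
more_pyramids_than_blocks(koan)  # True: 2 pyramids vs 1 block
exactly_two_colors(koan)         # True: red and blue
```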

What We're Testing

Scientific agency: Can AI design good experiments? State clear hypotheses? Handle being wrong gracefully? This benchmark probes those skills in a controlled, measurable way.

Play it yourself

Can you beat the AI?

Put your own scientific reasoning to the test. Discover the hidden rule faster than our best agents—no coding required.

Every good benchmark needs a baseline. We're building a human baseline for ZendoBench—if you're interested in participating, contact us.

Coming Soon

An in-browser version of the Zendo game—build koans, form hypotheses, and challenge the master, all without leaving this page.

Interactive game · In development
Statistics

Leaderboard & metrics

See how different AI models stack up. Compare sample efficiency, calibration, and scientific reasoning across benchmark runs.

Top performers

Ranked by solve rate & efficiency

Key metrics

Calibration, ad-hocness, refutation handling

Run analysis

Episode-level breakdowns & trends

Visualizations coming soon

Charts, leaderboard tables, and per-model metric breakdowns will appear here once the dataset is published.

Leaderboard · In development
Lab Notebook

Explore the benchmark runs

Browse recorded episodes, replay agent reasoning step-by-step, and compare how different models approached the same hidden rule.
