Outcome Governance Benchmark · March 2026

Complete tier separation. Zero exceptions.

Twelve AI agent frameworks. Three leading model families. Eight hundred and twenty-eight scored decisions. Every governed framework graded A or B. Every ungoverned framework graded F. The causal information architecture determined the outcome, not the model.

12 frameworks · 3 model families · 828 decisions scored · 0 exceptions

The setup

Same test. Every framework.

Identical scenarios across three LLM families, with and without OSR governance. Six scoring criteria, 210 points total. The setup was designed to isolate what actually drives outcome quality.
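The scoring structure above can be sketched as a small grading function. Only the six-criterion, 210-point structure and the SC-05 label come from the report; the other criterion IDs, the equal 35-point weighting, and the letter-grade cutoffs below are illustrative assumptions.

```python
# Illustrative sketch of the benchmark's scoring aggregation.
# ASSUMPTIONS: criterion IDs other than SC-05, the equal per-criterion
# weighting (6 x 35 = 210), and the grade cutoffs are hypothetical.

CRITERIA = ["SC-01", "SC-02", "SC-03", "SC-04", "SC-05", "SC-06"]
MAX_PER_CRITERION = 35  # 6 criteria x 35 points = 210 points total

def grade(scores: dict) -> str:
    """Map per-criterion scores to a letter grade (illustrative cutoffs)."""
    total = sum(scores[c] for c in CRITERIA)
    pct = total / (MAX_PER_CRITERION * len(CRITERIA))
    if pct >= 0.90:
        return "A"
    if pct >= 0.80:
        return "B"
    if pct >= 0.70:
        return "C"
    if pct >= 0.60:
        return "D"
    return "F"

# Hypothetical score profiles showing the tier separation the report describes.
governed = {c: 32 for c in CRITERIA}    # 192/210, lands in the A tier
ungoverned = {c: 18 for c in CRITERIA}  # 108/210, lands in the F tier
```

Under these assumed cutoffs, a governed-style profile grades A and an ungoverned-style profile grades F, mirroring the complete tier separation reported in the study.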

Models tested

Claude, GPT-4o, Gemini

Three leading model families, each run through every framework. The same prompts. The same scenarios. The same scoring rubric.

Frameworks

Twelve agent stacks

Single-agent, multi-agent, orchestrated. LangChain, CrewAI, AutoGen, and more. Six with OSR + CPP governance, six ungoverned.

Scenarios

Thirty-six months

Two multi-year simulation scenarios covering demand shocks, capacity constraints, and financial masking. 828 scored decisions.

Key findings

What governance detected. What ungoverned agents missed.

Five detection patterns appeared consistently in the governed cohort. None appeared unprompted in the ungoverned cohort across any model family.

Finding 01

Financial Masking detection

Surface revenue metrics appeared healthy while underlying operational stocks collapsed. Governed frameworks detected it via the SC-05 criterion. No ungoverned framework surfaced the masking unprompted across any of the three model families.

Finding 02

Revenue Inversion

Ungoverned agents optimized the visible revenue line and inverted the underlying outcome. Cash flowed in while the system that produced the cash was eroding. Governance caught the decoupling; pure target-following did not.

Finding 03

The Vitality Trap

Governed frameworks tolerated short-term apparent stagnation to protect long-term system health. Ungoverned frameworks chased short-term motion signals and collapsed by month 24. The governed curves were slower to move and far more durable.

Finding 04

Healthy Metrics Hallucination

When asked to self-assess, ungoverned frameworks produced plausible-sounding health narratives that contradicted the ground truth in the simulation. Governance made the contradiction visible. Without it, the narrative won.

Finding 05

The Orchestration Gap

CrewAI and LangChain received identical OSR specs. The difference in outcome quality between them was negligible under governance and enormous without it. Orchestration framework mattered less than governance presence.

Download the artifacts

Three perspectives. One report.

The benchmark is published as three separate artifacts so readers can pull the perspective they need. All three are ungated and free.

PNG · Summary

Summary Scores

The headline result. Twelve frameworks, three model families, complete tier separation between governed and ungoverned cohorts.

View image →
PNG · Comparison

Side-by-Side Comparison

Scoring across all six criteria, framework by framework. The visual proof of where governance moved the curve and where it didn't.

View image →
PDF · Findings

Full Findings Report

The complete narrative. Methodology, scorecards, and the five named detection patterns from the March 2026 study.

Download PDF →

The full report

Read the evidence.

The full March 2026 report covers the complete methodology, all twelve framework scorecards, per-criterion breakdowns across the three model families, and the full set of detection patterns.

Ungated. No email required.