Outcome Governance Benchmark · March 2026

Complete tier separation. Zero exceptions.

Twelve AI agent frameworks. Three leading model families. Eight hundred and twenty-eight scored decisions. Every governed framework graded A or B. Every ungoverned framework graded F. The causal information architecture determined the outcome, not the model.

12 frameworks · 3 model families · 828 decisions scored · 0 exceptions

The setup

Same test. Every framework.

Identical scenarios across three LLM families, with and without OSR governance. Six scoring criteria, 210 points total. The setup was designed to isolate what actually drives outcome quality.
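The scoring structure above can be sketched as a small grading function. Only the six-criterion, 210-point structure and the SC-05 label come from the report; the other criterion IDs, the equal 35-point weighting, and the letter-grade cutoffs below are illustrative assumptions.

```python
# Illustrative sketch of the benchmark's scoring aggregation.
# ASSUMPTIONS: criterion IDs other than SC-05, the equal per-criterion
# weighting (6 x 35 = 210), and the grade cutoffs are hypothetical.

CRITERIA = ["SC-01", "SC-02", "SC-03", "SC-04", "SC-05", "SC-06"]
MAX_PER_CRITERION = 35  # 6 criteria x 35 points = 210 points total

def grade(scores: dict) -> str:
    """Map per-criterion scores to a letter grade (illustrative cutoffs)."""
    total = sum(scores[c] for c in CRITERIA)
    pct = total / (MAX_PER_CRITERION * len(CRITERIA))
    if pct >= 0.90:
        return "A"
    if pct >= 0.80:
        return "B"
    if pct >= 0.70:
        return "C"
    if pct >= 0.60:
        return "D"
    return "F"

# Hypothetical score profiles showing the tier separation the report describes.
governed = {c: 32 for c in CRITERIA}    # 192/210, lands in the A tier
ungoverned = {c: 18 for c in CRITERIA}  # 108/210, lands in the F tier
```

Under these assumed cutoffs, a governed-style profile grades A and an ungoverned-style profile grades F, mirroring the complete tier separation reported in the study.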

Models tested

Claude, GPT-4o, Gemini

Three leading model families, each run through every framework. The same prompts. The same scenarios. The same scoring rubric.

Frameworks

Twelve agent stacks

Single-agent, multi-agent, orchestrated. LangChain, CrewAI, AutoGen, and more. Six with OSR + CPP governance, six ungoverned.

Scenarios

Thirty-six months

Two multi-year simulation scenarios covering demand shocks, capacity constraints, and financial masking. 828 scored decisions.

Key findings

What governance detected. What ungoverned agents missed.

Five detection patterns appeared consistently in the governed cohort. None appeared unprompted in the ungoverned cohort across any model family.

Finding 01

Financial Masking detection

Surface revenue metrics appeared healthy while underlying operational stocks collapsed. Governed frameworks detected it via the SC-05 criterion. No ungoverned framework surfaced the masking unprompted across any of the three model families.

Finding 02

Revenue Inversion

Ungoverned agents optimized the visible revenue line and inverted the underlying outcome. Cash flowed in while the system that produced the cash was eroding. Governance caught the decoupling; pure target-following did not.

Finding 03

The Vitality Trap

Governed frameworks tolerated short-term apparent stagnation to protect long-term system health. Ungoverned frameworks chased short-term motion signals and collapsed by month 24. The governed curves were slower to move and far more durable.

Finding 04

Healthy Metrics Hallucination

When asked to self-assess, ungoverned frameworks produced plausible-sounding health narratives that contradicted the ground truth in the simulation. Governance made the contradiction visible. Without it, the narrative won.

Finding 05

The Orchestration Gap

CrewAI and LangChain received identical OSR specs. The difference in outcome quality between them was negligible under governance and enormous without it. Orchestration framework mattered less than governance presence.

Download the artifacts

Three perspectives. One report.

The benchmark is published as three separate artifacts so readers can pull the perspective they need. All three are ungated and free.

PNG · Summary

Summary Scores

The headline result. Twelve frameworks, three model families, complete tier separation between governed and ungoverned cohorts.

View image →
PNG · Comparison

Side-by-Side Comparison

Scoring across all six criteria, framework by framework. The visual proof of where governance moved the curve and where it didn't.

View image →
PDF · Findings

Full Findings Report

The complete narrative. Methodology, scorecards, and the five named detection patterns from the March 2026 study.

Download PDF →

The full report

Read the evidence.

The full March 2026 report covers the complete methodology, all twelve framework scorecards, per-criterion breakdowns across the three model families, and the full set of detection patterns.

Ungated. No email required.