v0.1 signed receipts live

AI reliability,
measured.

Vendor-neutral benchmarks for AI failure modes that don't have standards yet. Deterministic scoring, no LLM judge anywhere, and every result is an Ed25519-signed receipt anyone can verify.

See the leaderboard · Get certified

Methodology · GitHub · Verify a receipt · Apache 2.0

Live · convergence v0.1

holdout · n=6

Adapter · Model · Collapse ↓
autogen · gpt-4o-mini · 0.0%
baseline-seq · haiku-4.5 · 66.7%
baseline · haiku-4.5 · 66.7%
baseline · gpt-4o-mini · 83.3%
baseline-seq · gpt-4o-mini · 83.3%
baseline · gpt-5-mini · 100.0%

Full leaderboard →

Six adapter configurations on the v0.1 leaderboard

baseline-Anthropic · baseline-Azure · AutoGen · CrewAI (v0.2) · LangGraph (v0.2) · OpenAI Agents SDK (v0.2) · AG2 (v0.2) · Agno (v0.2)

hand-curated fixtures

Ed25519-signed receipts

No LLM judges anywhere

20% sealed holdout

Latest finding · Convergence v0.1

Same model, different framework, 10× the reliability gap.

On the sealed 6-fixture holdout, AutoGen RoundRobinGroupChat collapsed on 0 of 6 scenarios, while the hand-rolled baseline collapsed on 5 of 6 with the same model (gpt-4o-mini). The gap holds after controlling for reveal protocol: the sequential baseline still collapses on 5 of 6 (83.3%) of the same fixtures.
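The holdout percentages follow directly from the fixture counts. Here is a minimal sketch of that arithmetic, assuming each fixture is recorded with a boolean collapse flag. The `FixtureOutcome` shape and the helper names are hypothetical and are not taken from przm-bench.

```typescript
// Hypothetical per-fixture outcome; field names are assumptions for illustration.
interface FixtureOutcome {
  fixtureId: string;
  collapsed: boolean;
}

// Percentage of fixtures flagged as collapsed, rounded to one decimal place.
function collapseRate(outcomes: FixtureOutcome[]): number {
  if (outcomes.length === 0) throw new Error("no outcomes to score");
  const collapsed = outcomes.filter((o) => o.collapsed).length;
  return Math.round((collapsed / outcomes.length) * 1000) / 10;
}

// Build a synthetic holdout of `total` fixtures where the first `collapsed` are flagged.
function makeHoldout(collapsed: number, total: number): FixtureOutcome[] {
  return Array.from({ length: total }, (_, i) => ({
    fixtureId: `holdout-${i + 1}`,
    collapsed: i < collapsed,
  }));
}

const autogenRate = collapseRate(makeHoldout(0, 6));  // 0 of 6
const baselineRate = collapseRate(makeHoldout(5, 6)); // 5 of 6
console.log(autogenRate.toFixed(1), baselineRate.toFixed(1)); // prints "0.0 83.3"
```

With only six fixtures, each one moves the rate by about 16.7 points, so the holdout numbers are best read alongside the 30-fixture run.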

Adapter · Model · Correct · Collapse
baseline · haiku-4.5 · 96.7% · 56.7%
baseline · gpt-5-mini · 96.7% · 96.7%
baseline · gpt-4o-mini · 76.7% · 73.3%
autogen · gpt-4o-mini · 93.3% · 10.0%

Signed run · 30 fixtures · 3 agents × 3 rounds · every row backed by an Ed25519-signed receipt

Full leaderboard →

Second axis · AI memory recall

Deterministic R@K and NDCG scoring on LongMemEval temporal-inference and LoCoMo. Methodology published, runner + Engram and Mem0 adapters committed. Signed memory receipts publish on the v0.2 cycle alongside Letta, Zep, MemPalace, and HippoRAG adapters.
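For readers unfamiliar with the two metrics, here is a minimal sketch of Recall@K and NDCG@K using their textbook definitions. It assumes binary relevance and a ranked list without duplicate IDs. Whether przm uses binary or graded relevance is not stated here, and the function names and memory IDs are illustrative only.

```typescript
// Recall@K: fraction of the relevant items that appear in the top K results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// NDCG@K with binary relevance: DCG of the ranking divided by the ideal DCG,
// where a hit at 0-based position i contributes 1 / log2(i + 2).
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  const idealHits = Math.min(k, relevant.size);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}

// Illustrative retrieval result: two relevant memories, one retrieved in the top 3.
const rankedIds = ["m3", "m1", "m7", "m2"];
const relevantIds = new Set(["m1", "m2"]);
const recall3 = recallAtK(rankedIds, relevantIds, 3);
const ndcg3 = ndcgAtK(rankedIds, relevantIds, 3);
console.log(recall3, ndcg3.toFixed(3));
```

Both functions are pure and depend only on the recorded ranking and the relevance labels, which is what makes this kind of scoring repeatable without a model in the loop.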

Memory methodology →

How przm works

Three steps from adapter to signed receipt.

Build the adapter

Implement the MultiAgentAdapter contract for your framework, or PR the existing reference adapters in przm-bench. About 30 lines of TypeScript.
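To give a sense of scale, here is a sketch of what an adapter of roughly that size might look like. Every shape below is hypothetical: the real MultiAgentAdapter contract is defined in przm-bench and may differ in names, fields, and method signatures. The 3 agents × 3 rounds layout mirrors the signed run described on this page.

```typescript
// HYPOTHETICAL types: not the actual przm-bench contract.
interface Scenario {
  id: string;
  prompt: string;
  agents: number;
  rounds: number;
}

interface AgentTurn {
  agent: string;
  round: number;
  answer: string;
}

interface MultiAgentAdapter {
  readonly name: string;
  // Run one scenario and return the recorded turns for later scoring.
  run(scenario: Scenario): Promise<AgentTurn[]>;
}

// Lay out one turn per agent per round. A real adapter would call its
// framework here; this stub just echoes the prompt as every answer.
function planTurns(scenario: Scenario): AgentTurn[] {
  const turns: AgentTurn[] = [];
  for (let round = 1; round <= scenario.rounds; round++) {
    for (let a = 1; a <= scenario.agents; a++) {
      turns.push({ agent: `agent-${a}`, round, answer: scenario.prompt });
    }
  }
  return turns;
}

const stubAdapter: MultiAgentAdapter = {
  name: "stub",
  async run(scenario) {
    return planTurns(scenario);
  },
};
```

Returning the full turn record, rather than a single final answer, is one way to let a harness score the run afterwards without calling the framework again.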

Run the bench

Run the harness against the fixture set. Deterministic scoring on recorded state. No LLM in the grading loop. Receipts publish to disk.

Sign and verify

Receipts are Ed25519-signed against the public key committed in the repo. Anyone can re-run, verify the signature in their browser, and compare.
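The sign-and-verify step can be sketched with Node's built-in `node:crypto` module. This is an illustration under stated assumptions, not przm's verifier: the receipt fields are hypothetical, and the key pair is generated on the spot, whereas real verification would load the public key committed in the repo.

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Hypothetical receipt payload; the real receipt schema is not shown on this page.
const receipt = JSON.stringify({
  bench: "convergence",
  version: "v0.1",
  adapter: "autogen",
  model: "gpt-4o-mini",
});

// Throwaway key pair for the example only.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

// Ed25519 hashes internally, so the digest algorithm argument is null.
const payload = Buffer.from(receipt, "utf8");
const signature = sign(null, payload, privateKey);

// The untouched payload verifies; changing any byte invalidates the signature.
const valid = verify(null, payload, publicKey, signature);
const tampered = Buffer.from(receipt.replace("autogen", "baseline"), "utf8");
const tamperedValid = verify(null, tampered, publicKey, signature);

console.log(valid, tamperedValid); // prints "true false"
```

The signature covers the exact bytes of the payload, so a verifier has to check the receipt as published rather than a re-serialized copy.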

For vendors

Certify your release against przm.

A signed receipt is a third-party performance attestation you can put on your own website. We don't sell the harness; we run the test. The first 3 to 5 framework vendors get a free charter cert in exchange for case-study rights.

See certification tiers →

FAQ

The questions you'd ask first.

They sell eval SaaS to the same AI app builders whose frameworks you'd want benchmarked. That's a structural conflict: publishing "this framework's agents collapse to wrong answers" antagonizes their customer base. przm doesn't sell into the framework market; we publish numbers and sell third-party certification on top. Different revenue model, different incentive shape, vendor-neutral by construction.
