AI reliability,
measured.
Vendor-neutral benchmarks for AI failure modes that don't have standards yet. Deterministic scoring, no LLM judge anywhere, and every result is an Ed25519-signed receipt anyone can verify.
Six adapter configurations on the v0.1 leaderboard
Latest finding · Convergence v0.1
Same model, different framework, 10× the reliability gap.
On the sealed 6-fixture holdout, AutoGen RoundRobinGroupChat collapsed on 0 of 6 scenarios, while the hand-rolled baseline collapsed on 5 of 6. Both ran the same model (gpt-4o-mini). The gap holds after controlling for reveal protocol: the sequential baseline still collapses at 87% on the same fixtures.
Signed run · 30 fixtures · 3 agents × 3 rounds · every row backed by an Ed25519-signed receipt
Full leaderboard →
Deterministic R@K and NDCG scoring on LongMemEval temporal-inference and LoCoMo. The methodology is published, and the runner plus the Engram and Mem0 adapters are committed. Signed memory receipts will be published in the v0.2 cycle, alongside Letta, Zep, MemPalace, and HippoRAG adapters.
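For readers unfamiliar with the two metrics, here is a minimal sketch of R@K and NDCG using their standard textbook definitions with binary relevance. The function names and the binary-gain assumption are ours for illustration; the published methodology is the authority on how przm actually scores.

```typescript
// Illustrative only: textbook Recall@K and NDCG@K with binary relevance.
// Assumes `ranked` contains no duplicate ids.

// Recall@K: share of the relevant items that appear in the top K results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// NDCG@K: discounted gain of the ranking divided by the gain of an ideal ranking.
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  // DCG: each relevant hit at zero-based position i contributes 1 / log2(i + 2).
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += 1 / Math.log2(i + 2);
  });

  // Ideal DCG: all relevant items (up to K) packed into the top positions.
  const idealHits = Math.min(relevant.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);

  return idcg === 0 ? 0 : dcg / idcg;
}
```

Both functions are pure: the same ranking and relevance set always produce the same score, which is the property that makes a result reproducible by a third party.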
How przm works
Three steps from adapter to signed receipt.
Build the adapter
Implement the MultiAgentAdapter contract for your framework, or open a PR against the existing reference adapters in przm-bench. About 30 lines of TypeScript.
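To give a feel for the size of an adapter, here is a rough sketch. Only the name MultiAgentAdapter comes from przm. The Fixture and Turn shapes, the run method, and the Complete callback are hypothetical stand-ins, so check the contract in przm-bench before writing a real adapter.

```typescript
// Hypothetical shapes for illustration. The real MultiAgentAdapter contract
// lives in przm-bench and may use different names and fields.
interface Fixture {
  id: string;
  prompt: string;
  agents: number;
  rounds: number;
}

interface Turn {
  round: number;
  agent: number;
  output: string;
}

interface MultiAgentAdapter {
  readonly name: string;
  // Runs one fixture and returns the recorded turns for later scoring.
  run(fixture: Fixture): Promise<Turn[]>;
}

// Stand-in for whatever model or framework call your adapter wraps.
type Complete = (prompt: string) => Promise<string>;

function roundRobinAdapter(complete: Complete): MultiAgentAdapter {
  return {
    name: "example-round-robin",
    async run(fixture: Fixture): Promise<Turn[]> {
      const turns: Turn[] = [];
      for (let round = 0; round < fixture.rounds; round++) {
        for (let agent = 0; agent < fixture.agents; agent++) {
          // Each agent sees the prompt plus everything said so far.
          const history = turns.map((t) => `agent ${t.agent}: ${t.output}`).join("\n");
          const output = await complete(`${fixture.prompt}\n${history}`);
          turns.push({ round, agent, output });
        }
      }
      return turns;
    },
  };
}
```

The adapter only records what happened. Keeping the scoring out of the adapter is what allows the grading step to work on recorded state alone.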
Run the bench
Run the harness against the fixture set. Deterministic scoring on recorded state. No LLM in the grading loop. Receipts publish to disk.
Sign and verify
Receipts are Ed25519-signed against the public key committed in the repo. Anyone can re-run, verify the signature in their browser, and compare.
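The sign-and-verify step can be sketched with Node's built-in crypto module. This is not przm's receipt format or its in-browser verifier: the Receipt fields, the fixed-key-order serialization, and the freshly generated key pair are all assumptions made so the example runs on its own. A real check would load the public key committed in the repo.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical receipt shape; the real receipt schema may differ.
interface Receipt {
  adapter: string;
  fixture: string;
  score: number;
}

// Serialize with a fixed key order so signer and verifier hash identical bytes.
function canonicalBytes(r: Receipt): Buffer {
  const ordered = { adapter: r.adapter, fixture: r.fixture, score: r.score };
  return Buffer.from(JSON.stringify(ordered), "utf8");
}

function signReceipt(r: Receipt, privateKey: KeyObject): string {
  // Ed25519 has no separate digest step, so the algorithm argument is null.
  return sign(null, canonicalBytes(r), privateKey).toString("base64");
}

function verifyReceipt(r: Receipt, signatureB64: string, publicKey: KeyObject): boolean {
  return verify(null, canonicalBytes(r), publicKey, Buffer.from(signatureB64, "base64"));
}

// Throwaway key pair and placeholder values, for demonstration only.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const receipt: Receipt = { adapter: "example-adapter", fixture: "example-fixture", score: 0 };
const signature = signReceipt(receipt, privateKey);

console.log("valid:", verifyReceipt(receipt, signature, publicKey));
console.log("tampered:", verifyReceipt({ ...receipt, score: 1 }, signature, publicKey));
```

Changing any field after signing changes the serialized bytes, so verification fails. That is why a signed receipt can be published and checked without trusting whoever hosts it.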
For vendors
Certify your release against przm.
A signed receipt is a third-party performance attestation you can put on your own website. We don't sell the harness; we run the test. The first 3 to 5 framework vendors get a free charter cert in exchange for case-study rights.
FAQ