Get your framework independently benchmarked.
"Our agents are reliable" is something every framework vendor says. A przm receipt is the thing that makes it verifiable.
We run the benchmark independently against your release, sign the receipt with our Ed25519 key, and publish it on the public leaderboard. You get a third-party performance attestation you can link from your README, your sales deck, your landing page, or anywhere "this is what the test said" beats "trust us."
Pricing
Free signed receipt for the launch leaderboard, in exchange for case-study rights and a public quote.
- One full benchmark run on your release
- Ed25519-signed receipt with per-scenario transcripts
- Logo placement on the v0.1 launch leaderboard
- Charter-customer badge for your website / README
- Private findings brief before publication
- You provide: company name, 1-2 sentence quote, sample API key
Production cert for an individual framework or model release. 5-business-day turnaround.
- One full benchmark run on your release
- Ed25519-signed receipt with per-scenario transcripts
- Public leaderboard entry
- Private findings brief with the scenarios where you lost points
- 5-business-day turnaround
Standard plus the holdout subset (your strongest defense against benchmark-gaming claims).
- Everything in Standard
- Runs against both the seen fixture set and the 20% holdout fixture set
- Seen-vs-holdout delta published with your leaderboard entry
- 72-hour priority turnaround
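To make the seen-vs-holdout delta concrete, here is a minimal sketch. The text above does not define how the delta is computed; this assumes it is the seen-set score minus the holdout-set score, and the scores are made-up numbers on a hypothetical 0-100 scale.

```python
def seen_vs_holdout_delta(seen_score, holdout_score):
    """Return seen minus holdout (an assumed definition, not taken from the przm spec)."""
    return seen_score - holdout_score


# Made-up example scores, purely for illustration.
delta = seen_vs_holdout_delta(seen_score=84.0, holdout_score=81.5)
print(f"delta = {delta:+.1f}")
```

Under this assumed definition, a delta near zero means the framework scored about the same on fixtures it could have seen as on fixtures it could not.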
Custom fixture set authored against your stated use case. Private receipt unless you choose to publish.
- Everything in Extended
- Custom fixture set for your domain and failure modes
- Private receipt by default (your choice to publish)
- Re-run option if you ship a patch within 30 days
- 72-hour priority turnaround
Volume pricing available for framework vendors who certify each major release. Get in touch.
How it works
- 01 Fill out the form below
Tell us your framework name, the release version you want certified, and which LLM your framework will run on. You get an automated acknowledgment within minutes; Matt follows up within one business day.
- 02 Provide a sample API key
Enough to run the full fixture set (~$5–15 in API costs). We use your key so the run is billed to you and you can independently audit what was called.
- 03 Adapter implementation
If you already have a PR against the przm-bench repo, we use that. If not, we implement a baseline adapter and you review it before we run.
- 04 Run the bench
Temperature 0, seeded where the framework supports it. For non-deterministic frameworks: 3 runs, median per-axis score.
- 05 Private findings brief
You see the numbers privately first. 48 hours to flag any adapter-implementation errors. We will fix genuine bugs; we will not re-run because the score was low.
- 06 Sign + publish
Ed25519-signed receipt committed to the public ledger. You get the signed file to link from your own marketing.
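The aggregation rule in step 04 can be sketched as follows. This is an illustration under assumptions, not the przm-bench implementation: the axis names and the dict-of-scores shape are hypothetical, and only the "3 runs, median per-axis score" rule comes from the text above.

```python
from statistics import median


def median_per_axis(runs):
    """Collapse several runs into one score per axis by taking the median.

    `runs` is a list of dicts mapping axis name -> score (a hypothetical shape).
    """
    if not runs:
        raise ValueError("need at least one run")
    axes = runs[0].keys()
    for run in runs:
        if run.keys() != axes:
            raise ValueError("every run must report the same axes")
    return {axis: median(run[axis] for run in runs) for axis in axes}


# Three runs of a non-deterministic framework (made-up numbers and axis names).
runs = [
    {"axis_a": 0.82, "axis_b": 0.40},
    {"axis_a": 0.78, "axis_b": 0.55},
    {"axis_a": 0.80, "axis_b": 0.45},
]
scores = median_per_axis(runs)
print(scores)
```

Taking the median per axis, rather than the mean, keeps a single outlier run from moving the reported score.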
What it doesn't include
You don't get to see the holdout fixtures, and that's the point. If you saw the test questions, the score would be meaningless. The holdout set is sealed from everyone vendor-side, including us.
The scenarios are designed to break things, not to flatter them. If your framework has a sycophancy problem, the receipt will say so.
A certification run is a certification run. If you want to improve your score, ship a better version and certify that release. Historical receipts stay in the ledger with their original scores.
"przm recommends this product" is not what a receipt says. A receipt says "this is what we found when we ran the test." Vendors with high scores AND low scores can publish; the meaning depends on context and the comparison.
Questions
Yes. LLM providers can certify a specific model version against the baseline adapter. Contact us for model-specific pricing.
The methodology is open source. Submit a PR. We take adversarial feedback seriously. The spec explicitly allows competitors to submit replacement confederate prompts and we publish both runs. If your objection is substantive, it makes the benchmark better.
Receipts pin the benchmark version. A convergence-v0.1 receipt is always comparable to other convergence-v0.1 receipts. When the benchmark version bumps, prior receipts stay in the ledger with their version label.
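The comparability rule can be sketched as a simple check. The receipt fields below (`framework`, `benchmark_version`, `score`) are hypothetical, not the actual receipt schema, and `convergence-v0.2` is an invented later version used only for illustration. The one thing taken from the answer above is that receipts are comparable when they pin the same benchmark version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    # Hypothetical fields; the real receipt format is not described here.
    framework: str
    benchmark_version: str
    score: float


def comparable(a, b):
    """Two receipts are comparable only if they pin the same benchmark version."""
    return a.benchmark_version == b.benchmark_version


old = Receipt("framework-x", "convergence-v0.1", 71.0)
peer = Receipt("framework-y", "convergence-v0.1", 76.5)
newer = Receipt("framework-x", "convergence-v0.2", 80.0)

print(comparable(old, peer))   # same pinned version
print(comparable(old, newer))  # different pinned versions
```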
We don't sell the harness; we sell being the third party that ran the test. Same way a security auditor sells their name, not their checklist. The OSS is free. The signed-by-us receipt is the product.