second-witness — a due-diligence researcher for AI-tool decisions

90-second walkthrough

Run it on a number you already trust

You've quoted a SWE-bench score. Here's the second witness.

You paste:

We're choosing an AI coding agent. The model we're leaning on reports low-80s% on SWE-bench Verified — that's our main capability signal. Good enough to standardize on?

A summarizer repeats the score back. This returns:

Your decision rests on a number whose own author stopped trusting it. "Low-80s% on SWE-bench Verified" is tier-3 / contaminated: SWE-bench Verified is a static, public benchmark, and in 2026 that's the dominant failure mode — a high score can measure memorization, not capability.

Second witness: OpenAI announced in Feb 2026 it would stop reporting its scores on SWE-bench Verified, citing contamination and flawed test cases. The contamination-resistant sibling — SWE-bench Pro (Scale, private + held-out repos) — scores the same frontier models dramatically lower. That gap is the finding. The leaderboard score you're leaning on is the first witness contradicting itself.

verifiable — check the model's standing on a contamination-resistant board (SWE-bench Pro, LiveCodeBench), not the static one

biddable — capability on your code: a weekend eval on 20 real tasks from your own repo beats any public benchmark

unknowable — how it does on the messy multi-file work the benchmark never contained

Verdict: don't standardize on the Verified number — it's a witness that recanted. Run the 20-task eval on your own repo first; that's the only witness that settles your decision.

Every claim above is real and dated — sourced in reference/ai-specific-failure-modes-2026.md, not asserted. A tool that preaches source-rigor cites its own.

Built by

I'm Nicolás Patrón. I evaluate AI tooling for a consultancy and a family-holding AI build, and I've written the "should we adopt this" memo more than once — for vector DBs, RAG frameworks, model providers. I was wrong once, on a vendor benchmark that looked airtight until the production load test. That migration is why this folder exists.

It's the researcher I wish I'd had then. A folder of plain markdown — no install, no API key. You drop it into a Claude Project and Claude stops summarizing pitches and starts hunting witnesses.

What you get back

A map of certainty. Not a thumbs-up.

You paste:

Standardizing our retrieval on VectorDB-A. Site says sub-50ms p99 at 1M vectors. We filter heavily by tenant_id, loading ~8M, search path is a hard SLA.

A summarizer says "yes, sub-50ms is great at that scale." This returns:

The headline claim and the claim your decision rests on are not the same. "Sub-50ms p99 at 1M" is tier-3 — vendor sole witness — and it's at 1M while you're loading 8M, almost certainly unmeasured under tenant filtering, which is your condition.

Second witness: the classic ANN suite (ann-benchmarks.com) measures unfiltered search — wrong witness. The filtered suites (VDBBench, Qdrant's filtered-search benchmark) are the right place; the literature says filter selectivity, not raw count, dominates the tail. I found no independent reproduction at 8M with tenant filtering on hardware like yours — that absence is the finding.

verifiable — architecture supports filtered search

biddable — your real p99 at 8M+filter: a 1-week pilot on ~800k vectors settles it

unknowable — behavior under sustained write load at peak

Verdict: don't standardize on the vendor's 50ms. Run the filtered pilot first — days instead of a migration. Your decision rests almost entirely on the biddable claim.

That difference — between reorganizing the input and going to find what the input left out — is the whole assignment.

The method

Six moves on every claim.

interrogate  → what decision? what breaks? your conditions?
tier         → rank by witness (vendor = tier 3, sole, incentivized)
hunt         → go find the second witness; report gaps too
ground       → the variable that moves the number for you
bucket       → verifiable-now / biddable / unknowable
price        → confirm it · design the pilot · plan the abort

It tiers by witness, not by volume

Independent reproduction > peer-context paper > vendor docs > marketing > anecdote. The loudest source is not the highest tier. A claim with no witness but the vendor is held out of the verdict until a second one shows up.

It hunts the AI-specific failure modes first

Benchmark contamination (the dominant 2026 threat — a "SOTA" score on a leaked benchmark measures memory, not capability), effective-vs-nominal context, filter selectivity, and anything only an eval on your own data can settle. A generic vendor-eval skips exactly these.

It prices the unknowable instead of faking it

When a claim only resolves at 3am during a real outage, it says so and helps design the abort plan — rather than filling the gap with confident-sounding synthesis. Confidence never exceeds "moderate" on a vendor-sole-witness claim.

Getting started

Three paths.

A — 60-second cold test

In a Claude Project with the folder loaded, paste "Should we use [any AI tool you know]?" with no other context. If it asks what decision it's for instead of summarizing — it's working.

B — a real evaluation

Paste an actual claim you're weighing with your conditions (scale, constraints, what breaks if wrong). Get the bucketed map back. See JUDGE_GUIDE.md for five paste-ready tests (~10 min).

C — agent / CLI

The contract lives in rules.md; AGENTS.md points any agent runtime at the read order. Works with Claude, Codex, Cursor — any capable model that reads a folder.

What's in the folder

The folder is the program.

second-witness/
├─ identity.md      — who the researcher is, what decision it serves
├─ rules.md         — how it researches: the contract. Load-bearing.
├─ examples.md      — five worked exchanges; what good looks like
├─ reference/       — witness ladder, claims taxonomy, hype patterns,
│                     question bank, workflow, 2026 AI failure modes
├─ identity-examples/  — three real buyer profiles
├─ JUDGE_GUIDE.md   — five paste-and-go tests (~10 min)
└─ README.md        — onboarding in under two minutes

See the repo on GitHub

It informs a decision — it doesn't make it, and it's never a substitute for testing a tool on your own workload before you commit.

A vendor's claim and an independent benchmark aren't two data points to average. They're two witnesses with opposite incentives.