COHORT 01 · DESIGN PARTNERS · Q3 '26 · forming

Quality needs a new interface.
Lead the release.

Release confidence for your CI. One ledger for humans and the agents you ship with.

Failure inbox: Every failure carries a verdict, confidence, and reasons[]. Never a black box.
Release confidence: One call, with the evidence behind it: should_block_release(). Override is one click.
Reads from your CI: GitHub Actions · Playwright · Cypress today. One YAML step. Built to absorb your full pipeline.
Agent-readable: Every screen has a JSON peek and an MCP tool. Humans, APIs, and agents read the same ledger.

On the name

Old word. A bellwether is the lead sheep that wears the bell, the one whose movement the flock reads to know which way the day is going. We took the name because that's the gap. Your CI emits a thousand signals an hour. Someone has to be the bell: the neutral, audible thing that says this is real, that isn't. Bellwether is that bell.

§ 02 · MANIFESTO · THE SHIFT · BWX/01

02 · The shift → a worldview, not a feature grid

Software changed. QA changed. The way teams decide what to ship hasn't.

01 · WAS

QA was a handoff.

A team owned testing. They wrote suites, gated releases, and the rest of engineering trusted the green check. The interface to quality was a person.

qa-nightly · run #487 · 4m 12s · qa-owned · single signal
- auth/login.spec.ts · QA
- checkout/cart.spec.ts · QA
- api/release/health.spec.ts · QA
- ui/breaking.spec.ts · QA
- data/migration.spec.ts · QA
release.decision = "ship" · QA approved, handoff complete
02 · IS
QA is shared.
Agents write tests. SDETs review them. Engineers triage failures in Slack. Testing is faster and noisier; ownership is everywhere and nowhere. The handoff dissolved.
bel-22 / a1b2c3 · run #1248 · live · 4m 12s · 6 owners · 3 reruns · 12 unread
- auth/login.spec.ts · RM
- checkout/cart.spec.ts · rerun #2 · A1
- api/release/should-block.spec.ts · flake? · RM
- ui/breaking.spec.ts · A2
- data/migration-up.spec.ts · rerun #3 · PA
- api/billing/edge.spec.ts · infra? · A1
#release-cuts · 12 unread · 2 threads
PA: flake? rerunning
A2: blocking — fails on my branch too
RM: idk — anyone own this?
03 · NEEDS
QA needs a decision layer.
Reruns and gut feel don't scale to a thousand runs a day. What's missing is a neutral ledger that classifies every signal, names the owner, and tells you when to ship.
Bellwether is the bell.
bel-22 / a1b2c3 · run #1248 · 4m 12s · classified · ledger · decision ready
- auth/login.spec.ts · STABLE · 0.99
- checkout/cart.spec.ts · STABLE · 0.97
- api/release/should-block.spec.ts · REGRESSION · 0.92
- ui/breaking.spec.ts · STABLE · 0.95
- data/migration-up.spec.ts · FLAKE · 0.76
REGRESSION · conf 0.92
api/release/should-block.spec.ts
RM · backend
→ BLOCK release
FLAKE · conf 0.76
data/migration-up.spec.ts
PA · platform
→ SHIP (rerun confirmed)
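
The classified ledger above reads as a simple rule: a REGRESSION blocks the release, and a FLAKE ships once a rerun confirms it. The sketch below is one possible reading of that rule, not Bellwether's actual logic. `LedgerEntry`, the `rerun_confirmed` field, and the signature of `should_block_release()` are assumptions made for illustration; only the function name, verdicts, scores, and owners come from the page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LedgerEntry:
    """One classified test signal (hypothetical shape)."""
    test: str
    verdict: str                   # "STABLE", "REGRESSION", or "FLAKE"
    confidence: float
    owner: str = ""
    rerun_confirmed: bool = False  # assumed field: a rerun passed


def should_block_release(entries: List[LedgerEntry]) -> Tuple[bool, List[str]]:
    """Return the decision together with the evidence behind it."""
    evidence = []
    for entry in entries:
        if entry.verdict == "REGRESSION":
            evidence.append(
                f"{entry.test}: REGRESSION conf {entry.confidence:.2f} ({entry.owner})"
            )
        elif entry.verdict == "FLAKE" and not entry.rerun_confirmed:
            evidence.append(f"{entry.test}: FLAKE not yet confirmed by a rerun")
    return (len(evidence) > 0, evidence)


# The five entries from run #1248 as shown in the ledger.
ledger = [
    LedgerEntry("auth/login.spec.ts", "STABLE", 0.99),
    LedgerEntry("checkout/cart.spec.ts", "STABLE", 0.97),
    LedgerEntry("api/release/should-block.spec.ts", "REGRESSION", 0.92, "RM · backend"),
    LedgerEntry("ui/breaking.spec.ts", "STABLE", 0.95),
    LedgerEntry("data/migration-up.spec.ts", "FLAKE", 0.76, "PA · platform",
                rerun_confirmed=True),
]

block, evidence = should_block_release(ledger)
print("BLOCK" if block else "SHIP", evidence)
```

Returning the evidence list alongside the boolean keeps the call auditable: whoever overrides the decision can see exactly which entries drove it.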

§ 03 · DIAGNOSIS · THE OLD WORKFLOW · BWX/02

This is your quality decision layer today.

It's a Slack thread. Flaky test triage runs on tribal knowledge, "rerun it once," and the muscle memory of whoever happens to be online. CI failure triage is whatever the on-call engineer can remember at 4pm on a Friday. It works. Then your test count crosses a threshold and the noise becomes the signal.

# shop-ci · 14 members · release/2026.05

github · 14:02

CI run #2891 failed on release/2026.05: 1 failed, 0 passed

priya a. · 14:04

is sso flaking again? auth › sso redirect

jordan m. · 14:05

probably. it was flaky last week too. retrying.

sam k. · 14:08

it passed on retry. should I just merge or wait

🤷 3 · 👀 4

erin m. (eng lead) · 14:14

ship it. we’ll fix sso “next sprint.”

12 minutes · 5 humans · 1 unverified verdict · 0 audit trail

§ 04 · MECHANISM · THE DECISION LAYER · BWX/03

04 · The new decision layer

The same failure, twelve seconds later.

Bellwether reads your CI events as they land. Every failure ships with a verdict (flake, regression, infra, or new), a confidence score, weighted reasons, a recommended action, and the likely owner. The event model is neutral by design: build, deploy, and security signals join the same ledger as we widen coverage.

BELLWETHER

triage/shop/web/fail_8820

classified · 320ms

auth › sso redirect with state

file e2e/auth/sso.spec.ts:42 · commit a3c91f4 · duration 14.3s · t/o 30s · runner ubuntu-22.04 · gh-actions

VERDICT

likely flake

87% confidence

RECOMMENDED ACTION

Rerun once. If it fails again, quarantine and route to fe-platform.

recommended_action: rerun_once_then_quarantine_if_repeat

Reasons [ 4 ]

weights sum 0.87 · same as confidence

+0.34 · Failed 7 of last 40 runs and passed immediately on retry 6 times · history · playwright

+0.28 · No correlated code changes in touched files (auth/sso/*) · git-blame · github

+0.21 · Timeout signature matches historical flaky cluster c-09 (waitForURL on /sso/callback) · fingerprint · playwright

+0.04 · Concurrent runs against same SSO sandbox (3 parallel jobs in window) · environment · github
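
The card above states that the reason weights sum to the confidence score. The sketch below checks that with the four weights shown. The `Reason` type and the `confidence_from_reasons` helper are hypothetical names for illustration; the page does not describe how the score is actually computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reason:
    """One weighted reason (hypothetical shape)."""
    weight: float
    text: str
    kind: str    # e.g. "history", "git-blame"
    source: str  # e.g. "playwright", "github"


def confidence_from_reasons(reasons: List[Reason]) -> float:
    """Sum the weights, rounded to two decimals to absorb float noise."""
    return round(sum(r.weight for r in reasons), 2)


# The four reasons listed for fail_8820.
reasons = [
    Reason(0.34, "Failed 7 of last 40 runs and passed immediately on retry 6 times",
           "history", "playwright"),
    Reason(0.28, "No correlated code changes in touched files (auth/sso/*)",
           "git-blame", "github"),
    Reason(0.21, "Timeout signature matches historical flaky cluster c-09",
           "fingerprint", "playwright"),
    Reason(0.04, "Concurrent runs against same SSO sandbox",
           "environment", "github"),
]

confidence = confidence_from_reasons(reasons)
strongest = max(reasons, key=lambda r: r.weight)
print(confidence, strongest.kind)
```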

Same ledger · three surfaces

One classification, three calls: the human triage page, the REST API, and the MCP tool agents call. Same evidence, same confidence, same owner. No surface tells a different story.

HUMAN

bellwether.run/triage/fail_8820

The triage page. One verdict, weighted reasons, recommended action.

API

GET /v1/failures/fail_8820

Same JSON your dashboard reads. Hook it into Slack, Jira, your release bot.

AGENT · MCP

classify_failure(fail_8820)

Your release agent calls the same tool your engineers read. One ledger, both audiences.
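
Because all three surfaces read the same record, a release bot can build its Slack message from the same JSON the dashboard uses. The sketch below parses a sample body and formats a one-line summary. The payload shape is an assumption: the field names follow terms used on this page (`verdict`, `confidence`, `reasons`, `recommended_action`), while `id`, `owner`, and the nesting are hypothetical. No network call is made, since the API host is not stated here.

```python
import json
from typing import Any, Dict

# Hypothetical response body for GET /v1/failures/fail_8820.
SAMPLE_BODY = """
{
  "id": "fail_8820",
  "verdict": "flake",
  "confidence": 0.87,
  "recommended_action": "rerun_once_then_quarantine_if_repeat",
  "owner": "fe-platform",
  "reasons": [
    {"weight": 0.34,
     "text": "Failed 7 of last 40 runs and passed immediately on retry 6 times"}
  ]
}
"""


def slack_summary(failure: Dict[str, Any]) -> str:
    """Format one classified failure as a single Slack-friendly line."""
    confidence = failure.get("confidence")
    # A null confidence means insufficient history, so no number is printed.
    if confidence is None:
        conf_text = "no confidence yet"
    else:
        conf_text = f"{confidence:.0%} confidence"
    reasons = failure.get("reasons") or []
    if reasons:
        top = max(reasons, key=lambda r: r["weight"])["text"]
    else:
        top = "no reasons recorded"
    return (
        f"{failure['id']}: {failure['verdict']} ({conf_text}) | "
        f"action: {failure['recommended_action']} | "
        f"owner: {failure['owner']} | top reason: {top}"
    )


failure = json.loads(SAMPLE_BODY)
message = slack_summary(failure)
print(message)
```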

§ 04b · TRUST · HOW WE WON'T LIE TO YOU · BWX/03b

04b · Trust

No black-box verdicts.

Three things we wrote into the spec before we wrote a classifier. They are the difference between a confidence score you can trust and a number that gets ignored after the first wrong call.

COLD-START CAP

First 7 days, confidence is capped at 0.7.

On a new repo with no override history, every verdict ships with a low ceiling and a default of `unknown`. We earn calibration before we claim it.

INSUFFICIENT HISTORY

No history, no number. We return null.

If a test has run fewer than five times or the cluster lacks enough signal, the confidence field is `null`, not a fabricated low score. The product is honest about what it doesn't know.

FEEDBACK REQUIRED

Every verdict (UI, Slack, MCP) has a “was this right?” affordance.

Overrides are first-class events. The next classification on the same fingerprint reads your last override before the model fires. The learning loop is the moat; the affordance is non-optional.

Spec refs: CLASSIFY-05 · CLASSIFY-06 · LEARN-01 · LEARN-02. The learning loop is the moat. We built the affordance first.
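
The three rules above can be sketched as a single post-processing step on a raw classifier output. This is a minimal illustration under stated assumptions, not the CLASSIFY or LEARN spec. The thresholds (7 days, 0.7 cap, five runs) come from the text; the function name, the `Verdict` shape, the use of repo age to detect cold start, and the choice to let a prior override take precedence without a score are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

COLD_START_DAYS = 7   # from the text: first 7 days
COLD_START_CAP = 0.7  # from the text: confidence capped at 0.7
MIN_RUNS = 5          # from the text: fewer than five runs means null


@dataclass(frozen=True)
class Verdict:
    label: str
    confidence: Optional[float]  # None stands in for JSON null
    source: str                  # "override", "insufficient_history", or "model"


def finalize_verdict(
    fingerprint: str,
    model_label: Optional[str],
    model_confidence: float,
    run_count: int,
    repo_age_days: int,
    last_overrides: Dict[str, str],
) -> Verdict:
    # Assumption: the last human override on this fingerprint is read first
    # and wins outright; no score is invented for a human call.
    override = last_overrides.get(fingerprint)
    if override is not None:
        return Verdict(override, None, "override")

    label = model_label or "unknown"

    # Insufficient history: return null rather than a fabricated low score.
    if run_count < MIN_RUNS:
        return Verdict(label, None, "insufficient_history")

    # Cold start: clamp the score during the first week.
    confidence = model_confidence
    if repo_age_days < COLD_START_DAYS:
        confidence = min(confidence, COLD_START_CAP)
    return Verdict(label, confidence, "model")


capped = finalize_verdict("c-09", "flake", 0.92, 40, 2, {})
mature = finalize_verdict("c-09", "flake", 0.92, 40, 30, {})
sparse = finalize_verdict("c-09", "flake", 0.92, 3, 30, {})
overridden = finalize_verdict("c-09", "flake", 0.92, 40, 30, {"c-09": "regression"})
print(capped, mature, sparse, overridden, sep="\n")
```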

§ 05 · POSTURE · WORKS WITH YOUR STACK · BWX/04

05 · Posture

The neutral layer. Not another silo.

Bellwether sits across your stack. We read CI events; we don't replace your test framework or your tracker. Coverage starts with test signals and absorbs build logs, deploy gates, and security scans as we widen.

GitHub Actions: CI events · YAML step
Playwright: Test runner · traces
Cypress: Test runner · runs
Slack: Failed-run summaries
Jira: Auto-draft tickets
Linear: Route to owners

What we read
test names · errors · stack frames · retries · durations · git SHAs · branches

What we never read
source code · env vars · secrets · user data inside fixtures · production traffic
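
One way to enforce the read and never-read lists above is an allowlist applied before an event leaves the CI runner. The sketch below is illustrative only. The key names (`test_name`, `stack_frames`, `git_sha`, and so on) and the `redact_event` helper are hypothetical; the page lists the categories but not a wire format.

```python
from typing import Any, Dict

# Hypothetical keys for the categories listed under "What we read".
READ_FIELDS = frozenset({
    "test_name", "error", "stack_frames", "retries",
    "duration_ms", "git_sha", "branch",
})


def redact_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allowlisted keys; everything else is dropped by default."""
    return {key: value for key, value in event.items() if key in READ_FIELDS}


raw_event = {
    "test_name": "auth › sso redirect with state",
    "error": "Timeout 30000ms exceeded",
    "retries": 1,
    "duration_ms": 14300,
    "git_sha": "a3c91f4",
    "branch": "release/2026.05",
    "env": {"SSO_CLIENT_SECRET": "redacted-example"},  # never read
    "source": "export async function login() {}",      # never read
}

safe_event = redact_event(raw_event)
print(sorted(safe_event))
```

An allowlist fails closed: a new field added to the event later is dropped until someone explicitly permits it, which is the safer default for a never-read promise.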

● SOC 2 Type I · in flight
● Self-hosted ledger · Q4 ’26

§ 06 · PROOF · DESIGN PARTNERS · BWX/05

06 · Pilots

Three design partners. Twelve weeks. Numbers, not adjectives.

Targets · Cohort 01 · Q3 ’26 · not yet measured

−68%

TIME-TO-TRIAGE

Median minutes from CI failure to a verdict an engineer can act on. Cohort target: drop p50 from 21 → 7 minutes by week 12.

−41%

FALSE BLOCKERS

Releases held by flake-misclassified failures. Cohort target: prevent 41% of those holds without rolling back regressions.

3.2×

RELEASE DECISION SPEED

Median time from “RC built” to ship/block call. Cohort target: replace the meeting with a Slack click backed by a confidence score.

Cohort 01 starts Q3 ’26. We’ll publish weekly numbers (baselines and progress against these targets) once pilots are running. No retroactive revisions; the first reading is the first reading.

§ 06b · MEMBERSHIP · THE COHORT · BWX/06b

06b · Cohort → quarterly cadence, soft windows

Cohort 01 for Q3 '26.

A small first cohort. Real CI traffic. Direct line to the founders.

We onboard design partners one cohort per quarter. Cohort 01 is the first set of teams shipping on Bellwether. Their classifications, overrides, and owner-mappings shape V1.1. Their feedback is weighted higher in the learning loop than feedback from anyone who comes after, because they showed up first.

What you get

V1 access from day one
Co-design on V1.1
Founder Slack channel
Classification feedback weighted ×3 in the learning loop

What we ask

Real CI traffic, not a sandbox
One review call per cohort
Permission to learn from your labelled overrides

Lead the release.
Don't just run the tests.

Cohort 01 is forming for Q3 '26. Cohort 02 picks up Q4.

One repo, one CI, one Slack channel. Four minutes to set up.

No new dashboards to live in.

Talk to a founder · Read the docs

no credit card · revocable token · works with what you already ship on
