Skip to content
Writing

Apr 29, 2026 · AI

The Methodology Behind a Workflow and Its Harness

Reliable AI outcomes don't come from better prompts or better tools. They come from a designed workflow, a harness that supports it, and gates that catch the failure modes. A walk through the methodology in the order I actually use it — including when the right answer is 'this doesn't need a harness at all.'

Workflow methodology board showing outcome statement, workflow map, harness machinery, gates, verification, and feedback.

My earlier project note on harness abstractions was about what a harness is. This post is about how you design one. Different question, different answer.

I'm going to walk through the methodology I actually use, in the order I actually use it. Not as a generic SDLC framework — there are plenty of those — but as the specific moves I make when I go from "we need outcome X reliably" to a workflow and a harness that produce it. I'll also be honest about the cases where the right answer is "this doesn't need a harness at all." That answer comes up more often than people expect.

Why this isn't just SDLC with new vocabulary

Fair objection to head off early: spec, stages, gates, observability, retro — that's just process design. True at the level of words. Not at the level of failure modes.

AI work needs its own design discipline because it breaks in ways ordinary process language doesn't really catch:

  • Non-deterministic outputs. The same input produces different outputs across runs. Spec-and-test thinking that assumes repeatability quietly stops working.
  • Citation and fidelity problems. The output cites sources that don't exist, or real sources that don't say what the output claims. The artefact looks rigorous and isn't.
  • Silent quality drift. The system keeps producing fluent output as model behaviour shifts underneath it. Nothing fails loudly; quality just walks downhill.
  • False confidence from evaluators. AI evaluators tend to agree with AI drafters. Stacking a model on top of a model can feel like rigour while doing nothing for the outcome.
  • Tool and model churn. The substrate moves on a quarterly cadence. A workflow built around one provider's quirks ages out fast.
  • Publication errors after apparently-good drafting. The draft passes review; the rendered output is broken, miscategorised, or sent to the wrong audience.

Traditional SDLC was mostly built for systems that fail loudly. AI workflows fail quietly, fluently, and confidently. That's the gap this methodology is trying to close.

Start at the outcome, not the tool

The first artefact isn't a prompt. It isn't an agent. It isn't a workflow diagram. It's an explicit outcome statement, written down before anything else gets built.

What good output looks like. What failure looks like. How you'll know which one you got. That's it. Three things. If you can't write them in plain language, you don't yet understand the work well enough to design a system around it, and anything you build is going to be a tool looking for a problem.

The discipline here is harder than it sounds, because the temptation is to start with the harness — the harness is the visible artefact, the thing that feels like progress. Specifying the outcome feels like prep work. It isn't. It's the only step that determines whether everything downstream is pointing at the right target.

A good test: if two competent people read your outcome statement and would build different systems off it, the statement isn't done. Push it until it constrains the design.

A backbone that's held up so far

Once the outcome is specified, in the workflows I've designed I keep seeing roughly the same skeleton emerge:

  1. Signal — something tells the workflow there's work to do
  2. Gap check — is this work actually new, or already covered?
  3. Proposal (Gate 1) — the work is shaped, scoped, and admitted
  4. Research and draft — the AI execution does its thing
  5. Review (Gate 2) — quality check before anything ships
  6. Publish and verify (Gate 3) — the change reaches its destination, and we confirm it actually arrived as intended
  7. Feedback — what we learned shapes the next run

Whether this generalises further than my own production setup is honestly an open question. Treat it as a starting heuristic, not a universal law. The thing it's doing — separating deciding to do something from doing it from checking it from committing it from learning from it — is the part I'd defend. Most workflows that fail in production fail because two of those steps have been silently merged.

You can rename the steps. You can collapse some of them where the work is genuinely simpler. The skeleton isn't sacred; the separation is.

Workflow design comes BEFORE harness design

This is the move people skip most often, and it's the one that matters most.

The workflow tells you what the harness needs to do. The gates, the admission rules, the evaluation criteria, the observability — all of it is determined by the shape of the workflow, not by the cleverness of the tooling. If you design the harness first, you end up with a beautifully engineered system that can't tell you whether the work it's producing is actually good, because nobody specified what good means at each step.

I've watched this happen plenty of times. Team gets excited about an agent framework, builds a slick orchestration layer, plugs it in — and three weeks later nobody can answer "is this thing actually working?" The answer was supposed to come from the workflow design that didn't happen.

In my experience the order matters: outcome → workflow → harness → gates → measurement. Skipping a step doesn't make you faster. It just means you'll discover what you skipped later, in production, with consequences attached.

There's also a quiet superpower in this order: it makes "we don't need a harness" a legitimate output of the design process. Sometimes the workflow is short, the consequence of failure is small, and the failure modes are self-correcting enough that a checklist and a person are all you need. Recognising that early saves enormous amounts of wasted machinery.

Gates by failure mode

Two-stage approval gets dismissed as bureaucracy more often than it deserves. It isn't. It's where the workflow earns the right to scale.

Each gate is doing a specific job, and the jobs are not interchangeable:

  • Gate 1 catches scope drift. Is this actually the work we want to do, framed the way we want to frame it? Most bad outputs were doomed at the proposal stage — the AI did exactly what it was asked, and what it was asked was the wrong thing.
  • Gate 2 catches quality drift. The draft exists. Does it meet the spec? This is where the outcome statement you wrote at the start finally pays off — without it, Gate 2 collapses into "does this look right to me today?" which is not a gate.
  • Gate 3 catches publication risk. Even if the work is good, is this the right moment, audience, and channel for it? Gate 3 is small but load-bearing — most embarrassing AI publication incidents I've seen are Gate 3 failures, not Gate 2 failures.

The instinct to collapse three gates into one ("we'll just have a senior person review the final draft") is the instinct to do less design work. It feels efficient. It produces brittle systems. The three gates exist because they catch different failure modes — and catching a Gate 1 failure at Gate 3 is usually much more expensive, because by then you've paid for the research, the drafting, and the review of work that shouldn't have been admitted in the first place.

If a gate's only criterion is "does this look okay to the reviewer?", it's theatre, not governance. The gate needs to know, in advance, what it's looking for.

Machinery the workflow demands

Once the workflow is specified, the harness's job is concrete: build the smallest set of machinery that makes each step happen reliably. The test for every piece is does removing it cause a known failure to slip through? If yes, keep it. If no, it's costume.

A few of the load-bearing pieces I keep coming back to:

  • Durable queue. Work in progress survives crashes, restarts, and pauses. Sounds obvious; almost nobody does it on first build.
  • Explicit status flow. Every work item has a known state — proposed, drafting, in review, approved, published, verified. The state lives in the system, not in someone's head.
  • Claim-to-source tracking. Every claim in the output traceable to a specific source. Without this, evaluation collapses into vibes.
  • Discrete quality checks. Sourced claims, real URLs, fidelity, confidence thresholds, scannable structure. Each is a yes/no the harness runs before a human is involved. No soft "looks good."
  • Two separate checks before Gate 2. One asks: was the work actually done, or just nodded at? The other: does the report faithfully represent the evidence? Different failure modes; an ordinary single review tends to miss one.
  • Rules for "applied vs considered." Explicit language that stops the system describing itself in terms that overstate what it did. AI output is fluent enough to make "we considered X" sound like "we did X" by default.
  • Explicit unresolved questions. Every output ends with the questions the system cannot resolve and is handing back to the human. Surfacing limits is part of the work.

None of these are exotic. They exist because specific failure modes exist, and ordinary review tends to miss them in fluent AI output.

Avoiding epistemic theatre

This is the failure mode I worry about most, and the one that's hardest to see from inside your own design.

Epistemic theatre is what you get when the system looks more rigorous without actually changing what gets produced. Extra review steps that always approve. Evaluator agents that never disagree with the drafter. Confidence ratings uncorrelated with actual accuracy. Citations that exist but don't support the claims they're attached to.

The test I run is unforgiving: is this design actually changing what gets produced, or just dressing up the same process? If I can't show a class of bad outputs that used to slip through and now don't, the new machinery isn't doing real work.

The model council I use for strategic claims exists partly to keep this question honest. Three plausible improvements I once proposed to that council methodology — a calibration database, diminishing-returns logic, and accumulated memory across runs — got rejected this way. Each sounded great. Each failed the test of changing the outcome distribution in a measurable direction. Each would have made the system look more sophisticated while doing nothing for the work.

The discipline isn't glamorous. It's the willingness to delete features that don't earn their place — including features you personally invented, especially those.

Model-agnostic by default

One more rule I'd put close to the top: I'd avoid making the harness depend on a specific model.

Models change quarterly. The workflow has to outlive them. If your harness is built around the quirks of one provider's endpoint, you've built a system with a six-month half-life. In practice this means the workflow refers to roles (drafter, evaluator, verifier) rather than specific models, and the prompts and contracts are versioned and portable — they live with the harness, not inside any one tool's prompt library.

This isn't a purity argument; it's a survival argument. The harness I've put the most work into has been through three model generations underneath it. The workflow design and the gates haven't changed. That's what model-agnostic actually buys you — continuity through the churn.

A worked walkthrough

Let's make this concrete with a deliberately simple example, so the logic ports.

Suppose you run a small weekly newsletter and want it drafted reliably without babysitting.

Outcome. Good: on-topic, factually accurate, in your voice, doesn't embarrass you on Monday. Failure: stale, off-voice, weakly sourced, or quietly inaccurate. You'll know by tracking unsubscribes and corrections in the week after each send.

Workflow. Signal (candidate stories from the week) → gap check (have we covered this angle?) → proposal (Gate 1: you approve the angle and the stories) → draft (sources tracked per claim) → review (Gate 2: voice, sources, accuracy) → publish (Gate 3: you confirm send window and audience) → post-send check (links, rendering) → feedback.

Harness. Durable queue so candidates don't drop. Status flow so you can see what's where. A drafting role and a separate checker role, both swappable. Discrete quality checks before Gate 2 instead of "you eyeball it." A short manifest so someone else could run next week's issue.

Gates. Gate 1 catches "wrong angle" before drafting cost is spent. Gate 2 catches voice drift and weak sourcing. Gate 3 catches "shouldn't send this today."

Notice what's not in there: any specific model, any agent framework, any tool brand. Those are configuration. The design holds without them.

The method, in checklist form

If you want to compress all of this into something you can apply on a Monday:

  1. Write the outcome statement. Good output, failure output, how you'll know — in plain language. Don't proceed until two competent readers would build the same thing from it.
  2. Map the workflow. Use the seven-step backbone as a starting point. Collapse where simpler is honest; extend where the work genuinely needs more steps.
  3. Decide whether you need a harness at all. Short workflow, small consequence, self-correcting failure modes — a checklist and a person beat any harness. Pick that path when it's the honest answer.
  4. Design the gates by failure mode. Each gate should catch a specific class of failure that wouldn't otherwise be caught.
  5. Specify only the machinery the workflow demands. Build what the workflow actually needs — anything else is costume.
  6. Stay model-agnostic by default. Roles, not vendors. Versioned prompts and contracts. Configuration, not architecture.
  7. Stress-test against epistemic theatre. For every component, can you name a class of bad outputs it actually prevents? If not, delete it.
  8. Measure against the outcome statement, not against itself. The harness is real if it changes what gets produced. Otherwise it's a really nice diagram.

Deliberately not glamorous. The leverage isn't in cleverness; it's in doing the steps in order and being honest at each one.

What to try

Pick one workflow you actually run — preferably one with AI in the loop — and walk it through this post. Write the outcome statement first. Map the skeleton and notice which steps you've silently merged. Locate your gates and ask which failure mode each one catches. Look at your machinery and ask which pieces you could remove without losing a real check. Then ask the unforgiving question: is this changing what gets produced, or is it dressing up the same process?

You'll probably find one of two things. Either the workflow is in better shape than you thought — useful. Or you'll find a gap you didn't know was there — even more useful.

The methodology isn't the point. The discipline of designing in this order, and being willing to delete the parts that don't earn their place, is.


Companion piece to [my earlier project note on harness abstractions](/projects/building-ai-systems-at-higher-abstractions). That one was about the artefact; this one is about how to design it. Both are drawn from real production work, with all the rough edges that implies.

AIWorkflow DesignHarness DesignMethodology