Apr 03, 2026 · AI

Building AI Systems at Higher Abstractions

The thing we engineer is climbing. The principles aren't. A look at why the harness is becoming the right unit of work in AI, what travels across abstraction layers, and what it takes to call the practice engineering honestly.

Figure: Layered abstraction stack showing code, workflow harness, and meta-harness layers connected by recurring engineering controls.

The shift I keep noticing

There's a shift I don't see named cleanly enough, and I want to try to name it here. What we build in AI work is climbing. Not the tools we use — the actual artefact we produce. Five years ago, a senior practitioner mostly produced code. Today, the same practitioner increasingly produces something else: a system that produces code, content, decisions, or other systems, with AI doing the execution underneath.

That "something else" is honestly hard to talk about, because we don't really have settled vocabulary for it yet. People reach for "agent," "workflow," "pipeline," "orchestration." Each one captures part of it. None of them quite captures the whole thing.

I'm going to call it a harness in this post — partly because that's the term I've used internally for a few years now, and partly because nothing else fits as well. But the post isn't really about the word. It's about the abstraction layer the word points at, and what changes about how we think, decide, and design as we move up the stack.

If you're already familiar with compound AI systems, agent harness engineering, hierarchical multi-agent orchestration, or eval-driven agent design, this post is a synthesis of those ideas through the lens of organisational practice, not a fresh coinage.

What "engineering" means when the artefact isn't code

Engineering used to mean building things in code. More and more, it means building systems that produce outcomes, where the production happens in some mix of code, models, agents, and human judgement.

The principles really haven't changed. Specifications still need to be precise. Acceptance criteria still need to be measurable. Failure modes still need to be anticipated. Observability is still load-bearing. Reuse and versioning still beat one-off cleverness. What's changed is just what those principles apply to. They now apply to the design of the harness — the system that wraps the AI execution and keeps it producing successful, repeatable outcomes in complex, many-step work.

That's the through-line for this whole post: as we climb the abstraction stack, the artefact becomes more abstract, our role as humans moves higher up, and the same engineering principles travel with us — applied to a different unit of work.

The harness, defined modestly

For the purposes of this post, a harness is the system around an AI execution that lets it produce successful and repeatable outcomes in a complex, multi-step workflow.

The key word there is system. A prompt on its own is too thin to carry admission rules, evaluation, and consequence bounds. An agent abstraction by itself doesn't guarantee a definition of done, portable evaluation, or bounded authority. A workflow diagram doesn't carry the gates that catch failure. The harness is what holds all of that together so the work is reusable, governed, and improvable over time.

What's inside a serious harness:

  • An admission contract — what kinds of work it accepts, and what it refuses
  • A specification — what good output looks like, what failure looks like
  • Evaluation — how the harness knows whether the output met the spec
  • Gates — where a human is required, and where the harness is allowed to commit
  • Observability — how the harness exposes what it did and why
  • A manifest — a portable description so someone else can run, audit, or improve it

A workflow without these is craft. A workflow with these is something a team can actually rely on.
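To make the list above a little more concrete, here's a minimal sketch of what a manifest could look like as a data structure. Everything in it is an assumption for illustration: the class name `HarnessManifest`, the field names, and the idea of representing each part as a list of plain-text statements. It isn't a standard format, just one way to see at a glance which parts a workflow is still missing.

```python
from dataclasses import dataclass, field
from typing import List

# The five parts a manifest describes (the manifest itself is the sixth).
# Names are illustrative, not a standard.
REQUIRED_PARTS = ("admission", "specification", "evaluation",
                  "gates", "observability")


@dataclass
class HarnessManifest:
    name: str
    version: str
    admission: List[str] = field(default_factory=list)      # work accepted / refused
    specification: List[str] = field(default_factory=list)  # what good output looks like
    evaluation: List[str] = field(default_factory=list)     # checks run before a human looks
    gates: List[str] = field(default_factory=list)          # where a human is required
    observability: List[str] = field(default_factory=list)  # what each run exposes

    def missing_parts(self) -> List[str]:
        """Return the parts that have nothing written down yet."""
        return [part for part in REQUIRED_PARTS if not getattr(self, part)]


# Hypothetical draft: only admission and specification are written down.
draft = HarnessManifest(
    name="sdlc-defects",
    version="0.1.0",
    admission=["accept: defects within a known module",
               "refuse: net-new architectural decisions"],
    specification=["tests passing", "change scoped to the relevant module"],
)
print(draft.missing_parts())  # ['evaluation', 'gates', 'observability']
```

A draft that reports missing parts is still craft in the sense above; the point of the structure is only to make the gaps visible.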

Worked example A — an AI-SDLC harness

Let's take a recognisable case: a delivery team using AI to help them ship software faster. Most teams' first instinct is to hand engineers some AI coding assistants and call it done. That's a tooling change, not a system change.

A harness for the same problem looks pretty different. The team specifies:

  • Admission: what kinds of tickets the harness will take. Defects within a known module, yes. Net-new architectural decisions, no.
  • Specification: what "done" means for an accepted ticket — tests passing, conventions followed, change scoped to the relevant module, no schema changes without a human-authored ADR.
  • Execution layer: an AI agent (or several) does the implementation, runs tests, opens a PR, summarises the change.
  • Evaluation: automated checks for the conventions, scoped diff size, and test outcomes. A second AI evaluator scores the PR against the spec before a human even looks at it.
  • Gates: low-risk PRs merge automatically once a senior engineer approves the spec match, with no line-by-line code review. Medium-risk PRs require human code review. Anything touching authentication, billing, or data migration is gated to a named human regardless of size.
  • Observability: every harness run logs the spec, the agent's reasoning trace, the evaluator's score, the gate decision, and the eventual outcome.
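The admission and gate rules in that list can be sketched as two small functions. This is a hedged illustration, not the team's actual implementation: the ticket-kind strings, the `PullRequest` fields, and the `Gate` names are all hypothetical, and how a PR gets its risk label is left out entirely. One assumption worth flagging: any risk level other than "low" falls back to human code review, which is my conservative default and not something the rules above specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Gate(Enum):
    SPEC_MATCH_APPROVAL = "senior engineer approves the spec match, then merge"
    HUMAN_CODE_REVIEW = "human code review"
    NAMED_HUMAN = "named human, regardless of size"


# Hypothetical labels for the areas and ticket kinds named in the example.
SENSITIVE_AREAS = frozenset({"authentication", "billing", "data_migration"})
ACCEPTED_TICKET_KINDS = frozenset({"defect_in_known_module"})


@dataclass(frozen=True)
class PullRequest:
    areas_touched: FrozenSet[str]
    risk: str  # "low" or "medium"; the risk assessment itself is out of scope


def admit(ticket_kind: str) -> bool:
    """Allowlist admission: anything not explicitly accepted is refused."""
    return ticket_kind in ACCEPTED_TICKET_KINDS


def route(pr: PullRequest) -> Gate:
    """Pick a gate by consequence first, then by risk level."""
    if pr.areas_touched & SENSITIVE_AREAS:
        return Gate.NAMED_HUMAN
    if pr.risk == "low":
        return Gate.SPEC_MATCH_APPROVAL
    return Gate.HUMAN_CODE_REVIEW  # assumed default for anything not "low"


print(admit("net_new_architecture"))
print(route(PullRequest(frozenset({"billing"}), risk="low")).name)
```

The ordering inside `route` is the design choice that matters: the sensitive-area check runs before the risk check, so a small, low-risk change to billing still lands with a named human.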

The human's job here isn't typing code. It's authoring the specification, calibrating the gates, and watching the observability layer for drift. The team's senior engineer becomes a designer of the system that produces the engineering, rather than a producer of the engineering itself.

That's one harness. It handles many tickets. The leverage is real but bounded: it only works for the kinds of tickets it was admitted to handle.

Worked example B — the harness that builds AI-SDLC harnesses

Now let's climb one layer.

Suppose you've got ten product teams, each of which would benefit from its own AI-SDLC harness. Each team has a different tech stack, different conventions, and a different risk profile. Building ten bespoke harnesses by hand is expensive and slow, and it produces wildly inconsistent quality.

The next move is a harness that builds AI-SDLC harnesses. The inputs are different. The outputs are different. The principles are the same.

Here's what that meta-harness specifies:

  • Admission: which kinds of teams it accepts. Teams with a coherent codebase, defined conventions, and a senior engineer willing to act as the human-in-the-loop, yes. Teams without those, no — the meta-harness produces broken harnesses if its inputs are weak.
  • Specification: what a "good" SDLC harness looks like for the requesting team — tuned admission criteria, calibrated gates, evaluation suite matched to the team's tech stack, manifest documented in their own language.
  • Execution layer: the meta-harness interviews the team, ingests their codebase, drafts admission rules and gates, generates the evaluation suite, produces the harness manifest.
  • Evaluation: does the produced harness actually meet the spec? Does it correctly refuse work outside its admission criteria? Do its gates fire where they should? A second evaluator pressure-tests the harness against synthetic tickets before it ever goes live.
  • Gates: a senior practitioner (not from the requesting team) reviews the produced harness before deployment. Production rollout is staged.
  • Observability: the meta-harness logs every harness it produced, what it tuned and why, and how those harnesses performed in production. That feedback shapes the next harness it builds.
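The evaluation step above, pressure-testing a produced harness against synthetic tickets, can be sketched in a few lines. This is a simplified illustration under stated assumptions: a produced harness is reduced to a single `admit` callable, a synthetic ticket is just a kind string paired with the expected decision, and `too_permissive` is a made-up example of a miscalibrated harness. A real pressure test would cover gates and evaluation as well as admission.

```python
from typing import Callable, List, Tuple

# (ticket kind, should the produced harness accept it?)
SyntheticTicket = Tuple[str, bool]


def pressure_test(admit: Callable[[str], bool],
                  tickets: List[SyntheticTicket]) -> List[str]:
    """Return the ticket kinds where the harness's decision differs from the expectation."""
    return [kind for kind, expected in tickets if admit(kind) != expected]


def too_permissive(kind: str) -> bool:
    """Hypothetical produced harness whose admission contract accepts everything."""
    return True


synthetic_tickets = [
    ("defect_in_known_module", True),
    ("net_new_architecture", False),
]

failures = pressure_test(too_permissive, synthetic_tickets)
print(failures)  # ['net_new_architecture']
```

An empty result only says the harness matched expectations on these tickets. The synthetic set is itself part of the meta-harness's specification, so it needs the same care as any other spec.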

The human at this layer isn't writing tickets, and isn't designing one team's harness. They're designing the thing that designs harnesses. The skills required are genuinely different — more curatorial, more architectural, more about specifying kinds than instances.

Two harnesses, one stacked above the other. Same load-bearing parts. Different artefacts, different humans, different skills.

The principles that travel across layers

Here's what stays constant as the abstraction climbs:

  • Specification before execution. The harness is only as good as the spec it carries. This is true at every layer — it just gets harder to write specs for things that produce other things.
  • Evaluation grounded in measurable criteria. "Looks right" isn't evaluation. The harness needs to grade its own output before a human sees it.
  • Gates keyed to consequence, not process. Where the harness commits to something irreversible, a human is required. Where it doesn't, the human is optional.
  • Admission contracts. The harness refuses what it shouldn't accept. Without this, it silently fails on out-of-scope work.
  • Observability with provenance. Every output should be traceable to the spec, the execution path, and the evaluation that approved it.
  • Reuse via manifest. The harness is portable, auditable, and versionable. Someone else can run it.

These are all recognisable engineering principles. They apply to code. They apply to harnesses. They apply to harnesses that build harnesses. The artefact changes; the discipline doesn't.
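As one illustration of "observability with provenance", here's a minimal sketch of a record that ties an output back to the spec, the execution path, and the evaluation that approved it. The record shape, the field names, and the choice to fingerprint the spec with SHA-256 are my assumptions for the example; the identifiers in the usage lines are invented.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    output_id: str            # the PR, document, or decision that was produced
    spec_sha256: str          # fingerprint of the exact spec version used
    execution_trace_ref: str  # pointer to the stored execution trace
    evaluator_score: float    # score given before any human looked
    gate_decision: str        # which gate fired and what it decided


def spec_digest(spec_text: str) -> str:
    """Fingerprint a spec so a record can be matched to the exact version that produced it."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()


# Hypothetical values, for illustration only.
spec_v1 = "tests passing; change scoped to the relevant module"
record = ProvenanceRecord(
    output_id="example-output-1",
    spec_sha256=spec_digest(spec_v1),
    execution_trace_ref="traces/example-output-1.json",
    evaluator_score=0.92,
    gate_decision="human_code_review: approved",
)
print(json.dumps(asdict(record), indent=2))
```

Hashing the spec rather than storing a version label means an edited spec can't silently share an identity with the one that approved earlier outputs.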

What changes between layers

What does change is the human's job:

  • Skills shift from execution to specification. Lower layers reward people who do the work well. Higher layers reward people who can specify what good work looks like and notice when the system has stopped producing it.
  • Altitude shifts up. Lower layers reason about a single ticket, a single output, a single decision. Higher layers reason about kinds of tickets, classes of outputs, patterns of decisions.
  • Failure modes shift. Lower-layer failures are local and visible — a bad PR, a wrong answer. Higher-layer failures are systemic and much quieter — a meta-harness that produces subtly miscalibrated harnesses, none individually broken, all collectively drifting.
  • "Done" gets harder to define. A piece of code is done when it passes tests. A harness is done when it produces good outputs reliably. A meta-harness is done when the harnesses it produces produce good outputs reliably. Each layer adds another feedback loop you have to design.

Worth saying: the skills aren't a strict hierarchy. A great harness designer isn't automatically a better engineer than a great coder. They're different skills, applied to different artefacts, with different feedback loops. The mistake is assuming the higher layer subsumes the lower one. It doesn't. It depends on it.

Governance keys to blast radius, not depth

This next rule is the one that survived the most adversarial review I could throw at it.

Building deeper — more layers, more fan-out, more authority — is a value play. It's how one human's specification ends up reaching more deliverables. But governance can't be set by depth, because depth doesn't tell you what happens when something goes wrong.

Governance has to be set by blast radius: what the system can reach, change, publish, or commit to if it executes incorrectly. A deep harness with a small blast radius is fine. A shallow harness with a large blast radius is not.

Most production incidents I've seen come from misreading the second case as safe because it "only has one layer." Process-bounded gating ("we always have a human approve step 3") is theatre when the failure mode bypasses step 3 entirely. Outcome-bounded gating — keying the gate to the actual consequence of failure — is the only kind that really survives.
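Here's a small sketch of what keying a gate to blast radius, and not to depth, might look like. The `HarnessProfile` fields and the `max_unattended_reach` threshold are hypothetical; how you actually measure reach or irreversibility will depend on your systems. Note that `depth` is recorded but deliberately never consulted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessProfile:
    depth: int                     # nested layers; recorded but not used for gating
    can_commit_irreversibly: bool  # can it publish, delete, pay, or migrate on its own?
    systems_reachable: int         # how many systems a bad run could touch


def requires_human_gate(profile: HarnessProfile,
                        max_unattended_reach: int = 1) -> bool:
    """Gate on consequence: irreversibility or wide reach, never on depth."""
    return (profile.can_commit_irreversibly
            or profile.systems_reachable > max_unattended_reach)


# Hypothetical profiles matching the two cases described above.
deep_but_contained = HarnessProfile(depth=3, can_commit_irreversibly=False,
                                    systems_reachable=1)
shallow_but_wide = HarnessProfile(depth=1, can_commit_irreversibly=True,
                                  systems_reachable=5)

print(requires_human_gate(deep_but_contained))  # False
print(requires_human_gate(shallow_but_wide))    # True
```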

Errors compound at the same exponent as value. The harness that scales good work also scales mistakes. The whole point of blast-radius gating is to make that second kind of compounding visible before it becomes irreversible.

One specific risk worth naming: nested harness systems make common-mode failure easier to generate and harder to detect. A subtle flaw in an upper-layer harness produces lower-layer harnesses that all carry the same flaw. No single output looks broken; the whole population is just quietly drifting. This isn't a brand-new failure mode — software has dealt with versions of it for decades in supply chains, CI/CD pipelines, and configuration propagation — but it shows up at a different velocity in nested AI systems, and most teams' review patterns aren't designed to catch it.
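A rough sketch of why per-output review misses this: each harness can sit inside its own tolerance while the population as a whole has moved. The check below is a deliberately crude assumption on my part (a mean evaluator score per harness, a fixed baseline, and two hand-picked tolerances), and the harness names and numbers are invented for illustration. A real system would want a proper statistical test.

```python
from statistics import mean
from typing import Dict, List


def individually_flagged(scores: Dict[str, float], baseline: float,
                         per_harness_tolerance: float) -> List[str]:
    """Harnesses whose own score is far enough from baseline to look broken."""
    return [name for name, score in scores.items()
            if abs(score - baseline) > per_harness_tolerance]


def population_drift(scores: Dict[str, float], baseline: float,
                     population_tolerance: float) -> bool:
    """True when the population mean has moved, even if no single harness stands out."""
    return abs(mean(list(scores.values())) - baseline) > population_tolerance


# Invented numbers: every harness is slightly low, none dramatically so.
baseline = 0.90
scores = {"team_a": 0.86, "team_b": 0.87, "team_c": 0.86, "team_d": 0.85}

print(individually_flagged(scores, baseline, per_harness_tolerance=0.06))  # []
print(population_drift(scores, baseline, population_tolerance=0.02))       # True
```

The population tolerance can be tighter than the per-harness one because averaging across harnesses smooths out individual noise, which is exactly what lets a shared flaw show up.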

"Engineering" as a destination word

I'm careful with the word engineering. It implies codified patterns, repeatable validation, measurable failure modes, and a body of practice you can teach to a competent practitioner who's never met the original author.

Honestly, harness work mostly doesn't meet that bar today. There's craft. There's intuition. There are individual practitioners doing genuinely impressive work. But there isn't yet a settled pattern catalogue, a validation suite, or a failure taxonomy that an outside practitioner could pick up and apply.

So the honest claim is this: harness work is a discipline in formation. The pattern catalogue is being extracted from real production work right now. Calling it "engineering" today is a commitment to the catalogue, not a description of the present state.

I'd rather demote the word than overclaim it. "Architecture" or "design" is closer to where the practice actually sits today. "Engineering" is where it's heading, and the gap between those two words is basically the work worth doing.

Worth being equally honest about the frontier. For most teams in 2026, the practical edge is shallow harnesses with strong evaluation — not deep recursive systems. Meta-harnesses that build other harnesses are real and useful — I work on them — but they're frontier practice, not the default move. A team without a working single-layer harness has no business designing a meta-harness. The order matters.

A Monday-morning audit

If any of this is landing, here's something worth trying this week.

Pick one AI-assisted workflow your team actually relies on. Then walk through it:

  1. Where does the spec live? Is it written down, or is it sitting in someone's head? If you can't show it to a new team member, what you have is craft, not a system.
  2. What does the admission contract look like? What kinds of work does this workflow accept, and what does it refuse? If the answer is "basically anything that gets sent to it," you don't really have one yet.
  3. How does the harness evaluate its own output? If the answer is "a human reviews it," that's a gate, not evaluation. Evaluation is whatever the harness does before the human gets involved.
  4. Where are the gates, and what consequence are they keyed to? Process-shaped gates ("we always review step 3") are weaker than consequence-shaped gates ("anything that touches production data gets gated"). Move yours toward consequence.
  5. What's the blast radius if this workflow fails silently for a month? If it's small, you can probably run with less observability. If it's large, you need provenance, logs, and a way to detect drift across the whole population — not just per-output.
  6. What's missing from the minimum machinery? Admission contract, spec, evaluation, gates, observability, manifest. Pick the weakest one. Fix it next.

Do that walk-through once and you'll see the actual shape of your harness — and probably notice it's missing more pieces than you thought.

The abstraction stack is going to keep climbing. The artefacts we build are going to keep getting harder to name. The principles travel, though. Build the discipline at the layer you're at, honestly, before reaching for the next one up.


This post draws on current applied work building harness practice and repeated adversarial review. The pattern catalogue mentioned above is still in development; I'll update the post when it is ready to share.
