From Prompts to Harnesses
The way I work with code has changed faster than I have been able to find words for it.
A year ago, the skill everyone talked about was prompt engineering — how to phrase a question so the model would give you something useful. Six months later, that shifted to context engineering — how to manage what the model knows, what it sees, what it remembers. Now the conversation has moved again, to harness engineering — how to build the system around the model so it actually works reliably.
Three names in roughly eighteen months. That is not just vocabulary churn. Each new name caught on because the previous one stopped explaining the part that mattered most. And I think the progression itself tells us something important about where AI tooling is heading and what the work of building software is becoming.
Each stage is a discovery about where the leverage actually is
The interesting thing about these transitions is not the definitions. It is what each one revealed about the limits of the stage before it.
Prompt engineering assumed the model was smart enough — you just needed to unlock it with the right words. The hard problem was communication. If you could phrase your intent precisely, the model would deliver. And for a while, that was roughly true. Simple tasks — generate this text, summarize that document, translate this paragraph — responded well to better prompts. The lever was language.
Then people started building more ambitious things: multi-step workflows, retrieval-augmented systems, agents that used tools. And prompt engineering hit a wall. It did not matter how well you phrased a question if the model was reasoning over the wrong information. The bottleneck was not communication. It was what the model could see.
That is what context engineering made explicit. The lever moved from “say the right thing” to “show the right thing.” And the key insight was subtler than it sounds: context engineering is curation, not accumulation. Models have finite attention budgets. More context often makes things worse — accuracy drops as input length grows, important details get lost in the middle. The real discipline was deciding what to show, what to hide, what to summarize, and what to fetch on demand. Information architecture, not “give it more background.”
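The curation discipline described above can be sketched in a few lines. This is a hypothetical illustration, not any particular framework's API: `Snippet`, `pack_context`, the relevance scores, and the four-characters-per-token estimate are all assumptions. The point is the shape of the decision — rank by relevance, fit a budget, and keep track of what was withheld so it can be fetched on demand rather than stuffed into the prompt.

```python
# Hypothetical sketch of context-as-curation. The names and the crude
# token estimate are illustrative assumptions, not a real library.
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    relevance: float  # assumed to come from a retriever; higher is better

    @property
    def tokens(self) -> int:
        # Rough token estimate: ~4 characters per token.
        return max(1, len(self.text) // 4)


def pack_context(
    snippets: list[Snippet], budget: int
) -> tuple[list[Snippet], list[Snippet]]:
    """Greedily keep the highest-relevance snippets that fit the budget.

    Returns (included, withheld) so the caller can fetch withheld
    material on demand instead of accumulating it in the prompt.
    """
    included, withheld = [], []
    used = 0
    for s in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        if used + s.tokens <= budget:
            included.append(s)
            used += s.tokens
        else:
            withheld.append(s)
    return included, withheld
```

The design choice worth noticing is the second return value: a curation system needs an explicit notion of what it chose not to show.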
Context engineering was a meaningful upgrade. But it was still fundamentally about optimizing a single inference — making one model call as good as possible. And once people started running agents at scale, a different wall appeared. You could feed a model perfect context and get a great result ninety percent of the time. But ninety percent reliability across a thousand actions means a hundred failures. Perfect context for one inference does not guarantee reliability across thousands.
The bottleneck was not the information. It was the system.
That is the transition we are living through now. Each stage’s practitioners believed they were solving the whole problem. Each transition revealed it was a subproblem.
If the pattern feels familiar, it should. Software engineering has been climbing the same ladder for decades — machine code to assembly to high-level languages to frameworks to declarative systems. Each step moved the developer further from the machine and closer to intent. The AI evolution is replaying that arc, compressed into months instead of decades.
Harness engineering is a fundamentally different kind of problem
The reason harness engineering feels different from what came before is not just that it operates at a higher level. It is that the nature of the work changes.
Prompt engineering and context engineering are both about optimizing a single interaction. You craft the input, you evaluate the output, you adjust. The feedback loop is immediate — seconds, maybe minutes. The unit of work is one inference. This is craft. You are making individual artifacts well.
Harness engineering operates on a different timescale and a different unit. You design a system, run it across hundreds or thousands of agent actions, observe aggregate behavior, adjust constraints, and repeat. The question stops being “was this output good?” and becomes “does this system produce acceptable outputs reliably over time?”
That is the difference between a potter making one excellent bowl and an engineer designing a factory that produces ten thousand acceptable bowls. The skills are related, but the disciplines are not the same.
And most of harness engineering is about what happens after the model produces output — not before. That distinction matters because it is easy to confuse harness engineering with context engineering. Context engineering is about making things clearer for the AI. Harness engineering is mostly about what you do with the AI’s output: verification steps, guardrails that prevent dangerous actions, automated validation, feedback loops that catch a category of failure and prevent it from recurring. You stop fixing individual outputs and start fixing the system.
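The shape of that after-the-output work can be sketched as a loop. This is a minimal sketch under assumptions: `generate` stands in for any model call, and `validate` is a placeholder for real linters, tests, and policy gates. The structural point is that every output is checked, failures are fed back as context for a retry, and nothing that fails a guardrail gets through.

```python
# Minimal harness loop: validate every output, retry with feedback,
# escalate when validation never passes. The callables are placeholders
# for a real model call and real checks.
from typing import Callable


def run_with_harness(
    generate: Callable[[str], str],        # model call: prompt -> output
    validate: Callable[[str], list[str]],  # returns a list of violations
    prompt: str,
    max_attempts: int = 3,
) -> str:
    feedback = ""
    for _ in range(max_attempts):
        output = generate(prompt + feedback)
        violations = validate(output)
        if not violations:
            return output  # passed every guardrail
        # Feed the failures back instead of fixing the output by hand.
        feedback = "\n\nPrevious attempt failed checks: " + "; ".join(violations)
    raise RuntimeError("output never passed validation; escalate to a human")
```

Notice that the human effort in this sketch lives entirely in `validate` and in the escalation path, not in the output itself.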
This is what made OpenAI’s Codex story interesting when it came out earlier this year. Their team built roughly one million lines of production code over five months with zero manually written code — about 1,500 merged pull requests from a team of seven engineers. The breakthrough was not the model. It was the harness. Custom linters caught structural drift. An AGENTS.md file gave agents a table of contents for the codebase. Structural tests enforced architectural constraints without manual review. Verification loops validated changes before they merged.
No single output had to be perfect. The harness turned unreliable parts into a reliable whole.
Mitchell Hashimoto, who co-founded HashiCorp, named this stage in a way that stuck: “Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.” That is not prompt optimization. That is systems engineering. That is building for expected failure.
The roles literally inverted
Once you internalize that harness engineering is about building for expected failure, something uncomfortable comes into focus about what the developer’s job has actually become.
In traditional software engineering, the relationship between human and machine is clear. The human authors the code. The machine verifies correctness — through tests, type checkers, linters. You write a specification, implement it, and run automated checks to confirm the implementation meets the spec. If the tests pass, the code is correct. Authorship is human. Verification is mechanical.
Harness engineering inverts that.
The machine does the authoring. It writes the code, generates the output, produces the pull request. The human designs the quality constraints — the guardrails, the validation boundaries, the acceptability criteria. You cannot test for correctness in the traditional sense because the output is stochastic. There is no deterministic specification for what the model will produce. Instead, you test for acceptability boundaries. Outputs must stay within constraints, must pass validations, must not violate safety conditions.
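Testing for acceptability boundaries looks different from testing for correctness, and the difference is easy to show. In the sketch below, which uses made-up checks purely for illustration, no assertion ever compares the output to one expected string; instead, each sampled output is checked against constraints, and the system is gated on the fraction that pass.

```python
# Sketch of acceptability-boundary testing for stochastic output.
# The individual checks are illustrative assumptions.
def within_boundaries(output: str) -> bool:
    checks = [
        len(output) <= 500,                  # length constraint
        "DROP TABLE" not in output.upper(),  # crude safety guardrail
        output.strip() != "",                # non-empty
    ]
    return all(checks)


def acceptability_rate(samples: list[str]) -> float:
    """Fraction of sampled outputs inside the boundary. Gate on a
    threshold (e.g. >= 0.95) instead of demanding determinism."""
    return sum(within_boundaries(s) for s in samples) / len(samples)
```

The threshold is the design decision: it encodes how much stochastic failure the surrounding system is built to absorb.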
The developer used to be the author of the code. Now the developer is the editor of the system’s output.
That is more specific and more disorienting than the generic observation that “the developer’s role is changing.” The thing most developers were best at — writing code — got automated. The thing they now do — quality engineering of stochastic systems — did not exist as a discipline eighteen months ago.
I notice this in my own work. The productive sessions are not the ones where the model gets everything right on the first try. They are the ones where the system around the model makes it easy to recover when it does not. That shift — from authoring to editing, from writing to verifying — is already measurable. Claude Code now authors roughly four percent of all public GitHub commits, about 135,000 per day. The inversion is already the default at scale.
What comes after harnesses
If the pattern holds, something will eventually do the same to harnesses.
I think there are two candidates for the next bottleneck.
The first is emergent behavior in multi-agent systems. A single agent with a well-designed harness is predictable. Multiple agents with well-designed harnesses interacting with each other can produce emergent behavior that none of the individual harnesses anticipated. This is the distributed systems problem — individual components can each be correct while the system as a whole fails in unexpected ways. We have faced that problem before in software engineering and partially solved it with consensus protocols, event sourcing, and saga patterns. We will need analogous patterns for multi-agent AI systems.
Anthropic is already building in this direction — Claude Opus 4.6 shipped with “agent teams” that split tasks and coordinate in parallel, and the Model Context Protocol has become the standard interface for connecting agents to external tools. The infrastructure for multi-agent coordination is arriving before the discipline for managing it.
The second candidate is governance. At some point, the harness gets reliable enough that the question stops being technical — “can we make this work?” — and becomes institutional — “how much should we let this run without human oversight?” That is not an engineering question. It is a design question about the boundary between human and machine authority. And it follows a historical pattern: every time we automated a meaningful category of human work — manufacturing, transportation, finance — we eventually had to build regulatory and governance frameworks around the automation. AI agent systems are on the same trajectory. The harness is the technical layer. Governance will be the institutional layer above it.
Both candidates point to the same thing. The bottleneck is moving from technical systems to sociotechnical systems. The next era probably will not be solved by engineers alone.
I do not know which one arrives first, or whether it will be something else entirely that nobody is naming yet. But I have noticed that each stage only becomes visible in retrospect. Whatever comes after harnesses will probably make harness engineering look the way prompt engineering looks now — necessary, but insufficient. A subproblem we mistook for the whole thing.
That seems to be how this works.