We keep coining new terms for the same restless question: how do you make agents actually work? “Vibe coding” gave way to “context engineering,” which gave way to “harness engineering” - each one a correction for what the last one missed. Context engineering focused on what the model sees. Harness engineering expands the frame to what the system prevents, measures, and corrects - the middleware, the verification loops, the observability layer, the feedback cycles. LangChain just published a strong case for it: 13.7 points of improvement on Terminal Bench 2.0, model held constant, all gains from the harness. But Manus has been rewritten five times in six months, and their biggest gains came from removing structure. Vercel stripped 80% of their agent’s tools and got better results. And the Bitter Lesson crowd is watching all of it with a knowing smile, waiting for the next frontier model to flatten everything we’re building today.
LangChain’s coding agent jumped from Top 30 to Top 5 on Terminal Bench 2.0 - a 13.7-point gain - without changing the model. Same GPT-5.2-Codex under the hood. Every improvement came from what they call harness engineering: system prompts, middleware hooks, verification loops, and trace-driven iteration. They published their traces publicly, open-sourced the agent, and laid out the recipe.
It is compelling. But it sits at the center of a tension that the whole agent-building ecosystem is wrestling with right now - and the most interesting version of that tension isn’t “does harness engineering matter?” (it does). It’s this:
LangChain improved by adding structure. Manus improved by removing it. Vercel improved by stripping 80% of their tools. And the Bitter Lesson suggests that all of it - the adding and the removing - might be a footnote in 18 months.
So which is it? And what does it mean for anyone building agents right now?
Let’s get specific, because the details matter more than the framing.
LangChain compressed their optimization to three knobs - system prompt, tools, and middleware - and iterated using an automated trace analyzer. The flow: fetch experiment runs from LangSmith, spawn parallel error analysis agents, synthesize findings into targeted harness changes, repeat. It’s structurally similar to boosting, focused on correcting the mistakes from previous runs.
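That flow can be sketched as a plain loop - the function names, trace shape, and round structure here are illustrative, not LangChain's actual tooling:

```python
from concurrent.futures import ThreadPoolExecutor

def improve_harness(fetch_traces, analyze_error, synthesize, apply_changes,
                    rounds: int = 3) -> list:
    """Boosting-like flywheel: each round analyzes the previous round's
    failures and turns the findings into targeted harness changes."""
    history = []
    for _ in range(rounds):
        traces = fetch_traces()                        # e.g. experiment runs from LangSmith
        failures = [t for t in traces if not t["passed"]]
        if not failures:                               # nothing left to correct
            break
        with ThreadPoolExecutor() as pool:             # parallel error-analysis "agents"
            findings = list(pool.map(analyze_error, failures))
        changes = synthesize(findings)                 # cluster findings into harness edits
        apply_changes(changes)                         # edit prompt / tools / middleware
        history.append(changes)
    return history
```

The boosting analogy is literal here: each iteration is fit against the residual errors of the last.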
The middleware architecture is the core of the harness. Every middleware implements a protocol with hooks that intercept the agent loop at key points:
This is the surface area. Everything LangChain added to improve their Terminal Bench score is a middleware that hooks into one of these three points. The agent itself is assembled from a stack of these middlewares:
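To make the composition concrete, here is an illustrative agent loop (not LangChain's actual implementation) that threads every step through a stack of middleware objects exposing those three hooks:

```python
def run_agent(model, middlewares, state, tools=None, max_steps=20):
    """Illustrative loop: every step passes through the full middleware stack."""
    tools = tools or {}
    for _ in range(max_steps):
        for mw in middlewares:                       # context-shaping hooks
            update = mw.before_model(state)
            if update:
                state.update(update)
        state["last_action"] = model(state)          # the LLM proposes the next action
        for mw in middlewares:                       # inspection / veto hooks
            update = mw.after_model(state)
            if update:
                state.update(update)
        action = state["last_action"]
        if action.get("done") and not state.pop("force_continue", False):
            state["done"] = True                     # no middleware vetoed the exit
            return state
        if "tool" in action:                         # wrap_tool_call layers nest like an onion
            handler = lambda call: tools[call["tool"]](**call.get("args", {}))
            for mw in reversed(middlewares):
                handler = (lambda h, m: lambda call: m.wrap_tool_call(call, h))(handler, mw)
            state.setdefault("observations", []).append(handler(action))
    return state
```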
That’s the whole pattern. Each middleware is a pluggable component - add it, remove it, swap it. The harness is a composition of these layers.
The improvements that moved the needle: The PreCompletionChecklistMiddleware intercepts the agent before exit and forces a verification pass against the original task spec - not against the agent’s own code. This addresses what was their most common failure: agents writing a solution, re-reading it, going “looks good,” and stopping. No tests. No adversarial self-examination.
The LoopDetectionMiddleware uses wrap_tool_call to track per-file edit counts. After N edits to the same file, it injects context like “you’ve edited this file 8 times - step back and reconsider your approach.”
The “reasoning sandwich” sits above the middleware layer: allocate high reasoning effort for up-front planning, drop to a leaner setting for the bulk of execution, then raise it again for final verification - spending compute where the decisions are hardest.
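A toy allocation function makes the shape of the sandwich explicit - the thresholds, effort levels, and signature here are all illustrative assumptions, not LangChain's actual policy:

```python
def reasoning_effort(step: int, attempting_exit: bool, planning_steps: int = 3) -> str:
    """Reasoning-sandwich sketch: think hard at the edges, run lean in the
    middle. Thresholds and effort labels are illustrative."""
    if step < planning_steps:     # up-front planning slice
        return "high"
    if attempting_exit:           # final verification slice
        return "high"
    return "medium"               # the filling: bulk execution
```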
Together these pushed the score from 52.8% to 66.5%. The trace analysis loop that powers the iteration cycle is particularly high-leverage - most teams do this manually and inconsistently.
Manus has been rewritten five times in six months. Five architectures. Same models. And according to Peak Ji, their biggest performance gains didn’t come from adding middleware, loop detectors, or verification hooks. They came from removing things.
They removed complex tool definitions in favor of general shell execution. They removed “management agents” in favor of simple structured handoffs. They moved from an anthropomorphized org chart of specialized sub-agents to a flat Agent-as-a-Tool pattern where the main model invokes sub-agents like function calls.
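The Agent-as-a-Tool pattern is simple enough to sketch in a few lines - this is a generic illustration, not Manus's code, and the helper name is hypothetical:

```python
def as_tool(subagent, name: str, description: str):
    """Flat Agent-as-a-Tool pattern: a sub-agent exposed as a plain function
    the main model can invoke, instead of a managed node in an org chart."""
    def tool(task: str) -> str:
        return subagent(task)                # structured handoff in, result out
    tool.__name__ = name
    tool.__doc__ = description
    return tool
```

The main loop then sees the sub-agent as just another tool call - no manager agent, no routing layer.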
One adds layers. The other strips them. Both improved.
Vercel tells a similar story. They started with comprehensive tool libraries - every capability they could think of. Results were terrible. They stripped to essentials, removed 80% of their tools, and agents became faster and more reliable with fewer choices.
Phil Schmid puts it sharply: “If your harness is getting more complex while models improve, you are likely over-engineering.”
Rich Sutton’s Bitter Lesson argues that general methods leveraging computation beat hand-coded human knowledge every time. Every middleware hook, every verification loop, every loop detector is a piece of hand-coded human knowledge about how agents fail. It works today. The question is whether you’re building investment or building debt.
Here’s where something interesting happens: across every approach - LangChain’s additive strategy, Manus’s subtractive philosophy, the Bitter Lesson skeptics - there’s convergence on one point. Self-verification matters.
LangChain’s most impactful change was forcing verification against the original spec before exit. This isn’t just a middleware trick. It’s addressing something fundamental about how current models work. They’re biased toward their first plausible solution. Left alone, they’re one-shot artists - confident, fast, and frequently wrong at the margins.
The piece of LangChain’s work that deserves more attention than the individual middleware components is the meta-process: automated trace analysis driving iterative harness improvement. Fetch experiment traces. Spawn parallel error analysis agents. Synthesize findings. Make targeted changes. Repeat. This is a flywheel. And it’s model-agnostic in a way that individual middleware hooks aren’t.
If we’re honest about the shelf life problem, the interesting exercise is sorting harness engineering techniques into two buckets: things that address fundamental properties of autonomous systems, and things that work around current model limitations.
Bucket one (likely persists): observability and tracing, verification against external specs, environment context injection, reasoning about resource budgets. These aren’t model-specific. Any autonomous system operating in the real world needs to know its environment, verify its outputs against external criteria, and manage finite resources.
Bucket two (likely dissolves): loop detection heuristics, forced exit-before-completion hooks, reasoning sandwich allocations, model-specific prompt patterns. These are patches for today’s behavioral gaps.
Three mental models are competing right now. They’re all partially right and the synthesis doesn’t exist yet.
The LangChain model: Intelligence is spiky. The harness smooths the spikes. Add structure and you can extract dramatically more value from the same model.
The Manus model: Structure is debt. Every piece of harness logic encodes assumptions about current model limitations. The winning move is subtraction.
The Bitter Lesson model: None of this will matter in two years. Build thin, disposable scaffolding. Invest in data capture and model-agnostic infrastructure.
My read: the synthesis is sequencing. You add structure to ship today. You design that structure to be removable. You instrument everything so the failure data flows back into the system. And you hold your harness loosely enough that when the model no longer needs the training wheels, you can rip them off without rebuilding the bike.
LangChain’s trace analysis loop is the part of their work most likely to compound. The individual middleware components are the part most likely to dissolve. And the published trace dataset might end up being more valuable to the field than the harness itself.
The question isn’t whether to build the scaffolding. It’s whether you remember it’s scaffolding.