We keep minting new terms for the same restless question: how do you make agents actually work? "Vibe coding" gave way to "context engineering" which gave way to "harness engineering" - each one a correction for what the last one missed. Context engineering focused on what the model sees. Harness engineering expands the frame to what the system prevents, measures, and corrects - the middleware, the verification loops, the observability layer, the feedback cycles. LangChain just published a strong case for it: 13.7 points of improvement on Terminal Bench 2.0, model held constant, all gains from the harness. But Manus has been rewritten five times in six months and their biggest gains came from removing structure. Vercel stripped 80% of their agent's tools and got better results. And the Bitter Lesson crowd is watching all of it with a knowing smile, waiting for the next frontier model to flatten everything we're building today. The terminology keeps evolving because nobody's settled the underlying question. And right now, the most interesting version of that question isn't "does harness engineering matter?" - it's whether the right move is to build more of it or less.
LangChain's coding agent jumped from Top 30 to Top 5 on Terminal Bench 2.0 - a 13.7-point gain - without changing the model. Same GPT-5.2-Codex under the hood. Every improvement came from what they call harness engineering: system prompts, middleware hooks, verification loops, and trace-driven iteration. They published their traces publicly, open-sourced the agent, and laid out the recipe.
It is... compelling. But it sits at the center of a tension the whole agent-building ecosystem is wrestling with right now - not "does harness engineering matter?" (it does), but this:
LangChain improved by adding structure. Manus improved by removing it. Vercel improved by stripping 80% of their tools. And the Bitter Lesson suggests that all of it - the adding and the removing - might be a footnote in 18 months.
So which is it? And what does it mean for anyone building agents right now?
What LangChain Actually Built
Let's get specific, because the details matter more than the framing.
LangChain compressed their optimization to three knobs - system prompt, tools, and middleware - and iterated using an automated trace analyzer. The flow: fetch experiment runs from LangSmith, spawn parallel error analysis agents, synthesize findings into targeted harness changes, repeat. It's structurally similar to boosting, focused on correcting the mistakes from previous runs.
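The shape of that loop is easy to sketch. The following is a hypothetical illustration of the cycle, not LangSmith's actual API - `analyze_trace`, `synthesize`, and `improvement_cycle` are stand-in names, and the failure taxonomy is invented for the example:

```python
# Hypothetical sketch of a trace-driven improvement loop. Function names and
# the failure taxonomy are illustrative, not LangSmith's actual API.

def analyze_trace(trace: dict) -> str:
    """Stand-in for an error-analysis agent: classify one run's failure mode."""
    if trace.get("edits_to_same_file", 0) > 5:
        return "doom_loop"
    if not trace.get("ran_tests", False):
        return "skipped_verification"
    return "other"

def synthesize(findings: list[str]) -> dict:
    """Aggregate per-trace findings into a prioritized change list."""
    counts: dict[str, int] = {}
    for f in findings:
        counts[f] = counts.get(f, 0) + 1
    # Most frequent failure modes first - those get the next harness change.
    return dict(sorted(counts.items(), key=lambda kv: -kv[1]))

def improvement_cycle(traces: list[dict]) -> dict:
    """One iteration: analyze every trace in parallel, synthesize, repeat."""
    findings = [analyze_trace(t) for t in traces]
    return synthesize(findings)
```

The real version spawns LLM agents for the analysis step, but the control flow is this simple: classify failures, rank them, fix the top of the list.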
The middleware architecture is the core of the harness. Every middleware implements a protocol with hooks that intercept the agent loop at key points:
```python
class AgentMiddleware(Protocol):
    def before_agent(self, state: AgentState) -> AgentState:
        """Runs once at initialization to populate state."""
        ...

    def wrap_model_call(self, call_model, state: AgentState):
        """Intercepts every LLM invocation."""
        ...

    def wrap_tool_call(self, request, handler):
        """Intercepts tool execution to transform inputs/outputs."""
        ...
```

This is the surface area. Everything LangChain added to improve their Terminal Bench score is a middleware that hooks into one of these three points. The agent itself is assembled from a stack of these middlewares:
```python
from deepagents import create_deep_agent

agent = create_deep_agent(
    model="openai:gpt-5.2-codex",
    middleware=[
        LocalContextMiddleware(),              # scans cwd, discovers tools on startup
        PreCompletionChecklistMiddleware(),    # forces verification before exit
        LoopDetectionMiddleware(max_edits=5),  # nudges after N edits to same file
    ],
)
```

That's the whole pattern. Each middleware is a pluggable component - add it, remove it, swap it. The harness is a composition of these layers.
The improvements that moved the needle: The PreCompletionChecklistMiddleware intercepts the agent before exit and forces a verification pass against the original task spec - not against the agent's own code. This addresses what was their most common failure: agents writing a solution, re-reading it, going "looks good," and stopping. No tests. No adversarial self-examination.
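A minimal sketch of how such a hook could work, assuming the middleware protocol above - this is my illustration of the idea, not LangChain's actual implementation, and the checklist prompt text is invented:

```python
# Illustrative pre-exit verification hook. The prompt text and state keys
# are assumptions for the sketch, not LangChain's actual implementation.

CHECKLIST_PROMPT = (
    "Before finishing, verify your work against the ORIGINAL task spec, "
    "not your own code: did you run the tests? Does every requirement pass?"
)

class PreCompletionChecklist:
    def wrap_model_call(self, call_model, state):
        response = call_model(state)
        # If the model wants to stop but hasn't done a verification pass,
        # inject the checklist and force one more round.
        if response.get("done") and not state.get("verified"):
            state["messages"].append({"role": "system", "content": CHECKLIST_PROMPT})
            state["verified"] = True
            response = call_model(state)
        return response
```

The key design point: the checklist references the original spec, not the agent's own output, so the verification pass can't just rubber-stamp the first draft.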
The LoopDetectionMiddleware uses wrap_tool_call to track per-file edit counts. After N edits to the same file, it injects context like "you've edited this file 8 times - consider stepping back and rethinking your approach." This targets doom loops - agents making tiny variations to the same broken approach 10+ times.
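A sketch of that mechanism, assuming the hook signature from the protocol above - the tool name, request shape, and nudge text are assumptions for illustration:

```python
# Minimal edit-count loop detection via wrap_tool_call. The "edit_file" tool
# name and request shape are assumed for this sketch.

class LoopDetection:
    def __init__(self, max_edits: int = 5):
        self.max_edits = max_edits
        self.edit_counts: dict[str, int] = {}

    def wrap_tool_call(self, request, handler):
        result = handler(request)
        if request.get("tool") == "edit_file":
            path = request["args"]["path"]
            self.edit_counts[path] = self.edit_counts.get(path, 0) + 1
            if self.edit_counts[path] >= self.max_edits:
                # Nudge, don't block: the agent decides what to do with it.
                result["injected_context"] = (
                    f"You've edited {path} {self.edit_counts[path]} times - "
                    "consider stepping back and rethinking your approach."
                )
        return result
```

Note that it nudges rather than hard-stops - the harness surfaces the signal, the model keeps the autonomy.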
The "reasoning sandwich" sits above the middleware layer - it's a compute allocation strategy. xhigh reasoning for planning, high for implementation, xhigh for verification:
```
Phase 1 (Planning):       xhigh reasoning → "fully understand the problem"
Phase 2 (Implementation): high reasoning  → "execute the plan"
Phase 3 (Verification):   xhigh reasoning → "catch mistakes before submitting"
```

Running at xhigh everywhere scored 53.9% (timeouts). Running at high everywhere scored 63.6%. The sandwich pushed to 66.5% - though LangChain notes the differences across reasoning splits weren't large.
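In code, the allocation strategy reduces to a phase-detection heuristic mapped to effort levels. The phase-detection logic and state keys below are assumptions for the sketch, not LangChain's actual mechanism:

```python
# Sketch of phase-based reasoning allocation. The phase-detection heuristic
# and state keys are illustrative assumptions.

PHASE_EFFORT = {
    "planning": "xhigh",
    "implementation": "high",
    "verification": "xhigh",
}

def effort_for(state: dict) -> str:
    """Pick a reasoning-effort level based on the agent's current phase."""
    if not state.get("plan"):
        return PHASE_EFFORT["planning"]        # no plan yet: think hard
    if state.get("implementation_complete"):
        return PHASE_EFFORT["verification"]    # done building: think hard again
    return PHASE_EFFORT["implementation"]      # mid-build: move fast
```

The point isn't the specific thresholds - it's that compute gets spent where mistakes are cheapest to catch: before the work starts, and before it ships.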
Together these pushed the score from 52.8% to 66.5%. The trace analysis loop that powers the iteration cycle is particularly high-leverage - most teams do this manually and inconsistently.
The Manus Counterargument: Gains Through Subtraction
Manus has been rewritten five times in six months. Five architectures. Same models. And according to Peak Ji, their biggest performance gains didn't come from adding middleware, loop detectors, or verification hooks. They came from removing things.
They removed complex tool definitions in favor of general shell execution. They removed "management agents" in favor of simple structured handoffs. They moved from an anthropomorphized org chart of specialized sub-agents (Researcher, Coder, Writer) to a flat Agent-as-a-Tool pattern where the main model invokes sub-agents like function calls.
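The Agent-as-a-Tool pattern is worth making concrete. The sketch below is my conceptual rendering, not Manus's actual code - `HandoffResult`, `spawn_sub_agent`, and the tool table are all illustrative names:

```python
# Conceptual sketch of the flat "Agent-as-a-Tool" pattern: sub-agents sit in
# the tool table alongside ordinary tools, with no manager hierarchy routing
# between them. All names here are illustrative, not Manus's actual code.

from dataclasses import dataclass

@dataclass
class HandoffResult:
    goal: str
    output: str
    success: bool

def spawn_sub_agent(goal: str) -> HandoffResult:
    """Stand-in for running a full sub-agent loop; returns a structured handoff."""
    return HandoffResult(goal=goal, output=f"completed: {goal}", success=True)

# The main model sees a flat namespace: shell is a tool, research is a tool.
# No Researcher/Coder/Writer org chart, no management agents in between.
TOOLS = {
    "shell": lambda cmd: f"$ {cmd}",
    "research": lambda goal: spawn_sub_agent(goal),
}
```

The structural point: delegation becomes a function call with a typed return value, so the main model reasons about sub-agents the same way it reasons about any other tool.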
The contrast in code tells the story. LangChain's harness assembly:
```python
agent = create_deep_agent(
    model="openai:gpt-5.2-codex",
    middleware=[
        LocalContextMiddleware(),
        PreCompletionChecklistMiddleware(),
        LoopDetectionMiddleware(max_edits=5),
        TimeBudgetMiddleware(warn_at_pct=0.8),
    ],
)
```

Manus's direction of travel (conceptually):
```python
# v1: 12 specialized tools, 3 management agents, complex routing
# v2: 8 tools, 2 management agents
# v3: 5 tools, agent-as-a-tool pattern
# v4: shell execution + structured handoffs
# v5: ↓
agent = create_agent(
    model=frontier_model,
    tools=[shell, browser],  # that's it
    sub_agent=lambda goal: spawn(goal, return_structured=True),
)
```

One adds layers. The other strips them. Both improved.
Vercel tells a similar story. They started with comprehensive tool libraries - search, code, file, API tools - every capability they could think of. Results were terrible. Agents got confused, made redundant calls, took unnecessary steps. They stripped to essentials, removed 80% of their tools, and agents became faster and more reliable with fewer choices.
Phil Schmid puts it sharply: "If your harness is getting more complex while models improve, you are likely over-engineering."
The Bitter Lesson Hanging Over Everything
Rich Sutton's Bitter Lesson argues that general methods leveraging computation beat hand-coded human knowledge every time. The history of AI is absolutely littered with carefully engineered systems that got steamrolled by brute-force approaches running on bigger hardware.
This puts harness engineering in an uncomfortable position. Every middleware hook, every verification loop, every loop detector is a piece of hand-coded human knowledge about how agents fail. It works today. The question is whether you're building investment or building debt.
Self-Verification: Where the Perspectives Converge
Here's where something interesting happens: across every approach - LangChain's additive strategy, Manus's subtractive philosophy, the Bitter Lesson skeptics - there's convergence on one point. Self-verification matters.
LangChain's most impactful change was forcing verification against the original spec before exit. This isn't just a middleware trick. It's addressing something fundamental about how current models work. They're biased toward their first plausible solution. They don't have a natural tendency to enter a build-verify loop. Left alone, they're one-shot artists - confident, fast, and frequently wrong at the margins.
The Trace Analysis Loop: The Quietly Revolutionary Part
The piece of LangChain's work that deserves more attention than the individual middleware components is the meta-process: automated trace analysis driving iterative harness improvement.
Fetch experiment traces. Spawn parallel error analysis agents. Synthesize findings. Make targeted changes. Repeat. This is a flywheel. And it's model-agnostic in a way that individual middleware hooks aren't.
The Real Question: What Persists?
If we're honest about the shelf life problem, the interesting exercise is sorting harness engineering techniques into two buckets: things that address fundamental properties of autonomous systems, and things that work around current model limitations.
Bucket one (likely persists): observability and tracing, verification against external specs, environment context injection, reasoning about resource budgets. These aren't model-specific. Any autonomous system operating in the real world needs to know its environment, verify its outputs against external criteria, and manage finite resources.
Bucket two (likely dissolves): loop detection heuristics, forced exit-before-completion hooks, reasoning sandwich allocations, model-specific prompt patterns. These are patches for today's behavioral gaps. Tomorrow's models probably won't need a middleware layer to remind them to test their code. They'll just... test their code.
Where This Leaves Us
Three mental models are competing right now. They're all partially right and the synthesis doesn't exist yet.
The LangChain model: Intelligence is spiky. The harness smooths the spikes. Add structure - verification loops, context injection, loop detection, reasoning budgets - and you can extract dramatically more value from the same model.
The Manus model: Structure is debt. Every piece of harness logic encodes assumptions about current model limitations. As models improve, the winning move is subtraction - fewer tools, simpler patterns, more autonomy for the model. The gains come from getting out of the way.
The Bitter Lesson model: None of this will matter in two years. General methods leveraging computation will beat hand-coded harness logic. Build thin, disposable scaffolding. Invest in data capture and model-agnostic infrastructure. Let the models do the hard work.
My read: the synthesis is sequencing. You add structure to ship today. You design that structure to be removable. You instrument everything so the failure data flows back into the system. And you hold your harness loosely enough that when the model no longer needs the training wheels, you can rip them off without rebuilding the bike.
Here's what this actually looks like. Imagine a team shipping a coding agent in Q1 2026. Month one, their agent keeps editing the same broken file in circles - classic doom loop. They add a loop detection middleware. Score jumps. Ship it. Month two, the agent writes plausible code but never runs the tests. They add a pre-exit verification hook. Another jump. Ship it. Month three, they notice the agent floundering in unfamiliar repos. They add a startup middleware that scans the directory, finds available tools, maps the environment. Better onboarding, fewer wasted cycles. Ship it.
Now it's month four and a new model drops. This one doesn't doom loop - it naturally steps back after two failed attempts. The loop detection middleware fires anyway, and now it's interrupting good behavior. They rip it out. Month five, the model gets better at adaptive reasoning. The reasoning sandwich they carefully tuned is now fighting the model's own internal compute allocation. They rip that out too. Month six, the verification hook still matters. The environment scanner still matters. And the tracing infrastructure that logged every doom loop, every failed verification, every wasted cycle across all six months? That's the most valuable thing they built.
Three middleware components added. Two ripped out. One persisted. The traces from all of them compounded. That's the pattern.
LangChain's trace analysis loop is the part of their work most likely to compound. The individual middleware components are the part most likely to dissolve. And the published trace dataset might end up being more valuable to the field than the harness itself.
That's not a criticism. It's the nature of building on a shifting foundation. You engineer for today's reality while designing for tomorrow's.
The question isn't whether to build the scaffolding. It's whether you remember it's scaffolding.