AI Engineer World's Fair 2026: Harness Over Model

TL;DR: AI Engineer World's Fair is an annual conference for engineers who build products on top of models rather than train the models themselves. The 2026 schedule had 560 sessions, and almost every recording I could analyze came down to one idea: the bottleneck has moved from the model itself to the tooling around it, meaning traces, sandboxes, evals, boundaries, and UI. Below is a breakdown based on the available recordings: which patterns repeated from talk to talk, how agents are now fixed after they fail, and what to watch if you only have time for five videos.

AI Engineer World's Fair 2026 has already taken place, and I was left with a persistent feeling that I had missed something important: a yearly snapshot of applied AI engineering in one place. Missing it was frustrating, because the official schedule listed 560 sessions. More than five hundred talks, workshops, and keynotes over several days. No one can watch all of that in person, so I decided to put together a Russian-language navigator to understand what actually happened there.

Just a couple of years ago, the main question was "when will the model get smarter," and everything depended on the answer. Now the models have become good enough to entrust them with almost everything, and suddenly it turned out that intelligence is no longer what breaks. What breaks is the environment around it: how to give an agent tools, where to set boundaries, how to record every step, how to reproduce a failure, and how to understand whether it has actually improved. That is what the whole conference was about. Not about the model, but about its harness.

How I analyzed it (and why this is already part of the conclusion)

Watching 560 sessions in a row makes no sense, so I built an agentic pipeline. It pulls the official schedule, YouTube metadata, subtitles for the available recordings, and runs each one through summarization, storing everything in a single registry with topics, links, and timestamps.

Out of the 560 sessions in the schedule, 82 unique videos were publicly available on YouTube at the time of collection. I also broke down three multi-hour main-stage streams into 55 thematic segments. The remaining sessions, roughly 480 of the 560, I could not get as separate recordings at all: they simply are not publicly available yet. So this should not be considered a complete archive of the conference. It is a navigator for the available part.

There were plenty of pitfalls. For one talk about an RL agent for ETL pipelines, the subtitle request returned HTTP 429, so I had to download the audio, run it through a local whisper.cpp, and flag that transcript as fallback quality. A small thing, but a telling one: even just to study a conference about agents, I needed my own little harness with error handling and fallback paths.
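The fallback path can be sketched roughly like this. It is a minimal illustration, not the pipeline's actual code: `RateLimited`, `Transcript`, `get_transcript`, and the `quality` field are hypothetical names, and the two fetchers are passed in as plain callables so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error raised when the subtitle endpoint answers HTTP 429."""


@dataclass
class Transcript:
    video_id: str
    text: str
    quality: str  # "subtitles" or "fallback"


def get_transcript(
    video_id: str,
    fetch_subtitles: Callable[[str], str],
    transcribe_locally: Callable[[str], str],
) -> Transcript:
    """Try published subtitles first; on rate limiting, fall back to local speech-to-text."""
    try:
        return Transcript(video_id, fetch_subtitles(video_id), quality="subtitles")
    except RateLimited:
        # Assumed fallback: download the audio and run a local model such as whisper.cpp.
        return Transcript(video_id, transcribe_locally(video_id), quality="fallback")
```

Keeping the quality flag next to the text means a later summary can be treated with more caution when it came from the fallback path.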

I published everything I managed to assemble separately: a Russian-language navigator in a GitHub repository and an SPA on top of it on GitHub Pages. It contains Russian summaries, a topic map, a watchlist, and links to the originals. It is a navigator for the available materials, not a replacement for the originals. Further down, I will refer to specific talks; all of them can be opened and checked.

An agent is an execution system, not a "model with tools"

The most common thesis of the conference was this: an agent is not an LLM that was given a couple of functions; it is an execution system. The model suggests the next step, while the platform around it verifies, applies, and records it. An agent has state, rules, an action log, constraints, failure recovery, and tests. The model is one component among many here, even if it is the central one.
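To make the "suggest, verify, apply, record" split concrete, here is a minimal sketch. The `Step` and `Harness` names and the allow-list check are my own illustrative assumptions, not an API from any of the talks.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    """What the model proposes: a tool name and its arguments."""
    tool: str
    args: dict[str, Any]


@dataclass
class Harness:
    tools: dict[str, Callable[..., Any]]
    allowed: set[str]
    log: list[dict[str, Any]] = field(default_factory=list)

    def run_step(self, step: Step) -> Any:
        # Verify: the platform, not the model, decides whether the step may run.
        if step.tool not in self.allowed or step.tool not in self.tools:
            self.log.append({"tool": step.tool, "args": step.args, "status": "rejected"})
            raise PermissionError(f"tool not permitted: {step.tool}")
        # Apply: execute the tool with the proposed arguments.
        result = self.tools[step.tool](**step.args)
        # Record: every applied step lands in the action log.
        self.log.append({"tool": step.tool, "args": step.args, "status": "ok", "result": result})
        return result
```

The model never touches a tool directly here; it only produces a `Step`, and a rejected step still leaves a trace in the log.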

Tooling has come to the forefront not instead of model progress, but because of it. As long as the model made mistakes on every other step, it was too early to argue about logs and boundaries. Once it became reliable enough to be trusted with real actions, the question shifted: now what matters is not whether it will get even smarter, but whether we can explain, reproduce, and constrain what it already does.

This is formulated best in the talk with the telling title What if the harness mattered more than the model?: the leverage for quality is increasingly not in the model weights, but in the harness around it. A conversation about deterministic infrastructure for agents points in the same direction. And on the main stage, a separate segment examined the idea of separating the task from the model: the platform describes and controls the task, while the model only performs its part. This was a segment inside a long stream, without separate YouTube chapters, so I am giving a timestamped link in the stream rather than treating it as an exact quote.

If you keep only one idea from the conference in mind, make it this one. Everything else is a special case of it.

Failed in prod? Show the receipt

The first special case is an unpleasant one. An agent did something in production, everything broke, and you are standing in front of the logs trying to understand what happened. Saying "the model hallucinated" is not enough. You need to know what the agent saw, which tools it called, why it chose that particular action, and how to reproduce the failure now.

The speakers use a simple metaphor: agents need receipts. Like a store receipt, but for every action: what was called, with which inputs, and what was confirmed. The talk Agents Need Receipts is exactly about a verifiable trail instead of "I think I called the right tool."
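A minimal sketch of what such a receipt could look like. The field names and the hash chaining between receipts are my illustrative assumptions, not details taken from the talk.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any


def _digest(value: Any) -> str:
    """Stable SHA-256 of any JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class Receipt:
    tool: str
    inputs: dict[str, Any]
    output_digest: str
    prev: str  # digest of the previous receipt, "" for the first one

    def digest(self) -> str:
        return _digest([self.tool, self.inputs, self.output_digest, self.prev])


def issue(chain: list[Receipt], tool: str, inputs: dict[str, Any], output: Any) -> Receipt:
    """Append a receipt for one tool call to the chain."""
    prev = chain[-1].digest() if chain else ""
    receipt = Receipt(tool, inputs, _digest(output), prev)
    chain.append(receipt)
    return receipt


def verify(chain: list[Receipt]) -> bool:
    """True if every receipt points at the digest of the one before it."""
    prev = ""
    for receipt in chain:
        if receipt.prev != prev:
            return False
        prev = receipt.digest()
    return True
```

Because each receipt includes the digest of the previous one, editing an earlier entry after the fact makes `verify` fail for everything that follows it.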

The talk Your Agent Failed in Prod. Good Luck Reproducing It also hits a popular misconception. Many people believe that setting temperature=0 is enough to make an agent reproducible. It is not. Reproducibility does not come from zero temperature, but from record/replay: you record the entire run, then stub the LLM nodes and run the tools again. A production incident becomes a test that protects you from the same failure in the future.
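Record/replay can be sketched in a few lines. This is a toy one-step agent under my own assumptions (`run_agent`, `Recorder`, and `Replayer` are hypothetical names), not the implementation from the talk: the live model is wrapped during the real run, and during replay a stub returns the recorded completions while the tools execute again.

```python
from typing import Callable

LLM = Callable[[str], str]
Tools = dict[str, Callable[[str], str]]


def run_agent(llm: LLM, tools: Tools, task: str) -> str:
    """A toy one-step agent: the model names a tool, the harness calls it."""
    tool_name = llm(task)
    return tools[tool_name](task)


class Recorder:
    """Wraps a live model and remembers every (prompt, completion) pair."""

    def __init__(self, llm: LLM) -> None:
        self.llm = llm
        self.calls: list[tuple[str, str]] = []

    def __call__(self, prompt: str) -> str:
        completion = self.llm(prompt)
        self.calls.append((prompt, completion))
        return completion


class Replayer:
    """Stub model: returns recorded completions in order, never calls the real one."""

    def __init__(self, calls: list[tuple[str, str]]) -> None:
        self._calls = iter(calls)

    def __call__(self, prompt: str) -> str:
        recorded_prompt, completion = next(self._calls)
        if prompt != recorded_prompt:
            raise ValueError("replay diverged from the recorded run")
        return completion
```

Running `run_agent(Replayer(recorder.calls), tools, task)` inside a test gives the same model decisions as the incident, so any difference in the result comes from the tools or the harness rather than from sampling.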

And the talk The Log Is The Agent takes the idea all the way: the event log is not a byproduct of the agent's work; it is its foundation. The architecture starts with what and how you record, while the prompt comes later.
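One way to read "the log is the foundation" is as event sourcing: state is never stored directly, it is rebuilt by folding the log. The sketch below assumes that reading; the event types and state fields are hypothetical.

```python
from functools import reduce
from typing import Any

Event = dict[str, Any]
State = dict[str, Any]


def apply_event(state: State, event: Event) -> State:
    """Pure transition: old state plus one event gives a new state."""
    if event["type"] == "tool_called":
        return {**state, "calls": state["calls"] + 1}
    if event["type"] == "fact_learned":
        return {**state, "facts": {**state["facts"], event["key"]: event["value"]}}
    if event["type"] == "finished":
        return {**state, "done": True}
    return state  # unknown events are ignored, not fatal


def rebuild(log: list[Event]) -> State:
    """Derive the agent's current state from the event log alone."""
    initial: State = {"calls": 0, "facts": {}, "done": False}
    return reduce(apply_event, log, initial)
```

Truncating the log to any prefix and calling `rebuild` shows what the agent knew at that point, which is what makes a failure inspectable after the fact.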

A hundred tools in the prompt is not a superpower

There is a temptation to think that the more tools an agent has, the more powerful it is. Put a hundred functions into the prompt and let it choose. In practice, that gives you an agent that gets confused in its own arsenal and makes more and more mistakes when choosing.

The talk with the straightforward title The 100-Tool Agent Is a Trap shows why a bloated agent fails and what to do instead. The recipe looks like ordinary search: tools are placed in an index, relevant ones are searched for a specific request, and only those are loaded. Not the entire toolbox in every request, but semantic routing and loading on demand.
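A minimal sketch of the index-then-load idea. To keep it dependency-free, it uses bag-of-words cosine similarity as a stand-in for a real embedding model; `ToolIndex` and `route` are hypothetical names, and a production router would swap in proper embeddings.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToolIndex:
    def __init__(self, descriptions: dict[str, str]) -> None:
        # Index each tool once by its description.
        self._vectors = {name: _vector(text) for name, text in descriptions.items()}

    def route(self, request: str, k: int = 2) -> list[str]:
        """Return the names of the k tools most similar to the request."""
        query = _vector(request)
        ranked = sorted(
            self._vectors,
            key=lambda name: _cosine(query, self._vectors[name]),
            reverse=True,
        )
        return ranked[:k]
```

Only the definitions of the tools returned by `route` would then be placed in the prompt for that request; the rest of the toolbox stays in the index.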

Next to it is the talk Skills are the New SDKs, and it is especially close to me. Skills are becoming what SDKs used to be: they need to be indexed, versioned, tested, and executed in a controlled environment. In other words, agent skills are treated as normal software assets, not as a list of spells in a system prompt.

Evaluation lives in production, not on a slide

Next, the conference goes after benchmarks, the industry's favorite sore spot. A single pretty number on a leaderboard stopped guaranteeing anything long ago.

In the talk Production Evals For Agentic AI Systems, evaluation is treated as a production loop. You need to measure scenario outcomes: whether the agent reached the goal, how successfully it called tools, how often it escalated, where it violated safety, how much it cost, and how it recovered after a failure. This is no longer "76% on a benchmark," but a set of signals that show whether you have a living product or not.
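As a rough sketch, the signals listed above can be aggregated from per-run records like this. The `Run` fields and metric names are my assumptions for illustration, not a schema from the talk.

```python
from dataclasses import dataclass


@dataclass
class Run:
    goal_reached: bool
    tool_calls: int
    tool_errors: int
    escalated: bool
    safety_violation: bool
    cost_usd: float


def summarize(runs: list[Run]) -> dict[str, float]:
    """Aggregate per-scenario outcomes into a handful of production signals."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    calls = sum(r.tool_calls for r in runs)
    errors = sum(r.tool_errors for r in runs)
    return {
        "goal_rate": sum(r.goal_reached for r in runs) / n,
        "tool_success_rate": (1 - errors / calls) if calls else 1.0,
        "escalation_rate": sum(r.escalated for r in runs) / n,
        "safety_violations": float(sum(r.safety_violation for r in runs)),
        "cost_per_run_usd": sum(r.cost_usd for r in runs) / n,
    }
```

The point of returning several numbers instead of one score is that a regression in cost or escalations stays visible even when the goal rate looks fine.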

The reverse side of the same problem appears in the talk with the sad title User Signal Dies at the Retrieval Boundary. The quality signal dies at the retrieval boundary: the user marks an answer as useless, that lands in the trace, but retrieval on the next request pulls up the exact same irrelevant document again, because the evaluation never reached it. If traces and evaluations remain a pretty dashboard from which search learns nothing, the same mistake repeats on every run.
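Here is a toy sketch of feedback that does cross the retrieval boundary. It assumes a fixed penalty per downvote and a single static score per document, both simplifications of mine; `FeedbackAwareRetriever` is a hypothetical name.

```python
from collections import defaultdict


class FeedbackAwareRetriever:
    """Toy retriever whose ranking is adjusted by user feedback on past answers."""

    def __init__(self, base_scores: dict[str, float], penalty: float = 0.5) -> None:
        self._base = base_scores      # doc_id -> relevance score from search
        self._penalty = penalty       # assumed fixed penalty per downvote
        self._downvotes: dict[str, int] = defaultdict(int)

    def record_feedback(self, doc_id: str, useful: bool) -> None:
        # The signal changes retrieval itself, not just a dashboard.
        if not useful:
            self._downvotes[doc_id] += 1

    def top(self, k: int = 1) -> list[str]:
        def score(doc_id: str) -> float:
            return self._base[doc_id] - self._penalty * self._downvotes[doc_id]

        return sorted(self._base, key=score, reverse=True)[:k]
```

In a real system the base score would depend on the query and the penalty would likely decay over time, but the shape is the same: the trace has to feed something the retriever actually reads.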

Model output is not yet an interface

Many agentic products stumble over the same thing: the layer between the monitor and the chair.

The talk Agent Output Is Not UX says it directly: raw model output is not an interface yet. Users need a layer on top: state, undo, a clear display of what the agent did, and control over its actions. In The UX of AI, this is unpacked into concrete requirements for products with documents and files: guided workflows, sources, side panels, undo/redo.

A special mention goes to my favorite title from the entire conference, Browser Agents Don't Need Better Models. They Need Better Eyes. Browser agents do not need a bigger model. They need proper vision: a compact structural representation of the page instead of a wall of screenshots, diffs between states, and feedback that an action failed. And there was also a very down-to-earth but important talk, Your Agents Need a Save Button: a save button for an agent is not a tiny UI detail, but a way to control the state of long-running work.
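The "diffs between states" part of browser vision can be sketched as follows. The snapshot format (element id mapped to a small attribute dict) is my assumption for illustration, not the representation used in the talk.

```python
from typing import Any

Snapshot = dict[str, dict[str, Any]]  # element id -> attributes


def diff_snapshots(before: Snapshot, after: Snapshot) -> dict[str, Any]:
    """Describe what changed between two compact page snapshots."""
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = sorted(
        key for key in before.keys() & after.keys() if before[key] != after[key]
    )
    return {
        "added": added,
        "removed": removed,
        "changed": changed,
        # An empty diff after a click is itself feedback: the action did nothing.
        "action_had_effect": bool(added or removed or changed),
    }
```

Feeding the agent this small diff instead of a fresh screenshot after every click keeps the context short and makes a silently failed action explicit.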

None of these talks asks for a better model. All of them ask for better tooling.

Code speed is also debt

The topic that hit me hardest was the speed of code generation, because it cuts directly against the year's main hype.

The story is sold like this: coding agents write code faster, therefore the team works faster, therefore progress. The talk Your Coding Agent Is Creating Review Debt carefully exposes the substitution. Code really is generated faster. But people still have to understand it, review it, and maintain it, and their throughput has not increased. The difference turns into debt. Not technical debt in the old sense, but review debt: a queue of changes that no one has really understood, but that are already in the system.

The thought sounds boring; the consequences are not. If code is written faster than the team can comprehend and verify it, you have not accelerated development. You have shifted the load from writing to review and maintenance and pretended that things got better.
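The arithmetic behind review debt is simple enough to write down. This sketch assumes constant weekly rates, which real teams do not have; the function name and the numbers in the usage note are purely illustrative.

```python
def review_backlog(generated_per_week: int, reviewed_per_week: int, weeks: int) -> list[int]:
    """Unreviewed changes at the end of each week, assuming constant rates."""
    backlog = 0
    history = []
    for _ in range(weeks):
        # The backlog grows by whatever is generated beyond review capacity,
        # and cannot drop below zero.
        backlog = max(0, backlog + generated_per_week - reviewed_per_week)
        history.append(backlog)
    return history
```

With hypothetical rates of 40 generated and 25 reviewed changes per week, `review_backlog(40, 25, 4)` returns `[15, 30, 45, 60]`: the queue grows linearly for as long as generation outpaces review.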

The talk SWE-Marathon: Evaluating Coding Agents at Billion-Token Scale shows that evaluating such agents is itself a difficult task: coding agents have to be run at enormous scale, across billions of tokens, just to see where they break. And on the main stage, a separate segment examined lessons from analyzing one million AI-generated PRs. That, too, was a segment of a long stream without a separate chapter, so I am giving a timestamped link. The scale of one million pull requests already shows that the topic has moved from "maybe we imagined it" into the realm of the measurable.

Where all of this converges

If you put the conclusions together, one shift emerges. The industry is not moving toward "agents will do everything by themselves," but toward the emergence of a separate infrastructure layer on which agents can operate safely and reproducibly. Tooling, traces, sandboxes, permissions, observability, evals, UI, and clear escalation to a human. Agentic products are gradually being designed as distributed systems, not as chat with functions bolted on.

The same recurring patterns, collected into a table:

Harness pattern	What it means	Key talk
Agent as an execution system	The model suggests a step; the platform verifies, applies, and records it	What if the harness mattered more than the model?
Receipts and replay	A receipt for every action; a production incident becomes a test	Your Agent Failed in Prod; Agents Need Receipts
Semantic routing	Not a hundred tools in the prompt, but an index and loading the right ones	The 100-Tool Agent Is a Trap
Production evals	Measure scenario outcomes, not one benchmark number	Production Evals For Agentic AI Systems
Agent UX	A layer above output: state, undo, and "vision" for the browser	Agent Output Is Not UX; Browser Agents Don't Need Better Models
Review debt	Code is generated faster than the team can review it	Your Coding Agent Is Creating Review Debt

After analyzing the conference, the list of talks turned into a list of requirements for my own project: trace/replay by default, not someday later; semantic routing instead of the full toolbox in every request; limits, state saving, permissions, and observability at every step; a separate vision layer for browser and office agents; and calculating cost and risk for every action, not just for the final result. Nothing magical. Boring engineering, which is exactly what separates a working product from a pretty demo.

Where to start if time is short

I will repeat the main idea once more because it is worth it. I built an agentic pipeline to analyze a conference that itself turned out to be about the harness. This is the whole of 2026 for AI engineering: the winner is not the one whose model scored more benchmark points, but the one who can explain, reproduce, constrain, and verify what their agent does.

I analyzed 82 publicly available recordings out of 560 sessions in the schedule. The remaining sessions, roughly 480, did not make it into this breakdown simply because they are not publicly available yet. So this is a view of the available part, not a verdict on the entire conference.

If you want to start watching right now and have time for only five videos, I would start with these: Browser Agents Don't Need Better Models, Your Agent Failed in Prod, The 100-Tool Agent Is a Trap, What if the harness mattered more than the model?, and Skills are the New SDKs. Everything else, with summaries and timestamps, is in the navigator.

Stay curious.

I write about artificial intelligence, language models, and tools for developers. I test models and services on real tasks and share my findings in my Telegram channel.

Frequently asked questions

What is AI Engineer World's Fair?

An annual applied AI engineering conference: it brings together engineers who build products on top of models rather than train the models themselves. It is a yearly snapshot of what is happening in the industry around LLMs.

How many talks were there at AI Engineer World's Fair 2026, and are all of them available?

The official schedule listed 560 sessions. At the time of the breakdown, 82 recordings were publicly available on YouTube, roughly one in seven sessions. The rest were not publicly available at the time the article was prepared.

What is review debt?

A queue of agent-generated code that no one has really understood. Code is written faster than the team can review and maintain it, and that difference accumulates as debt: the changes are already in the system, but no one has understood them.

Does temperature=0 make an agent reproducible?

No. Zero temperature does not guarantee repeatability. Record/replay does: you record the entire run, then stub the LLM nodes and run the tools again, turning a production incident into a test.

How is an agent different from an "LLM with tools"?

An agent is an execution system, not a model that was given a couple of functions. It has state, rules, an action log, constraints, failure recovery, and tests. The model is one component among many here, even if it is the central one.

Which AI Engineer World's Fair 2026 talks should I watch first?

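If you only have time for five, start with Browser Agents Don't Need Better Models, Your Agent Failed in Prod, The 100-Tool Agent Is a Trap, What if the harness mattered more than the model?, and Skills are the New SDKs.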