Presentation: Designing AI Platforms for Reliability: Tools for Certainty, Agents for Discovery

InfoQ, May 27, 2026

TL;DR · AI Summary

Aaron Erickson discusses the evolution of AI workflows, shifting from "vibe checking" to building reliable, multi-agent frameworks.

Key Takeaways

Combine deterministic software guardrails with agentic discovery.
Optimize agent hierarchies.
Leverage time-series foundation models.

Outline

- Introduction: Aaron Erickson introduces the topic and the presentation title, "Tools for Certainty, Agents for Discovery".
- Background: Erickson talks about his experience at Orgspace and why AI is crucial for startups.
- Core Mechanisms: How to build reliable AI platforms by combining deterministic software guardrails with agentic discovery.
- Optimizing Agent Hierarchies: How to optimize agent hierarchies to improve system performance.
- Leveraging Time-Series Foundation Models: How to improve system reliability and accuracy using time-series foundation models.
- Implementing Rigorous Evaluation Pyramids: Why rigorous evaluation pyramids matter for scaling effectively in production environments.

Mindmap

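Designing reliable AI platforms
- Combine deterministic software guardrails with agentic discovery
  - Optimize agent hierarchies
  - Leverage time-series foundation models
- Implement rigorous evaluation pyramids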

Tags: AI, Platform Design, Reliability, Software Guardrails, Agent Discovery

Source URL: https://www.infoq.com/presentations/ai-platforms-reliability/

Published: 2026-05-27T09:04:00+0000

Duration: 52:06

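Summary

Aaron Erickson discusses the evolution of AI workflows, from "vibe checking" to building reliable multi-agent frameworks. He explains how to combine deterministic software guardrails with agentic discovery, optimize agent hierarchies, leverage time-series foundation models, and implement rigorous evaluation pyramids so that architectures scale effectively in production.

Bio

Aaron Erickson founded the Applied AI Lab at NVIDIA's DGX Cloud, which focuses on building foundation models and agentic systems that solve broad industry problems, such as time-series-based anomaly detection. Previously, he held engineering leadership roles at ThoughtWorks and New Relic, and then founded Orgspace.

About the conference

QCon AI is a practitioner-led event focused on the safe engineering discipline needed to scale these workloads. It offers direct access to the architectural guidance and failure metrics that peer organizations use in production.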

Transcript

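Aaron Erickson: I'm Aaron Erickson. We're going to talk about why not everything is AI, but sometimes it is AI. "Tools for Certainty, Agents for Discovery" is the theme of the talk. Who here has ever had this thought: "I know, what if we decided to do this with some AI?" Has anyone had that idea and then decided it was a bad idea? I've had a lot of bad ideas. One of them was, I know, obviously, GraphQL, bad idea. Agile, come on, when was that? We did agile in the 2010s and found out it was a nightmare or whatever. Maybe this will be the new paving stone on this road paved with all kinds of good intentions.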

Background

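To that end, let me talk about one thing I built. Before my current life, I worked at a company called Orgspace. We had an idea: what if you could write software for doing reorgs? Who here has ever done a reorg? Who here has been through one? Who here has had a pleasant experience after going through a reorg? No, nobody has had a pleasant experience after a reorg. This was the software we built to help you do that. Then 2023 happened, and the one constant truth about 2023 was that if you were a startup and wanted to raise another round, you were no longer a startup unless you did AI. How many non-AI startups have been funded since 2023, other than something like a coffee shop? Not many. We had to respond to that.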

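I thought, you know what would be great? We could have ChatGPT do the reorg for us. I know. Some of you may be thinking that Aaron just watched Black Mirror and decided it was a good idea. No, we shouldn't develop products this way. We were desperate. We were running out of money. We figured we would build a plugin. Who remembers ChatGPT plugins? They were like MCP before MCP was cool. The way it worked was, you would just type, "I hear people are flattening their organizations, can you help me? I think we should do engineering." That was it. It gave a great answer. It pulled some data from Orgspace, our reorg software. It came up with a plan, the kind of plan a consulting firm would probably write if you called one, because, as we learned in the keynote, it produces the most middle-of-the-road output you can imagine.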

Who here has ever been through a reorg and wondered whether the plan wasn't the most middle-of-the-road plan you could imagine? This fell squarely into that category. The tool came up with a plan, and then we could say, great, let's actually implement it. When we did, it would create a flattened organizational structure for you. It would literally branch your organization, determine who should move, and place individuals on specific teams based on how you defined the reorg, and it would generate everything. Then you would come back to the software, and it would draft your reorg email for you. Whether you wanted it in iambic pentameter, as a Homeric epic, or as a haiku, you could tailor it accordingly. This was the kind of thing you did in 2023.

GPU Fleet Governance at NVIDIA

Why am I here? It turns out we did not become the future of HR. Everyone can relax; you’re not being reorganized by a robot. You’re being reorganized by a consultant, which is totally not a robot. Yes, I know, it was terrible. I crashed out. I ended up at a chip company somewhere in Santa Clara. I heard they do AI things. It was cool. I saw this headquarters, and I’m like, that’s a headquarters? Isn’t that amazing? What did we do? My first job there wasn’t doing AI. My first job there was building a system to allocate GPUs to initiatives. We have all these internal researchers at NVIDIA, and they all have really cool AI models they’re building, such as Nemotron, BioNeMo, and Cosmos, among other exciting projects.

In the old world, I was building HR software, so we had human resources. We had open positions. We had employees. We had complex hierarchies. Who hasn't seen a complex org chart with multiple lines of reporting? We had performance management, with calibration at the end of the year. It turns out many things in my new world weren't that different. When people request GPUs, it's almost like requesting headcount. In fact, it's often more expensive. If you want 1,000 H100s, that might cost $20 million, $30 million, or $40 million for a month. These are really expensive, more so than headcount in some cases. Idle GPU clusters act like open positions. We can go through the list. You had AI training jobs. You had cloud providers with regions and blocks, which created a complex hierarchy. Then there was performance management, because the GPUs also needed to perform. We had observability. We needed to know when fan failures happened, and so on.

What did I do? I reinvented what I’ve seen before. We built a system called Llo11yPop. You might think its spelling is funny, and you’d be right. You might wonder who named this thing? That was me. Which is why at NVIDIA, I’m never allowed to name things anymore. They took that away from me. I feel so terrible. What it did was actually quite clever. We built a system that would use AI. You’d have these things called retrieval agents. Retrieval agents were designed to do one simple thing: convert a question about something into an API call. You could use Elasticsearch. We used Elasticsearch at the time. We would convert certain types of questions, provide examples, and integrate RAG (Retrieval-Augmented Generation) to form the proper query. We found that if we constrained it, it worked really well.
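A minimal sketch of the "provide examples" step described above, assuming a small store of worked question-to-query examples. The example questions, field names (`state`, `event`, `cluster`, `temperature`), and the token-overlap ranking are all hypothetical stand-ins; the talk does not say how examples were selected. The sketch only assembles the prompt and leaves the model call out.

```python
import re

# Hypothetical worked examples: question -> constrained Elasticsearch-style body.
EXAMPLES = [
    ("how many gpus are idle in cluster a",
     '{"size": 0, "query": {"term": {"state": "idle"}}}'),
    ("which nodes reported a fan failure today",
     '{"query": {"match": {"event": "fan_failure"}}}'),
    ("show the average gpu temperature by cluster",
     '{"size": 0, "aggs": {"by_cluster": {"terms": {"field": "cluster"}, '
     '"aggs": {"avg_temp": {"avg": {"field": "temperature"}}}}}}'),
]

def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_prompt(question, examples=EXAMPLES, k=2):
    """Pick the k most similar examples and place them ahead of the question."""
    ranked = sorted(examples, key=lambda ex: similarity(question, ex[0]), reverse=True)
    shots = "\n".join(f"Q: {q}\nQuery: {body}" for q, body in ranked[:k])
    return (
        "Translate the question into a query. Follow the examples.\n"
        f"{shots}\nQ: {question}\nQuery:"
    )

prompt = build_prompt("How many GPUs are idle in cluster b?")
```

Keeping the examples close to the incoming question is one way to constrain what the model is asked to produce, which is the property the speaker says made retrieval agents work well.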

We then had things called analyst agents. They were built for a different purpose: understanding what kinds of questions you should be able to ask. They might know things like, if I see these conditions on an H100, I should ask this question to gather data to validate it. Today, we might call this a deep agent framework. At the time, we were just a bunch of dummies trying to figure out how to take multiple instances of an LLM (Large Language Model) and have them do useful work against a database. The full vision was, wouldn't it be great if this could just be an autonomous data center? At the time, the thinking was, why don't we just flag every GPU cluster that looks a little weird, let the LLM analyze it for a while, and then raise a Jira ticket or send a Slack message. That's what task agents were for, initially. The big vision was that instead of raising a Jira ticket to tell a human to go do the thing, the system could automatically run some workload that would remediate the problem.

Lessons Learned from the Llo11yPop Project

We didn't get all the way there. Ironically, what we got from the Llo11yPop project was mostly lessons, and I believe we learned a lot. One was that there's a lot of rare context. The question I remember always stumping the system was, "Where are the zombie nodes?" A zombie node in a GPU cluster is one of the groups of eight GPUs that can't connect to the network properly. Usually, it's something like that, or a case where the job continues to run but isn't reporting results correctly. There are many nuanced versions of this. Sometimes the AI would get that right, but most of the time it wouldn't unless you gave it some examples, did a little bit of post-training so it understood the vocabulary, or used a little bit of RAG, graph RAG, or a similar technique. With those, you could actually have it understand the term. No system is going to understand unless you go find all that rare context, all those definitions. We call these semantic layers today. At the time, we didn't have great terms for this. That's one of the biggest lessons.
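One simple way to supply rare context is a glossary lookup that prepends definitions to the question before it reaches the model. The sketch below assumes a hand-maintained dictionary; the `GLOSSARY` contents paraphrase the zombie-node definition from the talk, and the function name and prompt layout are hypothetical.

```python
# Hypothetical glossary of rare, domain-specific terms.
GLOSSARY = {
    "zombie node": (
        "a node (a group of eight GPUs) that cannot connect to the network "
        "properly, or whose job keeps running without reporting results correctly"
    ),
}

def add_rare_context(question, glossary=GLOSSARY):
    """Prepend definitions for any glossary term that appears in the question."""
    q = question.lower()
    hits = [f"- {term}: {meaning}" for term, meaning in glossary.items() if term in q]
    if not hits:
        return question  # nothing rare detected, pass the question through
    return "Definitions:\n" + "\n".join(hits) + "\n\nQuestion: " + question

enriched = add_rare_context("Where are the zombie nodes?")
```

A substring match is the crudest possible lookup; retrieval over an embedding index or a graph would fill the same role for larger vocabularies.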

Another lesson came from text-to-SQL. Who here has ever built a text-to-SQL system? This is the dumb idea that if you just ask a question, the right kind of LLM might be able to write the right kind of SQL to answer it automatically. The first couple of times you do it, you'll run a couple of queries. You'll be almost right, and it's just powerful enough that you're like, I want a system where I can ask an arbitrary question and, if the data can address it, the generated SQL will tell me the answer. It turns out you don't get very good accuracy if you make it do joins, and that's with interns. The AI doesn't get very good results with joins either. All sorts of things don't. That's why we invented ORMs, because we don't like joining things, apparently. That's my angry intern-generated picture. I have 8 interns a year. What we do is flatten the schema. We make the model do really simple things that it understands well: selects, where clauses with some examples, maybe group bys. It can get more complex things correct, but we found that if you keep it really simple, you increase your accuracy significantly. Often this would take us from maybe 70% right to the upper 90s, which for some use cases is good enough.
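To illustrate schema flattening, here is a small sketch using Python's built-in `sqlite3`. The tables, column names, and data are invented for the example. A human writes the join once as a view, and the model-generated SQL (hard-coded here as `generated_sql`) only needs a select, a where clause, and a group by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clusters (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE gpus (id INTEGER PRIMARY KEY, cluster_id INTEGER, model TEXT, state TEXT);
INSERT INTO clusters VALUES (1, 'alpha', 'us-west'), (2, 'beta', 'us-east');
INSERT INTO gpus VALUES (1, 1, 'H100', 'idle'), (2, 1, 'H100', 'busy'),
                        (3, 2, 'H100', 'idle'), (4, 2, 'A100', 'idle');
-- The join is written once by a human; the model only ever sees the flat view.
CREATE VIEW gpu_flat AS
SELECT g.id AS gpu_id, g.model, g.state, c.name AS cluster, c.region
FROM gpus g JOIN clusters c ON c.id = g.cluster_id;
""")

# The kind of simple statement we would want the model to generate.
generated_sql = """
SELECT cluster, COUNT(*) AS idle_gpus
FROM gpu_flat
WHERE state = 'idle'
GROUP BY cluster
ORDER BY cluster
"""
rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('alpha', 1), ('beta', 2)]
```

The design choice is to move the hard part (joins) into vetted SQL and keep the generated part within the narrow range the speaker reports as most accurate.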

We also learned that LLMs, at least at the time, classify better than they code. Granted, this was true as of a couple of years ago, though I think it might still be this way. If you ask an LLM whether something is this, that, or the other, and there are five options, it can do that pretty well. It might not code as well to work out the details. Now they're actually getting pretty good at coding, but that was the situation then. One way we took advantage of this was to say, if the question matches this pattern, say we're just counting GPUs, just run this query in this pattern, and here's where you put the variables. This is almost what we consider the way off this road to hell, and it's one of the topics I really care about: when do you decide to go to a deterministic system? When do you decide to make the AI's job easier and go into a world where it's this query and we know how it works? That's actually pretty important. One of the first things we learned is that off-ramps to determinism really help reliability.
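The off-ramp idea can be sketched as "classify first, then fill a vetted template." In the sketch below, `classify` is a keyword stand-in for the LLM classifier, and the template names, SQL text, and the `gpu_flat` table are hypothetical. Only the variables come from the question; the query shape is fixed.

```python
import re

# Hypothetical vetted templates; only the bound parameters vary per question.
TEMPLATES = {
    "count_gpus": "SELECT COUNT(*) FROM gpu_flat WHERE cluster = ?",
    "list_idle": "SELECT gpu_id FROM gpu_flat WHERE cluster = ? AND state = 'idle'",
}

def classify(question):
    """Stand-in for the LLM classifier: pick one label from a short list."""
    q = question.lower()
    if "how many" in q or "count" in q:
        return "count_gpus"
    if "idle" in q:
        return "list_idle"
    return "unknown"

def plan_query(question):
    """Return (sql, params) for a known pattern, or None if no pattern applies."""
    label = classify(question)
    match = re.search(r"cluster\s+(\w+)", question.lower())
    if label in TEMPLATES and match:
        return TEMPLATES[label], (match.group(1),)  # deterministic off-ramp
    return None  # no pattern matched: fall back to free-form generation

plan = plan_query("How many GPUs are in cluster alpha?")
```

Passing the cluster name as a bound parameter rather than splicing it into the SQL string keeps the deterministic path safe from malformed input as well.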

Another thing we learned, and I think you've probably seen this theme in some of the other talks, where people say you can only have so many tools or so many Cursor rules, is that if you have too many options, particularly options that look similar, the model's ability to classify properly goes down significantly. We started to notice sharply increased error rates when the system had 50 agents to choose from and a lot of them were similar in scope or purpose. It's like opening the menu at Cheesecake Factory. Who here has ever had to look at that menu for 20 minutes to figure out what you want? Most people are like that. That's why restaurants with simpler menus sometimes do better. LLMs suffer from this problem sometimes as well, so it was one of the things we set out to solve.

Another thing that we did was we did these purpose-built agent hierarchies. You're wondering, why is he always talking about org structure? That's a thing. It seems like a thing. In this case, it kind of worked out, where you can have a VP agent that has wide context but isn't great at any particular thing. That's like every VP of engineering. We're not good at anything, but we have a lot of context. We know how to pass context around. That's what we do. You might have a manager agent that's really good at like, how do you ask questions, down to the individual agents. The most important agents in the system are the individual agents that are doing specific tasks, just like in a system, just like in a company. It's not the managers like me that do the important things. It's the ICs. It's the ICs that actually do the real engineering, and so those things have to have a lot of context about how you do any individual task. That's how these two work. Those are some of the lessons.
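A rough sketch of why a hierarchy helps with the "too many options" problem: each level only chooses among a handful of candidates. The agent names, keyword sets, and keyword matching below are hypothetical placeholders for what would be LLM-based routing in a real system.

```python
# Hypothetical hierarchy: each level chooses among a handful of options.
HIERARCHY = {
    "hardware": {"fan": "fan_failure_agent", "thermal": "thermal_agent"},
    "scheduling": {"queue": "queue_depth_agent", "idle": "idle_capacity_agent"},
}
MANAGER_KEYWORDS = {
    "hardware": {"fan", "thermal", "temperature"},
    "scheduling": {"queue", "idle", "job"},
}

def route(question):
    """Return (manager, specialist), (manager, None), or None if nothing fits."""
    words = set(question.lower().replace("?", "").split())
    # "VP" level: wide context, picks a manager by keyword overlap.
    manager = max(MANAGER_KEYWORDS, key=lambda m: len(words & MANAGER_KEYWORDS[m]))
    if not words & MANAGER_KEYWORDS[manager]:
        return None
    # "Manager" level: picks a specialist from its own small team.
    for keyword, agent in HIERARCHY[manager].items():
        if keyword in words:
            return manager, agent
    return manager, None

decision = route("Which clusters show fan problems?")
```

With two managers and two specialists each, no single decision ever has more than two candidates, even though the system as a whole has four specialists.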

One of the other major lessons we relearned was the test pyramid. Who here knows what that is? In the old world, back before AI, back when cavemen ruled the world, you had a test pyramid with end-to-end tests at the top. You had fewer of them, and they were more expensive to run. It's similar here: at the top of the hierarchy, we would have tests that call multiple LLMs and have to get an answer back out of the entire system. Below that, individual LLMs in this multi-LLM architecture only do one thing, so they have simpler tests, and you run those more often. We found that a pyramid of evals, similar to the pyramid of tests in the old world, works really well. This slide is from a blog post we wrote about Llo11yPop, and it was one of the contributions from my team. In fact, I have people from the team here who get very annoyed if we don't have really good evals. You can't vibe test this stuff. Accuracy actually matters. That was one of the first things we realized. It's also how you build your OODA loop-based system, your observe, orient, decide, act system. People are talking about deep agents now, which rediscover some elements of this idea, but actually do it for real. There's a lot you can read about that these days.
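The eval pyramid can be sketched with one scoring function applied at two levels. Everything here is a stand-in: `classify_intent` plays the role of a single-purpose agent, `whole_system` plays the role of the multi-LLM pipeline, and the cases are invented. The point is the shape: many cheap component cases, few expensive end-to-end cases.

```python
def run_evals(cases, system):
    """Return the fraction of (input, expected) cases the system gets right."""
    passed = sum(1 for given, expected in cases if system(given) == expected)
    return passed / len(cases)

# Base of the pyramid: many cheap checks on one single-purpose component.
def classify_intent(question):  # stand-in for one small agent
    return "count" if "how many" in question.lower() else "other"

unit_cases = [
    ("How many GPUs are idle?", "count"),
    ("how many nodes failed", "count"),
    ("Where are the zombie nodes?", "other"),
    ("Summarize last night's alerts", "other"),
]

# Top of the pyramid: few expensive checks through the whole system.
def whole_system(question):  # stand-in for the multi-agent pipeline
    return 2 if classify_intent(question) == "count" else None

end_to_end_cases = [("How many GPUs are idle?", 2)]

report = {
    "unit": run_evals(unit_cases, classify_intent),
    "end_to_end": run_evals(end_to_end_cases, whole_system),
}
```

In practice the unit layer would run on every change and the end-to-end layer less often, mirroring the cost difference the speaker describes.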

Agent Archetypes

Part two is agent archetypes. The first thing I like to ask about a problem is: is it an AI problem, is it an LLM problem, is it some other kind of problem, or should we just write some software? Imagine what you could accomplish if you had a whole army of dumb interns that could do one thing particularly well, and you could scale it up to 1,000 or 10,000. That's one of my favorite use cases. Yes, Michael Burry. If you've seen The Big Short, you might know this line. At the beginning of the movie, Michael Burry tells his associate, "So, I want you to look at all the mortgage bonds." The associate replies, "So, you want to know what the top-selling mortgage bonds are, right?" "No, I want to know what's in each one of them. I want to know if any one of those is risky." If this were done in, say, 2024 or 2025, you might say, instead of the guy looking at Michael Burry in this scene, why don't we have an agentic system look at every single mortgage bond, check it for certain kinds of anomalies, for certain kinds of things that might be wrong, and then just flag those?

That is what I call a worker agent problem. Similar to your task is, paint this design on all the rocks on the beach. It's not a mechanical thing, it has to be a little bit different for every single one, but that little bit of difference is something the LLM can do. The task is fundamentally the same across a very large group of records. There's some problems where this kind of worker agent approach can work really well. I love this kind of thing for, look at all the GPU clusters and find the ones that have patterns of fan failure and give it some ability to be a little bit creative, to be maybe a little bit wrong in its analysis of figuring out things that might be a new failure mode you might discover. Things that look exceptional, maybe you don't know for sure it's wrong, but it's good to check. That's a worker agent type of problem. Go look at all 100,000 clusters, analyze them for whatever issue you care about, maybe have multiple that have different kinds of prompts attached to them or different kinds of ways you might query to look at different aspects of the problem.
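The worker agent pattern is a fan-out: the same task applied independently to every record. The sketch below assumes a per-cluster inspection function; `inspect_cluster` uses a fixed rule as a stand-in for what would be an LLM call with a prompt, and the record layout (`fan_events`, `rpm`) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def inspect_cluster(cluster):
    """Stand-in for one worker agent call (an LLM prompt in practice)."""
    fan_events = cluster["fan_events"]
    suspicious = len(fan_events) >= 3 or any(e["rpm"] == 0 for e in fan_events)
    return {"cluster": cluster["name"], "flag": suspicious}

def inspect_all(clusters, max_workers=8):
    """Run the same inspection over every cluster and return the flagged names."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(inspect_cluster, clusters))  # preserves input order
    return [r["cluster"] for r in results if r["flag"]]

clusters = [
    {"name": "c1", "fan_events": []},
    {"name": "c2", "fan_events": [{"rpm": 0}]},
    {"name": "c3", "fan_events": [{"rpm": 900}, {"rpm": 850}, {"rpm": 800}]},
]
flagged = inspect_all(clusters)
print(flagged)  # ['c2', 'c3']
```

Because each record is handled independently, scaling from three clusters to 100,000 is a matter of worker count and budget rather than design.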

Another kind I really like, I call it a ruminative agent. This is something that if you use, ChatGPT I think has a feature where it thinks all night and figures out what you want to see in the morning. I don't know if it works really well. I love the idea of having a bot that looks at all the data gathered across a bunch of things and maybe uses a graph technique or uses some other kind of memory technique to then see if there's other kinds of patterns across all the clusters or other kinds of failures that have certain kinds of common characteristics. There are ways you can set up an agent that are going to work a little bit differently than just one examining each one, but looks at the patterns across them. That's to me a ruminative agent. It might do inference all night looking at those patterns to try to find the ones that are really most important ones.

They get more exotic as we go. We have middle manager agents. "Bot, please go solve for this. Here's a set of agents that you can use to achieve that. Here's a set of capabilities. What I want you to do is manage the context around all of these and do the best job you can against a measurable metric." That's the important thing, that it's a measurable metric. That way you can measure the effect of any actions it takes. This is going to be low-stakes stuff initially. You're not going to use this to shut down clusters or do anything really expensive at first. As you start to do this, you might find it works really well for problems where you can use a test function to decide whether to continue: after it tries a couple of versions of the actions it dictates, did it actually work? If it didn't work, it can roll the change back, just like one of us would do.
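The act, measure, roll back loop can be sketched in a few lines. This is an illustration under invented assumptions: the state is a plain dictionary, `utilization` is a made-up metric, and the actions are hard-coded where a manager agent would propose them.

```python
def try_actions(state, actions, metric, min_gain=0.0):
    """Apply each action; keep it only if the metric improves, else roll back."""
    best = metric(state)
    kept = []
    for name, action in actions:
        candidate = action(dict(state))  # act on a copy so rollback is trivial
        score = metric(candidate)
        if score > best + min_gain:
            state, best = candidate, score
            kept.append(name)
        # otherwise the candidate is discarded, which is the rollback
    return state, kept

# Hypothetical low-stakes example: tune batch size to raise utilization.
def utilization(s):
    penalty = 0.3 if s["batch_size"] > 64 else 0.0
    return min(1.0, s["batch_size"] / 64) - penalty

actions = [
    ("double_batch", lambda s: {**s, "batch_size": s["batch_size"] * 2}),
    ("double_again", lambda s: {**s, "batch_size": s["batch_size"] * 2}),
    ("double_once_more", lambda s: {**s, "batch_size": s["batch_size"] * 2}),
]
final_state, kept = try_actions({"batch_size": 16}, actions, utilization)
print(final_state, kept)  # {'batch_size': 64} ['double_batch', 'double_again']
```

The third action overshoots, scores worse, and is dropped, which is the behavior the measurable metric makes possible.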

Another kind I like, and this is more of a passive agent, is the consultant agent. This is the agent that asks the other agents, what is it you do here? How is it that you communicate across the system? Are you talking about quantities of money that are outside the boundaries we would normally talk about? Think of the Delta Airlines agent that apparently, if you gave it the right prompt, would give you free first-class upgrades. Does anybody remember that prompt? I just need to know for science and for my next flight, please. Whatever you did to get past that, please let me know. The idea is that you have a consultant agent that can go in and understand the patterns of communication. Are we using language now that's not great? Are we using biased language? Are we giving away refunds that are too large? Some people call this an observer agent. That's another name for the pattern. It's something you can employ when you're looking at how different agents talk to each other, or even at the reasoning traces of some of the newer models that have more detailed reasoning in them.

You have tool selector agents. This addresses part of the issue when you have too many possible tools; you can actually have a tool selector that understands the details of the tools and maps the right tool to the right task. As we develop more sophisticated agentic systems, instead of following a fixed workflow like "do these five steps in order using these five things," you might say, "We're going to do something like what Jim Fan does with his Voyager paper, where we allow the AI agent to construct the different things it needs in terms of the workflow, construct its own workflow to achieve some end." Part of doing that is a tool selector agent that knows how to take its intended action and actually convert that into the right outcome. This is a really important agent. I think what we're starting to see, especially in Claude skills, where there's an idea that you have to pick the right skill and think about how to pick the right one for the right case. Then, of course, the director agent. The director agent is what we envision at the top of that chain, where you're saying, "I have this intent. I don't even know how to measure the intent other than I have this maybe top-level metric I care about. I'm going to talk to multiple manager agents. I'm going to delegate as appropriate. I'm going to try to create some outcome." I think this is aspirational. I don't know that many people have achieved this yet. I think this is absolutely achievable in more closed domains, where you understand a lot of the outcomes and what the failure modes are. You'll start to see this more and more over things that are significantly more complex.

What About Hallucinations?

The question in the room, the elephant in the room, as they call it, is how do you make them more accurate? How do you make them useful? How do you make them not hallucinate? The hallucination might be just making up facts. It might be writing the wrong query. It might be generating the wrong SQL, with results that don't make any sense. One of the first arguments you get is this one. Who here goes on LinkedIn sometimes? Who here has seen that post? The R's in strawberry. We know how to solve that problem. It's a really easy problem to solve. Just use ChatGPT 5. No, actually, for about a year and a half now you've been able to put in a system prompt and tell ChatGPT, "Please use Python when counting things" or "Please use Python with anything that involves math." It could generate Python inline that would just count the number of R's. You've been able to get the right answer to this for quite some time. It turns out that if you allow it to write code, it does a better job with how language works and, in some cases, with how math works.
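For reference, the kind of code a model can write for itself when told to use Python for counting is tiny. The function name is arbitrary.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```

The answer comes from a deterministic string operation rather than from the model's token-level guess.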

Another way I think about this: who here does long division in their head? One person does long division in their head. Why? Why would you do that when we have perfectly good things called calculators? We use calculators; they're deterministic. They know how to do math; they're really good at that. We don't reinvent how to do math every single time we need to solve an equation. Even before computers, we would look at a times table. You use a calculator. Imagine you are Delta Airlines, or whatever airline it was, and you have an agent whose job is to make the customer happy. Would you just let it use an unlimited budget to do that? No. You would give it guardrails. If you had a brand-new associate, a human person, working the customer service desk at an airline, you would say, "I don't care what reasoning you came to; you are not giving any refunds over this amount." You would have a hard guardrail in the system, a deterministic guardrail. That's not an AI thing; that's just a rule: you cannot give refunds over this amount. Maybe somebody could trick the agent into giving a bunch of smaller refunds somehow, but you could manage that just like you would with a human. You would give them guardrails, they'd have limits. You would govern them with humans. This is obvious, but it's important, because I think a lot of people get on stage and hope that AI agents will require no humans. I don't think that's true.
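A deterministic guardrail of this kind is ordinary code that sits between the agent and the payment action. The sketch below assumes two limits, a per-refund cap and a cumulative per-customer cap (the second addresses the "bunch of smaller refunds" case); the class name and amounts are hypothetical.

```python
class RefundGuardrail:
    """Hard limits enforced in code, outside the agent's reasoning."""

    def __init__(self, per_refund_cap, per_customer_cap):
        self.per_refund_cap = per_refund_cap
        self.per_customer_cap = per_customer_cap
        self.totals = {}  # running refund total per customer

    def approve(self, customer_id, amount):
        """Return True and record the refund only if both caps are respected."""
        if amount <= 0 or amount > self.per_refund_cap:
            return False
        total = self.totals.get(customer_id, 0) + amount
        if total > self.per_customer_cap:  # blocks many small refunds
            return False
        self.totals[customer_id] = total
        return True

guard = RefundGuardrail(per_refund_cap=200, per_customer_cap=300)
decisions = [guard.approve("c1", amount) for amount in (500, 150, 100, 100)]
print(decisions)  # [False, True, True, False]
```

Whatever the agent's reasoning trace says, the refund only happens if `approve` returns True; rejected requests could then be escalated to a human.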

Another example. Imagine you are managing a large cluster of any kind of computer or any kind of large distributed system. Let's say you do a routine operation, and do you reinvent how you fix the DNS server every single time? Do you look up how DNS works and look at the network architecture diagram and figure it out on your own? No. You use a runbook. Who here uses runbooks and still has AI? How else are you going to fix the AI if you don't have a runbook? No, you have other AI. No, that's not going to work either. Use a deterministic runbook; we don't have to reinvent this every time. One of the other things, and this gets to the theory, why I care about this in terms of weaving together deterministic systems that know how to ground the AI, along with AI that knows how to maybe discover new ways to accomplish something. If you give it the right combination of tools, the AI can discover the right combination of tools. It can use the deterministic tools to actually get reliability. That's the theory here.

We see this in practice. We see this in practice with systems like deep research. If deep research only looked at one document on the way to doing its query, it would almost always get wrong results, because the whole point of deep research is this continually re-grounding the query in real-world data to understand, step one, understand this fact. Step two, maybe do some rumination. Then step three, four, go and check it again. It does this over and over again until it gets the answer that you want. You can put as many tokens as you want at this. If it's designed well, you get this scalability where the longer you let it run, the more you get correct answers. It's actually amazing that this works. In fact, we built this blueprint in NVIDIA so that deep research isn't just something you have to go to OpenAI or some other company to get. You can actually build this into your products. If you want to have deep research in something like a thing that figures out why the GPU is going badly, or any other hard question, you could embed this into your product. Deep research is just a pattern. It's not necessarily a product that you have to buy from some other vendor.

This gets to the real point here. Most effective AI agents, they have access to useful tools, they're governed by guardrails, they have feedback loops in them. The feedback loops where you understand what queries went badly or what things didn't work are that thing that allows you to improve it. AI agents built well are the first kind of software I've seen in a long career that theoretically should get better the more that you use them. If it's a well-designed system, the system should get more accurate the more that you use them. The more training it gets, the more ability it gets to handle edge cases. When I think about this, I think about what the platforms look like. I think there really is a dividing line. I think there's two important layers in the AI platforms of the future. There is a tools layer. The tools layer is made of deterministic software. It may be written by AI, guided by a human. If it's the transaction system at, say, a payment processor, that might just be written by a human. It might be old-school software as of three years ago where it just does the thing. It's totally cool. Everything doesn't have to be AI. Even as an NVIDIA employee, I will say, not everything has to be AI. I know, it's controversial as heck.

Then the AI agents on the other side allow you to do the stochastic things, the fuzzy things, like interpreting a fuzzy input and trying to figure out what category it falls into. What is the best guess of where this belongs? Then maybe route the call, or route the sales lead, or route the "I think something's wrong with this GPU, but I don't know why." Maybe this other system can give its best guess. Maybe you can give it to three different systems and then come out with the right answer. Those agents are going to work with tools that gather the data in a very deterministic, very defined way. That's one of the most important things about systems like these. I think a lot of the conversation in the industry, where people say agents don't work or agents are hard, comes from a desire to say, why don't we just have AI agents do everything? Have AI agents construct every single tool call they'll ever make, rewrite every single tool call they'll ever make, or don't give them deterministic tools and have them regenerate the code every time. Listen, folks, this isn't magic. This is just math. I think we have to figure out the appropriate case for one or the other and build platforms that allow AI to construct, out of multiple tools, the right outcomes. Once you frame it that way, it becomes a lot easier to think about how these are going to run in production.

Every now and then, regardless of how I feel about this, I read a report where somebody says AI doesn't work. I'm in New York; who here has ever ridden in one of those things? When I do this talk in San Francisco, it's two-thirds of the room. This is what I do. Every time I'm feeling bad or wondering, is this real? I book one of those, and I just ride around for a while. It's almost like therapy for people who are worried about the bubble or whatever. I go on one of those, and I think about how much reinforcement learning it had to undergo to be able to cross a street where there are crowds going back and forth, or to drive through San Francisco's Tenderloin. If you've ever driven through that, and I try to avoid it, people just walk around wherever. It's like New York almost. No, I book a Waymo, and it makes me feel better. We have at least one existence proof that you can deliver an AI agent that can safely carry humans from one location to another in very unexpected conditions. If we can build AI agents that can transport us, I think we can build AI agents that can look at a GPU cluster and know what the problem is, or at least have a pretty good idea of what the problem is.

We can create autonomous systems that aren't vehicles, that just operate on information. I don't buy the hype, though; I don't think we should expect immediate success. This is a problem with some AI technologies, especially large language models (LLMs). LLMs have conditioned us to expect immediate results without effort. An LLM's output often appears correct, with professional writing and a polite tone, like something you'd write to your CEO. Why do people like it so much? Because it looks good, which is exactly why you need to check it. As a human, I've written reports with incorrect information. I've hallucinated, but it looked good, so people believed me. That's the issue with LLMs: the payoff comes too quickly, whereas agent development is like software development and takes sustained work. The key difference lies in the failure modes, which we must address to improve performance over time.

What Makes Good Agent Problems?

Perhaps you're not working on self-driving cars or data centers. What are good agent problems? My favorite type is "dumb diamonds." For example, ensuring the cover of a TPS report is properly filled out. These issues aren't difficult to solve, but they occasionally require human oversight. Your former boss might ask, "Is this filled out correctly?" Yes, it looks fine. Many business problems fall into this category, like the decision diamonds in a flowchart. Humans mainly verify correctness but occasionally spot exceptions. This is crucial. Another type I enjoy is classification problems. LLMs excel at categorizing items, e.g., "Is this A, B, or C?" Vision models can perform a similar task by analyzing images and deciding if they match predefined categories. These are excellent agent problems. They don't need complex OODA loops; a simple agent with three steps can classify data, automating tasks that previously required human intervention.

Content organizers. This is one of my favorite types because our team developed a system called Codex, which predates OpenAI's Codex. Julie works on it daily. The pattern is: rewrite content Y into format Z using prompt X. We call this technique template RAG. Our system gathers all meeting transcripts from Microsoft Teams. Unfortunately, NVIDIA also uses Teams. We feed the transcripts into a template that defines the wiki structure. After processing, the output is a detailed meeting summary formatted exactly as desired. Later, if you need a different view, such as the impact on an industry or a team, you can rerun the same content through other templates, or through tools that add metadata or additional information. This creates an organized, easily readable wiki of content, similar to how a note-taker organizes information for clarity.
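To make the "content Y into format Z using prompt X" idea concrete, here is a minimal sketch. It is not the Codex system from the talk: the template text, the section names, and the `llm` callable are all hypothetical stand-ins, and a real pipeline would call an actual model.

```python
from string import Template
from typing import Callable

# Hypothetical wiki template; the real template format is not described in the talk.
MEETING_SUMMARY_TEMPLATE = Template(
    "Rewrite the transcript below as a wiki page with these sections:\n"
    "$sections\n\n"
    "Transcript:\n$transcript\n"
)


def build_prompt(transcript: str, sections: list[str]) -> str:
    """Combine the content (Y) and the target format (Z) into one prompt (X)."""
    bullet_list = "\n".join(f"- {name}" for name in sections)
    return MEETING_SUMMARY_TEMPLATE.substitute(
        sections=bullet_list, transcript=transcript.strip()
    )


def rewrite(transcript: str, sections: list[str], llm: Callable[[str], str]) -> str:
    """`llm` is any callable that maps a prompt string to generated text (assumed)."""
    return llm(build_prompt(transcript, sections))
```

Rerunning the same transcript with a different `sections` list is the "new perspective" step: the content stays fixed and only the template changes.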

Another system is a scaled inspector. I mentioned this briefly earlier. Examine every X, whether it's GPU clusters, transactions in a stream, or any high-volume data. Identify specific conditions and allow the AI to determine if something is suspicious—e.g., a transaction pattern indicating fraud or a GPU cluster showing signs of failure. Extend this approach to various potential issues, giving the AI flexibility to suggest alternative explanations you hadn't considered. With limited budget, this method allows continuous monitoring of vast amounts of data. I find this a powerful technique.
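A scaled inspector can be sketched as a cheap deterministic filter in front of a model call. The sketch below is an illustration under assumptions: the item shape (a dict with an `"id"` key), the condition names, and the `judge` callable are hypothetical, and `judge` stands in for whatever model you would ask about the flagged item.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Finding:
    item_id: str
    reason: str   # name of the deterministic condition that fired
    verdict: str  # what the judge (for example, an LLM) said about the item


def inspect_stream(
    items: Iterable[dict],
    conditions: dict[str, Callable[[dict], bool]],
    judge: Callable[[dict, str], str],
) -> Iterator[Finding]:
    """Examine every item; only items that trip a condition reach the judge."""
    for item in items:
        for name, check in conditions.items():
            if check(item):
                yield Finding(item["id"], name, judge(item, name))
                break  # one finding per item is enough for this sketch
```

Because the expensive call only happens for flagged items, the cost scales with the number of suspicious items rather than with the size of the stream.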

Another type involves technologies that aren't necessarily large language models (LLMs), and I want to discuss this topic because it points to another common mistake in the industry: equating AI with LLMs. LLMs are a subclass of AI, specifically a very small subclass. They simply happen to receive a lot of attention now. Constraint navigators are fascinating; they can work through a solution space that is extremely difficult to search. For instance, we have a problem where we need to reorganize our GPU clusters. It's akin to a bin packing problem, but with tens of thousands of bins and millions of GPUs that must be arranged in very specific ways, ensuring that if you're running a thousand GPUs for a large training job, they are all in the same building. This is a challenging bin packing problem. It's somewhat similar to the game of Go, where searching the entire solution space of moves is impractical. That's why AlphaGo was such an important discovery: it demonstrated the ability to statistically determine the best possible move given the constraints and rules of the game. By honoring these constraints and understanding the rules, you can solve very complex problems. The result may not always be the optimal solution, but it gives you a best guess for how to reorganize the GPUs among the many possibilities.
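For a feel of the constraint, here is a toy version of the placement problem. It uses a plain greedy first-fit-decreasing heuristic, not the AlphaGo-style search described in the talk, and the job names, building names, and capacities are made up for illustration.

```python
from typing import Optional


def place_jobs(jobs: dict[str, int], capacity: dict[str, int]) -> dict[str, Optional[str]]:
    """Assign each job to a single building (first-fit decreasing).

    jobs maps job name -> GPUs needed; capacity maps building -> free GPUs.
    A job is never split across buildings; unplaceable jobs map to None.
    """
    free = dict(capacity)  # copy so the caller's dict is not mutated
    placement: dict[str, Optional[str]] = {}
    # Place the largest jobs first: they have the fewest feasible buildings.
    for job, need in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        placement[job] = None
        for building, available in free.items():
            if available >= need:
                free[building] = available - need
                placement[job] = building
                break
    return placement
```

A greedy pass like this honors the "same building" rule but can miss placements that a smarter search would find, which is the gap a learned or search-based solver is meant to close at the scale of tens of thousands of bins.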

My advice to people is to start with small, composable skills. You can achieve this with Claude skills now. They are quite powerful. Chain two or three together, and over time, you will begin to build skills that compose more of them, making you more comfortable. Experiment with your toolset and explore what you can do within the safe boundaries of your goals. This empowers you to say, "Now we can tackle larger problems" once you become proficient in smaller agents. I say this because you've likely read studies showing that only 5% of AI projects succeed, which is impressive. However, 95% fail, often due to being overly ambitious. That's why I recommend starting small, getting comfortable with it, and then moving forward.

Diverse AIs

This brings me to the point I truly want to emphasize: there is more than one type of AI. I think about AlphaGo-style solvers, which come in various combinations. There's one that generates code, using genetic algorithms and an AlphaGo-style solver for more advanced code generation. This goes beyond what you can do with an LLM. One of the things my team develops is called a time series foundation model. I'll discuss this further. Protein language models are another area; who here knows what those are? These models allow you to discover novel molecules that might cure diseases. Here is how they work: instead of using words or human language, they use the language of DNA and chemistry, employing the same transformer architecture to identify the best molecule to treat a disease or react to an antigen. We can also determine if a molecule might cause increased liver toxicity, helping drug companies avoid pursuing drugs that would be too toxic to the liver to gain FDA approval. This is just one example; there are many others.

In fact, we developed a system similar to the OODA loop we discussed earlier, which processes healthcare data. Take your patient's medical record and their presenting condition. Has anyone ever been to the emergency room? Did you know that the doctor at the ER is essentially a "vibe doctor"? You thought vibe coding was bad. Let's talk about the vibe doctor. Their job is to observe you for five minutes and deduce whether you can't breathe and need CPR. Or, they might identify a condition based on parts of the medical record. Maybe I have time to look it up. I probably don't. Then, they decide which medication you should receive. This should terrify you. Today, AI doesn't help us scale to consider the second, third, fourth, or fifth most likely option, nor does it help catch issues the doctor overlooked before the course of action is decided. In my vision of the regulated medical space, where AI faces significant challenges, I believe this will be one of the most critical applications. Microsoft has a video called Health Superintelligence, which applies a multi-agent model to healthcare.

Healthcare is inherently stochastic. When doctors diagnose patients and are 45% correct, that's considered good. That means that if AI can get to 80% correct, it is a large improvement. An accuracy rate that we would never accept in finance is still about twice as high as what we get in healthcare decisions. This is crucial to consider when thinking about the power of AI: where are the questions that are stochastic, with the bar set so low? If we're 80% right, it's significantly better than before. Autonomous driving is similar. It's not perfect, but human drivers are terrible, so it only needs to clear the minimum standard of not being worse than a human driver. In fact, it's been shown to be about 10 times better on a per-mile basis. These are the truly interesting areas where AI can make a significant impact.

I think of domain-specific reasoning models. Imagine a general reasoning model that incorporates Chain of Thought directly into its architecture. This is achievable. There are already mathematical reasoning models. We’ve created biological reasoning models too. We’ve explored various types of reasoning models. There are smaller models that you can use to check a larger one, thereby improving the performance of the larger model. I spoke with a healthcare company that planned to invest $1.5 billion in training a model, and I upset the sales team by suggesting that they might achieve similar results with a model one-tenth the size, coupled with a protein reasoning model. It actually worked out quite well. They thanked me, and I jokingly asked if they could pay me a commission for my suggestion. They declined, but the point stands: you might achieve better results with fewer resources using reasoning models.

We also consider world models. Yann LeCun, now at Meta, discusses world models, highlighting that interacting with the world for just ten minutes can teach you far more than an LLM would ever know. As we look at advancements over the next five years, it will be about using world models that accumulate real-world experience. We have a model called Cosmos that you can further train. It is an early version of what we’ll truly see in five years: highly detailed world models that understand not just word relationships, but also object relationships in three-dimensional or four-dimensional space. While it may sound like science fiction, these models are significantly larger, potentially having hundreds of trillions or even quadrillions of parameters to grasp all this information, and we currently lack the computational power to achieve this. That’s why we’re aggressively building out data centers worldwide.

## NV-Tesseract and Open-Source Models

I wouldn’t be here if I weren’t promoting something I work on. Our team develops a project called Tesseract. Tesseract is a new model, a time series transformer. Has anyone seen one of these before? Time series transformer models. ChatGPT predicts the next paragraph after the context you provide. It’s trained on a vast corpus of global text. You input your question, and it predicts the answer. It does this quite well. Time series transformers operate differently. They’re trained on concepts of time and the relationships between data over time. A lot of data is fed into one of these models, focused solely on time series data, allowing you to forecast based on patterns within the data. Using this approach, we perform tasks such as anomaly detection. On a production line, we can pinpoint where an anomaly began based on signal patterns and data patterns, which tells you that every object produced after that point likely contains an error.
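The "pinpoint where the anomaly began" step can be illustrated without a transformer. The sketch below swaps the foundation model for a plain statistical baseline (mean and standard deviation of an assumed-clean reference window), so it shows only the onset-detection logic, not how Tesseract works. The function name and all parameters are hypothetical.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def anomaly_onset(
    series: Sequence[float],
    n_ref: int = 10,        # leading points assumed to be normal behavior
    threshold: float = 3.0,  # how many standard deviations counts as abnormal
    persist: int = 3,        # consecutive abnormal points required
) -> Optional[int]:
    """Return the index where a sustained deviation starts, or None."""
    reference = series[:n_ref]
    mu = mean(reference)
    sigma = stdev(reference) or 1e-9  # avoid a zero band on a flat reference
    run_start = None
    for i in range(n_ref, len(series)):
        if abs(series[i] - mu) > threshold * sigma:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= persist:
                return run_start  # everything from here on is suspect
        else:
            run_start = None  # an isolated spike does not count
    return None
```

Requiring `persist` consecutive abnormal points is a design choice: it ignores one-off spikes and reports the first point of a sustained shift, which is the point after which produced items would be treated as suspect.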

We can also save inventory by predicting demand. This technology works in financial services, supply chain management, and many other sectors where large datasets exist and the patterns aren’t always clearly understood but are discoverable. By training these models similarly to how you would train an LLM, but on time series data, they can reveal new insights and enable actual forecasting. These models can become surprisingly accurate. Training them might be slightly more costly; post-training one of these time series foundation models could cost a million dollars. Unlike human language models, they require additional post-training. Once you’ve done this, if you make economically valuable decisions at scale, this technique is among the best for better anomaly detection or forecasting, providing a clearer understanding of what will happen. These are models that predict the future. I’m glad to work on them.

Another exciting aspect is a new open-source model we’ve developed, Nemotron. Everyone is constantly releasing new models. Qwen is coming out, followed by Llama, and others. I don’t know if there will be another Llama or if the next one will be exceptional. I’m uncertain whether we’ll simply have to adopt whatever OpenAI decides to open-source, which will not be their state-of-the-art models; they lack the incentive to release those. However, what we aim for is to have the best possible open-source models available in the industry to set the minimum standard. Nemotron allows us to say that not only do we have that, but we also have a state-of-the-art model with open weights. We provide you with all the data we used to train it, enabling you to post-train it, fine-tune it, study it, or do anything else. This aligns closely with NVIDIA’s open-source strategy, which aims for the entire market’s success. To achieve this, we need as much open-source software as possible. This year, we’ve been investing so heavily in open-source projects that almost everything we do revolves around more open-source initiatives.

## Summary

I'll start summarizing this a little bit. Determinism is good. Determinism, I hear sometimes, is a word people use when they don't want to use AI. I think it's a false dichotomy. Determinism is good. Stochastic systems are good. Non-determinism is good. Just use each one for what you think is the right thing. We don't have to be purists about it. Dumb agents are fine. I like dumb agents to start because they're probably going to work. I care about, does it work? As you go up the learning curve of figuring out how these work, start with dumb agents. Great agents are defined not just by automating a workflow. We've known how to automate for a long time. They are defined by what they allow you to discover. New ways of working. New ways of understanding why something fails. This is where systems get better over time, because they can discover things. Then when you find out something works, you can just make that part of the system. I think that's incredible. The feedback mechanism can make it better through the way that it's used.

Bottom up over top down, one of the biggest reasons these things fail. I think it was even pointed out in that MIT study that when it's top down from the CEO that doesn't have enough context on how things work, there's all sorts of failure modes. They may not understand how much institutional knowledge is required to do something effectively. They just think the process is check, check, check, and it's done. When it turns out, bottom up, they actually know all the complexities. I actually would rather have the agents built on the person that's actually working on that work day-to-day at the coalface than some hypothetical way that the CEO might understand. I love CEOs, but they don't always understand the details of everything about how a company is run. Rare context, that is the specific way that your company might understand a topic or the way that your company might understand how a zombie node works. Go back to that point earlier. It's that rare context that is the thing that the LLM would never know that's going to be really important for you to employ in these systems if you want them to be effective. Because people are going to talk to these systems or have language in these systems that is very nuanced to your company or to your organization.

Mercilessly use evals. If you don't have evals, you're not serious about what you're doing. You're just vibe checking. That's not good enough. Use evals. LLM as a judge is pretty good. There are other, better ways to do it. If you're not measuring accuracy, I don't think you're serious about what you're doing. Design the system to improve over time. Feedback loops, this is how systems in general get better. Even if we weren't working with AI, we would say feedback loops matter. That's why we invested in Agile. To get feedback to make it better. We can automate that process. We can get feedback on whether the answer was right or not. We can get that little bit of feedback from humans. That's why every AI app has an up-down button. If those bits of feedback are used and are rolled back into the system, ideally in an automated way, it gets better with use. Really important.
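The smallest useful eval is a fixed set of cases plus a measured accuracy. The sketch below assumes exact-match scoring and a hypothetical `agent` callable; an LLM-as-a-judge setup would replace the `==` comparison with a model call, which is not shown here.

```python
from typing import Callable


def run_evals(
    agent: Callable[[str], str],
    cases: list[tuple[str, str]],
) -> tuple[float, list[tuple[str, str, str]]]:
    """Run every (question, expected) case and return (accuracy, failures).

    Each failure is (question, expected, actual), so it can be reviewed
    or fed back into the system as a new regression case.
    """
    failures = []
    for question, expected in cases:
        actual = agent(question)
        if actual != expected:
            failures.append((question, expected, actual))
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures
```

Thumbs-up and thumbs-down feedback fits the same shape: each downvoted answer, once corrected, becomes another `(question, expected)` pair in `cases`, so the eval set grows with use.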

Final Note

I will wrap this by saying the world will belong to those with the wildest imaginations. This gets back to the theme of our keynote speaker, which is, there are AI systems that might be able to connect two distant ideas, might be able to connect chocolate and peanut butter together, but they don't yet. I know some people have ideas for that, but for now, the people that can connect two distant ideas together into some unified idea, say for example chips and AI, that can connect these two things together in a way that wasn't done before, I think those are going to be the people that do the most innovation over time. I really encourage you to not limit yourself with what's possible. Limit yourself with what evals tell you doesn't work, but keep your mind open to a lot of these ideas.

Recorded at:

May 27, 2026

by

[Aaron Erickson](https://www.infoq.com/profile/Aaron-Erickson/)

Co-Founder and CEO at Orgspace