为何（资深）工程师难以构建AI Agent — Philipp Schmid，Google DeepMind

AI Engineer

AI Engineer视频2026年5月30日

为何（资深）工程师难以构建AI Agent — Philipp Schmid，Google DeepMind

7.8内容质量

可直接观看的视频资源打开原视频

TL;DR · AI 摘要

资深工程师难建AI Agent主因是开发范式从确定性编程转向提示-反馈迭代；文本成为新状态载体，工程师需从‘交通管制员’转为‘调度员’。

核心要点

AI Agent开发采用‘定义目标→运行→观察→调提示/工具→再运行’迭代闭环，非传统线性流程。
LLM使文本承载语义状态（如研究计划），支持动态约束（如‘聚焦美国市场但排除加州’），无法用布尔标志建模。
工程师角色须从‘交通管制员’（控每步）转为‘调度员’（仅设目标），接受非确定但有效的执行路径。

结构提纲

按章节快速跳转。

§开发范式根本转变：从线性流程到迭代探索
构建AI Agent不再遵循传统软件的PRD→编码→测试→部署流程，而是采用‘定义指令→运行→观察→调整提示/工具→重试’的闭环迭代模式。
·文本作为新状态载体
LLM使自然语言文本能承载语义状态（如研究计划、用户偏好），替代了传统布尔标志与结构化数据，支持动态、细粒度的上下文调整。
·工程师角色转型：从交通管制员到调度员
工程师不再精确控制Agent执行路径，而是仅设定目标（如‘去伦敦’），由Agent自主选择方式（火车/飞机/开车），接受非确定性但有效的结果。
·记忆与个性化无法结构化建模
用户偏好（如温度单位切换）等场景无法通过固定字段表达，必须依赖上下文理解与自然语言指令实现动态适配。

思维导图

用一张图看清主题之间的关系。

查看大纲文本（无障碍 / 无 JS 友好）

为何资深工程师难建AI Agent
- 开发范式转变
  - 线性流程 → 迭代反馈环
  - 确定性 → 非确定性结果
- 状态表示革新
  - 文本替代结构化数据
  - 语义可动态扩展（如用户偏好）
- 角色认知重构
  - 交通管制员 → 调度员
  - 控制步骤 → 定义目标

金句 / Highlights

值得收藏与分享的关键句。

传统软件中我们是交通管制员——控制红绿灯、车速与道路；现在构建Agent时我们是调度员——只说‘我要去伦敦’，不规定走哪条路。
— 第1:23–1:46段
⬇︎ 下载 PNG 𝕏 分享到 X
文本已成为新状态：一个深研Agent返回的计划可被用户用自然语言追加约束（如‘聚焦美国市场但忽略加州’），而传统系统只能拒绝并要求重提需求。
— 第2:06–2:58段
⬇︎ 下载 PNG 𝕏 分享到 X
你可能见过编码Agent做出奇怪操作却最终达成目标——这正是我们想要的：非确定性路径下的可靠结果，而非僵化的步骤执行。
— 第1:56–2:03段
⬇︎ 下载 PNG 𝕏 分享到 X

#AI Agent#大语言模型工程#软件范式转变#提示工程

视频笔记

translation:

YouTube Transcript

language: English ( automatically generated) (en)

[0:07] [music]

[0:14] >> okay, cool. Awesome. Hi everyone.

[0:17] My name is Philip. I work at Deep Mind

[0:19] everything related to agents on Gemini

[0:21] or Gemini API. So if you have some

[0:23] questions afterwords, some concerns,

[0:25] some bugs, some issue, please let me

[0:27] know.

[0:28] We're going to talk today 10 minutes

[0:29] about why engineers struggle to build

[0:32] agents and I see this every day

[0:34] internally at Google but also externally

[0:36] at Google and I brought five example on

[0:38] like what's really different to how we

[0:40] built traditional software a few years

[0:42] ago and to now how we build agents. And

[0:45] if we like on a high level compare them,

[0:47] right? When we wrote software, we

[0:49] created a PRD, a PRD,

[0:52] wrote code, sometimes created tests to

[0:54] make sure our code works. We deployed it

[0:57] and then our user used it.

[0:59] When building agents, things are a little

[1:01] different. We define instructions on

[1:03] what we want our agent to do. We run it,

[1:06] we observe what it does, we may adjust

[1:08] our prompts, may adjust our tools.

[1:11] We run it again and we have like this

[1:13] iterative loop of how can we improve and

[1:16] make our agent way more reliable, which

[1:18] is very different to how we build

[1:20] software. And like something I like to

[1:22] compare it to is like traditional

[1:23] software is more like we acted as a

[1:25] traffic controller, right? We had

[1:27] control over the street lights or how

[1:29] fast you can go, which roads you can

[1:31] use, basic how the car drives. And

[1:34] now with agents, we are more of a

[1:36]Dispatcher. We tell the agent, " Hey, I

[1:38] want to go to London and I'm from like

[1:40] Germany. I could use the train, I could

[1:43] fly, I could use my car and go like

[1:46] under the water." And it's more about,

[1:47] " Okay, we define the goal on what we

[1:49] want the agent to do, but we don't

[1:51] define the exact step the agent needs to

[1:54] take to achieve that goal. And I mean

[1:56] every one of you has(probability seen in

[1:57] their coding agent that sometimes it

[1:59] does something very odd, but at the end it achieves the outcome. And that's what we want to do. So starting with the first example, text is our new state. I mean traditional we had data structures and everything was kind of mapped to])** or to like flags we could check. So initially when we created for example a deep research agent, deep research agent returns a plan to you, " Okay, I'm going to research this and that." In traditional software, we might have had an exact plan or denied plan, but we couldn't catch sematic meaning. And now what we have with LLMs is they can understand the sematic meaning. So for example, if I have a deep research request on like doing some market research, I can approaches the initial plan, but I can also on the same time provide additional information. So may I want to focus on like the US market and ignore California. May I want to provide something additional and not have like this multiple steps, right? Traditional it would(probability say steal steal and then it has a follow up. I might needed to provide more input, create a new plan and continue. And another good example is everything related to memory and personal personalization we cannot real be mapped to data structures, right? The example I here have is like I'm from Europe, so I most use Celsius, but what if I would like to use Fahrenheit for cooking, right? previously we might have had some flags on like the user profile is Celsius or is Europe or use Fahrenheit, but I couldn't like dynamic adjust based on the user preference, based on what I provide. So real it's all about text and context. I mean could be images, video, audio as well, but we no longer are real working in those clear structures data concepts.

[3:42] I mean could be images, video, video, but we no longer are real working in those clear structures data concepts.

[3:44] The other way is we should start hand over control and the the example which we might have from like previous customer support is like when a user reached out, " Hey, I want to cancel my subscription." I might have had]) classification model which kind of]) classified the意, " Hey, the user wants to cancel his subscription." And then I had a]) diarrยะ work of " Hey, do you want to sell it? Do you cancel the subscription?" But there was no like dynamic kind of option to react to it dynamic. And may instead of like-highlight or going through the subscription cancel flow, what if your agent like kind of]) understand the meaning and like offers something dynamic to like the subscription and the user changes their mind and now you have like a whole different意. And it's very hard to model all of those])** untranslated translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation translation

translated marked down to:

[4:47] differences and uniqueness and to to

[4:49] like all of those

[4:51] stateful workows we had before. So we

[4:54] need to like trust into the LLM or like

[4:57] hand over control that we are no longer

[4:59] working in those purely deterministicums

[5:02] environments.

[5:04] The third one is errors are just inputs.

[5:07] So if something in your agent flow

[5:10] fails, we need to treat it as a normal

[5:14] input as very similar to a user input.

[5:16] In Go, we already do this, right?

[5:19] A function call can be an error or can

[5:21] be a value and we treat them kind of

[5:23] equally. And we have to do this for

[5:25] agents very similarly. In the past, HTTP

[5:27] requests were very cheap. When some

[5:30] search, some product search failed, you

[5:32] just rerun your request, you redid all

[5:34] of the work, which was okay. But now if

[5:37] you have like an agent which takes 5

[5:39] minutes, 15 minutes and something in the

[5:42] flow breaks and you would start all all

[5:45] over, you would need to spend

[5:47] you need to spend a lot of compute again

[5:50] to like do all of the previous steps.

[5:52] And you also might lose the existing

[5:54] context. So we cannot like just start

[5:56] over the whole process. We need to kind

[5:58] of understand and treat errors

[6:00] differently, provide them back to the

[6:02] model, may have some other workarounds, some additional checks that

[6:04] we basicity understand and treat errors

[6:06] different way. And we always

[6:08] need to trace what the agent is doing,

[6:11] but we need to create on the output. We

[6:13] May the agent decides for like one

[6:16] user it needs to do like four more steps

[6:19] to do more research than for the other

[6:22] user.

[6:24] It consumes may a few more tokens, but

[6:27] at the end the outcome is real what we

[6:29] need to measure and want to measure in

[6:31] terms of success.

[6:34] And then the last part is

[6:37] agents evolve and apis don't. And

[6:41] if you have worked on the behind and if

[6:44] you have built an API,

[6:46] you might have seen a lot of methods,

[6:48] API endpoints which feel very

[6:51] self-explaining to you like delete item

[6:54] feel very self-explaining if you are

[6:56] working on like the product API. But an

[6:58] agent doesn't see the code, an agent

[6:61] doesn't have the context and the

[6:64] background from all those years from you

[6:66] working on the API. So we need to build

[6:69] apis or tools which are real agent

[6:72] ready, which are self documentable with

[6:75] sematic interfaces. I would assume if

[6:78] you have like a product microservice and

[6:81] you have a delete item endpoint with an

[6:84] ID, you don't need to like define a doc

[6:87] string what the ID is or what happens if

[6:90] something fails. But our agents only see

[6:93] like the function schemas and the doc

[6:96] strings and the tool definition. So on

[6:98] the first look, they don't real see

[6:101] what the delete item method does. That's

[6:104] why we need to real adjust to, " Hey,

[6:107] we need methods, tools which are written

[6:110] for agents to be used and not assume

[6:113] long year开发 expertise and people

[6:116] who have built the API.

[6:119] So to to sum up everything, we need

[6:122] to give trouble, but we also verify. We

[6:125] should not like try to force the model

[6:128] into this one specific work flow with

[6:131] step one do this, step two do the other

[6:134] way. We need to preserve meaning.

[6:137] everything is a context now. We no longer have those very well defined data structures for all of our applications.

[6:140] We need to design for recovery. models are not perfect. agents are not perfect, especially if we have longer running agents. There will be some very odd things happening. So you need to design for])])]))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))

[9:42] then don't only assert. agents are not [9:45] 100% reliable. We need to find the right [9:48] balance between how many times our run [9:50] need to be successful to provide it to [9:52] the user. And last but not least, build [9:55] to delete. Um the bitter lesson is what [9:59] everyone of us is learning is like [10:00] software is disposable. We are going to [10:03] build many, many times the same things [10:05] with better models, better agents. [10:08] Things will change. And [10:11] yes, [10:12] it's also available on my blog. So if [10:14] you want to like look a bit deeper with [10:16] some code examples. And if not, if you [10:18] have any questions, feel free to to [10:19] reach out to me and [10:21] perfect on time.Thanks. [10:32] >> [音乐]