Martin Fowler

Maintainability Sensors for Coding Agents


TL;DR · AI Summary

Martin Fowler discusses using various sensors to monitor and improve the maintainability of a codebase, focusing on functional correctness, architectural fitness, and internal quality.

Key Takeaways

  • Maintainability involves making it easy and low-risk to change the codebase over time.
  • Various sensors such as type checkers, ESLint, Semgrep, and test suites can help monitor code quality and maintainability.
  • Continuous integration and repeated checks are essential for detecting long-term issues.

Outline


  1. Introduce the concept of maintainability and its importance.

  2. Describe an internal analytics dashboard application scenario, including the technology stack and sensor usage.

  3. Introduce different types of sensors running at various stages of the production process.

  4. Detail basic code checking and static analysis tools used.

  5. Discuss real-time feedback and dynamic monitoring methods in continuous integration.

  6. Summarize the value of sensors in improving code quality and maintainability.



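Source URL: https://martinfowler.com/articles/sensors-for-coding-agents.html

There are usually several dimensions we want to achieve and monitor in a codebase: functional correctness (it works as intended), architectural fitness (it is fast, secure, and usable enough), and maintainability. I define maintainability here as making the codebase easy and low-risk to change over time, also known as "internal quality". So I want to be able to make changes quickly not only today, but also in the future. I also don't want to worry about introducing bugs or degrading fitness every time I make a change, or have AI make one. The first signs of cracks in the maintainability of an AI-generated codebase usually show up as a growing number of files changed for a small tweak, or as changes that start to break features that previously worked.

Internal quality problems affect AI agents in much the same way as they affect human developers. An agent working in a tangled codebase may look in the wrong place for an existing implementation, create inconsistencies because it did not notice duplicates, or be forced to load more context than a task should need.

In this article, I describe my experiments with a variety of sensors that help us and AI reflect on the maintainability of a codebase, and what I learned from them.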

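## The application

I am building an internal analytics dashboard for community managers. It reads chat space activity, engagement, and demographic data from several APIs and presents the data in a web frontend.

Figure 1: The example application: web UI, service layer, and four external APIs (Google Chat, Google People, an employee API, and the Gemini API).

The tech stack is TypeScript, NextJS, and React. The backend reads and joins data from the APIs. The application has existed for a while, but for these experiments I rebuilt it from scratch with AI.

I gave the AI almost no guides about code quality and maintainability (e.g. Markdown files), because I wanted to see how well it could do relying on sensor feedback alone.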

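## Overview of all the sensors used

Figure 2: Where sensors can run: during the initial coding session, in the pipeline, on a schedule, and as runtime feedback in production.

This is an overview of all the sensors I set up along the path to production.

### During the coding session

Sensors that run continuously alongside the agent and give fast feedback.

*   Type checker (computational)
*   ESLint (computational)
*   Semgrep, the SAST tool recommended by our internal AppSec team (computational)
*   dependency-cruiser, which runs structural rules to check internal module dependencies (computational)
*   Test suite results, including test coverage (computational, although the test suite was generated by AI and was therefore created inferentially)
*   Incremental mutation testing (computational)
*   GitLeaks in a pre-commit hook, which I also count as a sensor because it gives feedback when the agent tries to commit (computational)

### After integration: the pipeline

The same computational sensors run again in CI. The in-session sensors give the agent early feedback during development; the CI pipeline confirms the results on clean infrastructure and after integration.

### Scheduled runs

Sensors that run at a slower cadence to detect drift that accumulates over time, rather than errors that happen in a single moment.

*   Security review, with a prompt derived from our AppSec team's internal application checklist (inferential)
*   Data handling review, with a prompt describing things like "user names should never be sent to the web frontend" (inferential)
*   Dependency freshness report, which first runs a script to collect the age and activity of library dependencies and then has AI create a report with recommendations on potential upgrades, deprecations, and so on (computational and inferential)
*   Modularity and coupling review (computational and inferential)

With this background, let's dive into the first category of sensors.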

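### Base harness and models

Throughout building the application, I used Cursor, Claude Code, and OpenCode (in order of frequency). My default model was usually Claude Sonnet; I used Claude Opus for some planning and analysis tasks, and I often used Cursor's composer-2 model for implementation tasks.

## Static code analysis: Basic linting

I'll start with what I learned from using ESLint in this application. Basic linters like ESLint mainly target maintainability risks at the level of individual files and functions.

### Rules targeting typical AI failure modes

In my experience, the AI failure modes that are easiest to address with static code analysis are:

*   Maximum number of function parameters
*   File length
*   Function length
*   Cyclomatic complexity

However, these rules are not even part of ESLint's default presets, and I first had to configure their maximum values. Hopefully static analysis tools will evolve better presets that are more suitable for use with AI. Some research showed that people are also starting to publish ESLint plugins that specifically target known agent failure modes, for example Factory's plugin, which contains rules about things like requiring test files or structured logging.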
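
As a rough illustration of the four rules listed above, the following sketch shows how they might be switched on in an ESLint flat config. The rule names (`max-params`, `max-lines`, `max-lines-per-function`, `complexity`) are ESLint core rules; the `files` globs and all threshold values are placeholder assumptions, not the ones used in this project.

```typescript
// Sketch of a flat-config entry that enables size and complexity limits.
// All numeric thresholds and globs below are illustrative assumptions.
type Severity = "off" | "warn" | "error";
type RuleEntry = Severity | [Severity, ...unknown[]];

interface FlatConfigEntry {
  files?: string[];
  rules: Record<string, RuleEntry>;
}

const maintainabilityRules: FlatConfigEntry = {
  files: ["**/*.ts", "**/*.tsx"],
  rules: {
    // Maximum number of parameters per function
    "max-params": ["error", { max: 4 }],
    // File length
    "max-lines": ["error", { max: 300, skipBlankLines: true, skipComments: true }],
    // Function length
    "max-lines-per-function": ["error", { max: 50, skipBlankLines: true, skipComments: true }],
    // Cyclomatic complexity
    complexity: ["error", { max: 10 }],
  },
};

// In a real project this array would be the default export of eslint.config.js.
const eslintConfig: FlatConfigEntry[] = [maintainabilityRules];
```

Keeping the limits in one entry makes it easy to see, and review, when a threshold is raised later.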

### Self-correction guidance

A sensor is designed to provide the agent with feedback so it can correct itself. Ideally, we want to offer the agent additional context for that correction—a beneficial form of prompt injection. To achieve this, I developed a custom ESLint formatter to override some default messages—of course, with the assistance of AI.

Here’s an example of my guidance for the `no-explicit-any` warning.

```text
We aim for typing to simplify error avoidance, especially for core concepts.
But we also wish to prevent cluttering our codebase with unnecessary types. Make a judgment
call on this. If you decide not to introduce a type, suppress it with:
// eslint-disable-next-line @typescript-eslint/no-explicit-any -- (provide reason)
```

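
To make the idea of a custom formatter more concrete, here is a minimal sketch, not the formatter used in this project. An ESLint formatter is a function that receives the lint results and returns the text to print. The local `LintMessage` and `LintResult` types mirror only the fields used here, and the `GUIDANCE` map and the output layout are assumptions for illustration; the guidance text itself is the one quoted above.

```typescript
// Minimal local types mirroring the parts of ESLint's result objects used here.
interface LintMessage {
  ruleId: string | null;
  severity: 1 | 2; // 1 = warning, 2 = error
  message: string;
  line: number;
  column: number;
}

interface LintResult {
  filePath: string;
  messages: LintMessage[];
}

// Custom guidance keyed by rule ID (hypothetical structure).
const GUIDANCE = new Map<string, string>([
  [
    "@typescript-eslint/no-explicit-any",
    [
      "We aim for typing to simplify error avoidance, especially for core concepts.",
      "But we also wish to prevent cluttering our codebase with unnecessary types. Make a judgment",
      "call on this. If you decide not to introduce a type, suppress it with:",
      "// eslint-disable-next-line @typescript-eslint/no-explicit-any -- (provide reason)",
    ].join("\n"),
  ],
]);

// Prints one header line per finding, followed by custom guidance if present,
// otherwise by ESLint's default message.
function formatWithGuidance(results: LintResult[]): string {
  const lines: string[] = [];
  for (const result of results) {
    for (const msg of result.messages) {
      const level = msg.severity === 2 ? "error" : "warning";
      const rule = msg.ruleId ?? "unknown";
      lines.push(`${result.filePath}:${msg.line}:${msg.column} ${level} ${rule}`);
      lines.push(GUIDANCE.get(rule) ?? msg.message);
    }
  }
  return lines.join("\n");
}
```

To plug a function like this into ESLint, it would be the module's export and the file would be passed to the CLI via the `--format` option.
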
### Making warnings more manageable?

Static code analysis has existed for a long time, yet teams have often used it inconsistently, even when it was set up. One reason is the administrative burden that comes with it. Using this analysis effectively requires a team to keep a "clean house"; otherwise, the metrics simply become noise. Warnings like the `no-explicit-any` example above are particularly challenging because you don’t always want to fix them; it depends. Suppressing them one by one feels tedious and adds noise to the code.

With coding agents, we might now have a chance at achieving that clean baseline. In the guidance text above, the agent is instructed to make a judgment call and allowed to suppress a warning in the code. This keeps the suppressions manageable, visible, and reviewable.

For thresholds, such as the maximum number of lines or the maximum allowed cyclomatic complexity, I advised the agent in the lint message that it may slightly increase the thresholds if it believes that a refactoring is unnecessary or impossible in a specific case. This doesn’t suppress the threshold permanently; it merely increases it, so that the rule will fire again if it worsens in the future. Constraints are preserved without forcing a binary suppress-or-comply choice.

### Observations

*   Reviewing the exceptions AI created (suppressed warnings, increased thresholds) was a good starting point for my code review.
*   AI frequently decided to increase the cyclomatic complexity threshold but suggested good refactorings when I prodded it further. It was the only category where it did that, and I later found out that I didn’t have self-correction guidance in place for this rule, so there was no explicit instruction saying that a threshold increase should be the absolute exception. This indicates that custom lint messages can indeed make a significant difference.
*   Sometimes, I want to handle rules differently in various parts of the code. For instance, with `no-console`, I reprimand AI when it uses `console.log`. In the backend, I prefer it to use a logger component instead. In the frontend, I might want to avoid direct logging altogether or at least use a different logging component. This is another example of the power of self-correction guidance, and of where AI can assist with semantic judgment and management of analysis warnings.
*   I kept an eye out for examples of trade-offs between rules. The only one I’ve encountered so far was created by the `max-lines` and `max-lines-per-function` rules. I’ve seen AI perform quite a bit of useful refactoring and break code down into smaller functions and components as a result of this sensor feedback. However, in the React frontend, I’m observing a concerning trend of components with numerous properties, caused by passing values through a growing chain of smaller components. I haven’t yet gained useful insights into how well AI might handle consistent decision-making between such trade-offs.
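
The `no-console` observation above suggests that guidance may need to depend on where a rule fires. A minimal sketch of that idea follows; the `ScopedGuidance` shape, the `app/` folder prefix, and the wording of the texts are assumptions for illustration only.

```typescript
// Guidance that depends on the part of the codebase in which a rule fires.
// Folder prefixes and texts are illustrative assumptions.
interface ScopedGuidance {
  ruleId: string;
  pathPrefix: string;
  text: string;
}

const scopedGuidance: ScopedGuidance[] = [
  { ruleId: "no-console", pathPrefix: "server/", text: "Use the backend logger component instead of console.log." },
  { ruleId: "no-console", pathPrefix: "app/", text: "Avoid direct logging in frontend code." },
];

// Returns the text of the first entry whose rule and path prefix match, if any.
function guidanceFor(ruleId: string, filePath: string): string | undefined {
  return scopedGuidance.find(
    (g) => g.ruleId === ruleId && filePath.startsWith(g.pathPrefix),
  )?.text;
}
```

A custom formatter could call a lookup like this with each finding's rule ID and file path before falling back to the default message.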

### Key Takeaways

Overall, I was pleasantly surprised by how many things I can cover with static analysis. I had to remind myself repeatedly why it has been somewhat underused in the past, and what has changed: the cost-benefit balance. Costs are reduced because it’s much cheaper to create custom scripts and rules with AI. Benefits have also increased: the analysis results help me gain a preliminary understanding of numerous hygiene factors that wouldn’t occur as frequently when writing code myself, allowing me to address common AI mistakes upfront.

However, I can’t help but wonder if this might also lead to a false sense of security and an illusion of quality. After all, another reason why linters like this have been less used in the past is that they have limitations, and we have been wary of using them as a simplified indicator of quality. Static analysis cannot catch many semantic aspects of quality, and it remains to be seen if AI can adequately fill that gap in partnership with those tools. I also noticed new supposed issues in the code each time I activated a new set of rules. It was always a mix of irrelevant things and actual matters. So, I worry about feedback overload for the agent, potentially leading it into a cycle of overly engineered refactorings.

## Static code analysis: Dependency rules

Basic linting primarily focuses on quality and complexity within a file or function. Next, I began exploring sensors that could provide feedback to me and the agent regarding maintainability concerns that span across files and modules. Analysis tools in this area have historically been even less used than basic linting.

To understand the potential of sensors that can help us and AI maintain good modularity within a codebase, I investigated three areas:

*   Dependency rules (deterministic)
*   Coupling analysis (deterministic and inferential)
*   Modularity review (inferential)

Let’s start with dependency rules. I collaborated with the agent to develop a layered module structure for my application, halfway through its implementation. I asked it to help me write [`dependency-cruiser`](https://github.com/sverweij/dependency-cruiser) rules to enforce these layers.

Figure 3: Layered module structure and dependency rules

For example, one of the rules enforces that code in the clients folder never imports anything from the services folder:

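```javascript
// Fragment of the dependency-cruiser rule set. LAYERS is presumably a string
// constant, defined elsewhere in the config, that recaps the layering concept.
{
  "name": "clients-no-services",
  "comment": "API clients must not depend on the orchestration layer above them. " + LAYERS,
  "severity": "error",
  "from": { "path": "^server/clients/", "pathNot": "/__tests__/" },
  "to": { "path": "^server/services/" }
}
```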

As with the ESLint messages, I also expanded the error messages a bit to be self-correction guidance, recapping the layering concept as a whole:

ERROR clients-no-services API clients must not depend on the orchestration layer above them. [Layers: routes -> services -> clients + domain; Services orchestrate: fetch data via clients, compute via domain -- no I/O, no SDKs, no knowledge of data fetching.]
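
To show what a rule like `clients-no-services` expresses, here is a toy re-implementation of the matching idea. It is not dependency-cruiser's code: it assumes import edges have already been extracted, and it only handles the `path` and `pathNot` regular expressions from the rule above. The file names in the usage example are hypothetical.

```typescript
// Toy model of a "forbidden dependency" rule: an import edge violates the rule
// if its source matches from.path (and not from.pathNot) and its target matches to.path.
interface ForbiddenRule {
  name: string;
  from: { path: string; pathNot?: string };
  to: { path: string };
}

interface ImportEdge {
  from: string; // importing file
  to: string; // imported file
}

const clientsNoServices: ForbiddenRule = {
  name: "clients-no-services",
  from: { path: "^server/clients/", pathNot: "/__tests__/" },
  to: { path: "^server/services/" },
};

function findViolations(rule: ForbiddenRule, edges: ImportEdge[]): ImportEdge[] {
  const fromPath = new RegExp(rule.from.path);
  const fromPathNot = rule.from.pathNot ? new RegExp(rule.from.pathNot) : undefined;
  const toPath = new RegExp(rule.to.path);
  return edges.filter(
    (edge) =>
      fromPath.test(edge.from) &&
      !(fromPathNot?.test(edge.from) ?? false) &&
      toPath.test(edge.to),
  );
}

// Hypothetical edges: only the first one breaks the rule.
const exampleEdges: ImportEdge[] = [
  { from: "server/clients/chat.ts", to: "server/services/report.ts" },
  { from: "server/clients/__tests__/chat.test.ts", to: "server/services/report.ts" },
  { from: "server/services/report.ts", to: "server/clients/chat.ts" },
];
const violations = findViolations(clientsNoServices, exampleEdges);
```

The second edge is exempt because test files are excluded via `pathNot`, and the third points in the allowed direction (services may use clients).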

### Observations

*   Without AI, I would not have gotten these rules in place quickly. The tool's configuration syntax has a steep entry cost, and AI absorbed that cost almost entirely.
*   The agent violated the rules a handful of times after I introduced them, and then self-corrected based on dependency-cruiser feedback, so it did help keep my folder concepts.
*   I also used the same approach to introduce conventions for how React hooks should be structured in the frontend.
*   I had to figure out how to catch things when AI starts creating new folders outside of this structure, with a rule that requires every new file to be somewhere in the predefined folder structure.

### Main takeaways

At the point when I introduced these rules, the structuring of code into folders had already become a little bit haphazard. I could see how the rules helped the agent clean that up, and then continue to enforce these layers going forward. So I've found it quite a useful replacement for describing code structure in a markdown guide. However, tools like this are limited to what is expressible via imports, file names, and folder structure.

## Static code analysis: Coupling data

Next, I experimented with the extraction of typical coupling metrics from my codebase, i.e. the number of incoming and outgoing imports and calls per file.

I didn't use any existing tools for this. Instead, I had a coding agent write an application that creates those metrics with the help of the TypeScript compiler, so that I could have maximum flexibility to play around with this as part of my experimentation. I had it add two interfaces: a web interface with a bunch of different visualizations of those metrics for my own human consumption, and a CLI that can provide those metrics to a coding agent.
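
As a simplified picture of what "incoming and outgoing imports per file" means, the sketch below computes fan-in and fan-out from a list of import edges. It is not the application described above: it assumes the edges were already extracted and deduplicated (for example with the TypeScript compiler), and the file names are hypothetical.

```typescript
// One entry per (importing file, imported file) pair; assumed to be deduplicated.
interface ModuleImport {
  importer: string;
  imported: string;
}

interface CouplingMetrics {
  fanIn: number; // number of files that import this file
  fanOut: number; // number of files this file imports
}

function computeCoupling(imports: ModuleImport[]): Map<string, CouplingMetrics> {
  const metrics = new Map<string, CouplingMetrics>();
  const entryFor = (file: string): CouplingMetrics => {
    let entry = metrics.get(file);
    if (!entry) {
      entry = { fanIn: 0, fanOut: 0 };
      metrics.set(file, entry);
    }
    return entry;
  };
  for (const { importer, imported } of imports) {
    entryFor(importer).fanOut += 1;
    entryFor(imported).fanIn += 1;
  }
  return metrics;
}

// Hypothetical import graph for illustration.
const sampleImports: ModuleImport[] = [
  { importer: "server/routes/activity.ts", imported: "server/services/activity.ts" },
  { importer: "server/services/activity.ts", imported: "server/clients/chat.ts" },
  { importer: "server/services/activity.ts", imported: "server/domain/index.ts" },
  { importer: "server/routes/activity.ts", imported: "server/domain/index.ts" },
];
const coupling = computeCoupling(sampleImports);
```

In this sample, `server/domain/index.ts` ends up with the highest fan-in, which is the kind of "hub" such metrics surface.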


Figure 4: Coupling metrics: web visualizations and CLI for agents.

### For human consumption

Most of these visualizations are well-established concepts, like a dependency structure matrix (DSM). I found them tedious to interpret, and even though they were vibe coded and could most certainly be improved, I think that had more to do with the nature of the data. It's quite detailed data that needs a lot of context and experience to interpret and to map back to higher-level good practices. So I have a feeling that these types of tools still won't really help reduce a human's cognitive load much when reviewing codebases that were changed by AI.

### For AI consumption

I gave an agent access to this custom CLI (coupling-analyser) and asked it to create a report based on the data, including suggestions of how to improve the critical issues.

Here is an excerpt of what that prompt looked like. I'm mainly reproducing this to show that I didn't actually give it much guidance on what good or bad modularity looks like; I mostly left it to the model to interpret that:

Produce a markdown report on modularity and coupling quality for the target TypeScript codebase, grounded in actual CLI output from npx coupling-analyser, not guesswork from static browsing alone.

Gather evidence (run the CLI)

Execute the CLI and capture stdout. Use the report subcommands—combine as useful for the question: …

Write the markdown report

Use clear headings. Prefer concrete module IDs / paths and numbers quoted or paraphrased from CLI output.

Suggested sections:

  1. Context — What was analyzed
  1. Executive summary — 2–5 bullets: overall modularity posture, top 1–3 systemic issues.
  1. Findings from the tool — Summarize hotspots, top risks, notable cycles or mutual dependencies, and behavioural highlights as reported by the CLI.
  1. Interpretation (modularity lens) — Tie metrics to software design: cohesion vs. spread of change, stability vs. dependency direction, fan-in/fan-out intuition, cycle impact.
  1. Deep dives for each high and critical issue
  • What it is — Module(s), role in the system, dependency neighbours (from CLI + minimal code peek if needed).
  • Responsibilities today …
  • Why it hurts …
  • Design options (2+ where reasonable) …
  • Why the new design is better — Fewer cycles, clearer dependency direction, smaller surfaces, test seams, align with likely change vectors.
  • Future change risk — How each option reduces regression risk and makes safe evolution cheaper (concrete scenarios: “adding X”, “swapping Y”, “shipping Z independently”).

This LLM-led analysis actually pointed me to the same coupling hot spots that I would have found by looking through the visual diagrams, just in a format that was more digestible. And asking the LLM to ground its analysis in the results from the deterministic tool gave me a higher level of confidence, and probably also used less time and tokens than if the agent had scanned the codebase itself to find coupling problems.

### Observations

What the LLM found based on this data was quite lackluster (I used Claude Opus 4.7 for this):

*   It said one of the biggest issues was a factory that initializes all the necessary components, but I had introduced that factory on purpose as a component that acts like a lightweight dependency injection framework.
*   Another issue it had was with a shared (`zod`) schema between frontend and backend, which the LLM referred to as a “god module.” This is a common pattern to create an explicit contract between backend and frontend, and it is not as much of an issue when both evolve together or live in the same repository, as in my case.
*   When legitimate patterns appear as high-coupling hubs, there needs to be a way to suppress them in future analyses, otherwise they create even more noise.
*   One interesting finding it had: An `index.ts` file in the domain folder indiscriminately exposes all files in `./domain` and is imported by many places. While this is also a common pattern to create explicit contracts for a layer, it has its pros and cons, and it is at least worth investigating whether it is appropriate for this codebase.

### Main Takeaways

The examples above show that even more so than with basic linting, _good_ and _bad_ do not have a clear definition; instead, it is all about what is _appropriate_. What coupling is appropriate depends on a lot of context, not just the raw call and import graph of a codebase. Based on this small experiment, I don't have the impression that this type of coupling data is useful to AI on its own.

A more practical use I can imagine for this data is during risk triage for code reviews. When reviewing a code change made by AI, it seems useful to know the impact radius of the changed files so that you can pay more attention when, for example, a file with 10+ callers is modified. Alternatively, an AI review agent could use the data to prioritize where it spends its tokens.
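
The risk-triage idea can be sketched in a few lines. This is an assumption-laden illustration, not an existing tool: `callerCounts` stands for fan-in data from some coupling analysis, and the default of 10 simply mirrors the "10+ callers" example in the text.

```typescript
// Returns the changed files whose number of callers meets or exceeds the threshold,
// i.e. the files a reviewer (human or AI) may want to look at first.
function highImpactChanges(
  changedFiles: string[],
  callerCounts: Map<string, number>,
  minCallers = 10,
): string[] {
  return changedFiles.filter((file) => (callerCounts.get(file) ?? 0) >= minCallers);
}

// Hypothetical data for illustration.
const callerCounts = new Map<string, number>([
  ["server/domain/index.ts", 14],
  ["server/clients/chat.ts", 3],
]);
const toReviewFirst = highImpactChanges(
  ["server/domain/index.ts", "server/clients/chat.ts", "server/new-file.ts"],
  callerCounts,
);
```

Files without any recorded callers, such as newly added ones, are treated as having zero callers here; a real triage step might want to handle them separately.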

## Static Code Analysis: AI Modularity Review

The lackluster results from the coupling data experiment could have multiple reasons:

*   My prompt about what to analyze was not very specific.
*   The coupling data might not be useful to AI.
*   The coupling data is too shallow and lacks context of the full code.

So, the final thing I did was to go fully down the inferential route and use [Vlad Khononov's “Modularity Skills”](https://github.com/vladikk/modularity) to analyze the codebase design and identify modularity issues. This proved to be very fruitful! It provided me with numerous interesting pointers for refactorings that would clearly reduce the risk of future changes. I ran the skills a second time and gave them access to my coupling analysis CLI. The AI found that the data mostly confirmed its findings, but it did not gain any additional insights from it. On the contrary, it highlighted several points that the CLI had missed. It's also worth noting that the second run of the analysis (without context from the first one) uncovered another issue that the first run did not detect. This serves as a useful reminder that when it matters, it's often worth running an LLM-based analysis multiple times to get a fuller picture.

### Observations

Here are some highlights from the results (the model used was Claude Opus 4.7, the same as for the coupling analysis):
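
*   Duplicated route code: each of my three backend endpoints has its own route file, and the route implementations are almost identical. So whenever I want to change a general principle of the backend API (e.g. introduce request IDs, or change the error handling or logging approach), I have to make the change in multiple files. I had only just introduced the third endpoint, so I'd argue it was not yet time to abstract this. But in my experience, AI agents usually don't start refactoring without an explicit prompt, and will happily copy and paste when they repeat a piece of code for the third or fourth time.
*   Inconsistent ways of calling the backend, or in other words, another kind of semantic duplication. Three pages in the application need to call the backend with the same set of parameters (the selected chat space and the time range to analyze). Two of them used the same hook and general approach, but when AI introduced the third page, it deviated from that and reimplemented similar behavior in its own way. This could lead to inconsistent error handling, or again require changes to multiple files when the backend API principles change.
*   Inefficient handling of core parameters: as just mentioned, all pages in the application pass the chat space ID and the time range to the backend. When I changed how users specify the time range, I had already noticed that AI changed a large number of files for this change, more than 40! So I already knew something was off here, and the analysis confirmed it: "Problem: request parameters are repeated at every layer". The suggestion was to introduce an object that wraps these parameters. AI had already done that to some extent, but it never fully followed through on using that object, so the result was an inconsistent mess.
*   Misplaced responsibility: the review found a small piece of authentication code inside the factory, which should only be responsible for wiring modules together. It implemented a fallback to mock data when the user is not authenticated. This unexpected location increases the risk of it being missed when new routes are added.
*   Better interpretation of acceptable high-import-count "hubs": remember the "god module" findings from my earlier coupling analysis? The modularity skills also noticed these, but in both cases pointed out quite gracefully that they serve a purpose in the context of this application. I assume this is either due to the good prompts in these skills, or because this analysis actually read the contents of the code, whereas I had the other one rely only on the coupling data.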
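
The suggestion to wrap the repeated request parameters in one object could look roughly like the sketch below. All names (`AnalyticsQuery`, `TimeRange`, `toQueryString`, `describeQuery`) and the choice of ISO date strings are hypothetical; the actual codebase is not shown in this article.

```typescript
// Hypothetical parameter object bundling the values every page sends to the backend.
interface TimeRange {
  from: string; // ISO date, e.g. "2025-01-01"
  to: string;
}

interface AnalyticsQuery {
  spaceId: string;
  timeRange: TimeRange;
}

// Frontend side: a single place that serializes the query for backend calls.
function toQueryString(query: AnalyticsQuery): string {
  const params: Array<[string, string]> = [
    ["spaceId", query.spaceId],
    ["from", query.timeRange.from],
    ["to", query.timeRange.to],
  ];
  return params.map(([key, value]) => `${key}=${encodeURIComponent(value)}`).join("&");
}

// Any other layer accepts the object instead of separate arguments, so changing
// how a time range is represented does not ripple through every signature.
function describeQuery(query: AnalyticsQuery): string {
  return `space ${query.spaceId} from ${query.timeRange.from} to ${query.timeRange.to}`;
}
```

The benefit only materializes if every layer consistently takes the object, which is exactly where the review found the codebase to be inconsistent.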

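### Main takeaways

*   Dependency analysis tools like dependency-cruiser can work effectively as real-time sensors for some basic file structure and dependency direction, but their reach is limited.
*   AI modularity review is a good example of "garbage collection", and it works well when given a strong prompt. Grounding it in actual coupling data did not make much of a difference. It would be good to find a way to apply this to the changed files in a commit so that it happens earlier in the pipeline, but I have not explored that yet.
*   I ran the modularity review after building most of the codebase without doing that type of review myself. It had some quite concerning and very valid findings that would increase risk in the future. This shows that without human review and coupling expertise, plus these additional AI reviews, agents will definitely accumulate unintended technical debt.

Overall, codebase design and modularity seem to be an area where computational sensors cannot help us much, and where AI is necessary to add semantic interpretation and consider trade-offs.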

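## The test suite as a regression sensor

Tests serve many purposes: they help us think about and drive design, they document the desired behavior of the application (they are the ultimate specification!), and they help us detect regressions, i.e. they tell us when a change breaks existing functionality. Effective regression tests play an important role in the maintainability of a codebase because they make it safer to change. So in the context of maintainability sensors, this section discusses the test suite's role as a regression sensor.

When an existing test fails, we have to ask ourselves: "Did I break something by accident, so I need to change my implementation? Or did I change the behavior on purpose, so the test has to change to match the new specification?" A failing test gives AI the opportunity to ask that question. Of course, it will not always make the right decision! But a good test suite reduces the probability that AI accidentally breaks desired behavior.

In my chat analytics application, I let the agent write all of the tests over time, with almost no oversight other than manual testing and keeping an eye on test coverage. I wanted a fully AI-generated test suite so that I could analyze its regression effectiveness afterwards.

There are two main risks when generating tests with AI without reviewing them:

*   Coverage is not a sufficient indicator of test effectiveness.
*   The tests might be testing faulty behavior. This is a harder problem than checking test effectiveness, and a topic for another time. This article focuses on test effectiveness: assuming our code implements the desired behavior, do we have tests that would catch breaking code?

### What's in our toolbox?

*   Coverage ($): tracks which parts of the code the tests execute, indicating which parts are visible to the tests and which are not.
*   Property-based testing ($): finds missing logical test cases by generating many input combinations from defined properties, instead of hand-writing examples.
*   Fuzzing ($$): finds missing test cases around input robustness by throwing unexpected or malformed input at the system.
*   Mutation testing ($$): finds missing assertions by introducing small code mutations and checking whether the test suite catches them.

In my application, I used coverage and mutation testing, because property-based testing and fuzzing were not a good fit for my use case.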

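### Mutation testing

Here is a small example from my codebase to illustrate how mutation testing can help us find gaps in assertions. The agent created this diagram for me while I was analyzing the mutation testing results:

Figure 5: Mutation testing example from the codebase: analysis of the mappers and related code.

The file mappers.ts reported 100% statement coverage and 75% branch coverage, but it actually had no unit tests, and Stryker (the mutation testing tool I used) reported 13 survivors (i.e. the test suite stayed green after 13 of Stryker's code mutations). Coverage is high in this case because the codebase has one large acceptance test that ends up calling these functions. Coverage tells us that a line was executed, but it does not verify the line's effect. If a small mapping helper function like dvpToSchema changes in the future, it could potentially break the display of the data charts in the UI.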
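
The gap between "executed" and "verified" can be shown with a small, self-contained sketch. The mapper below is hypothetical (it is not `dvpToSchema`), and the mutant is written out by hand to mimic what a mutation testing tool would generate automatically.

```typescript
interface RawPoint {
  label: string;
  count: number;
  total: number;
}

interface ChartPoint {
  label: string;
  percent: number;
}

type Mapper = (raw: RawPoint) => ChartPoint;

// Hypothetical mapper under test.
const toChartPoint: Mapper = (raw) => ({
  label: raw.label,
  percent: raw.total === 0 ? 0 : (raw.count / raw.total) * 100,
});

// Hand-written mutant: the multiplication was replaced by a division.
const toChartPointMutant: Mapper = (raw) => ({
  label: raw.label,
  percent: raw.total === 0 ? 0 : raw.count / raw.total / 100,
});

// Weak test: runs the mapper, so its lines count as covered, but only checks the label.
function weakTest(mapper: Mapper): boolean {
  return mapper({ label: "Mon", count: 1, total: 4 }).label === "Mon";
}

// Stronger test: also asserts on the computed value.
function strongTest(mapper: Mapper): boolean {
  const point = mapper({ label: "Mon", count: 1, total: 4 });
  return point.label === "Mon" && point.percent === 25;
}

const mutantSurvivesWeakTest = weakTest(toChartPointMutant); // stays green: a "survivor"
const mutantKilledByStrongTest = !strongTest(toChartPointMutant); // goes red: mutant "killed"
```

Both tests execute the same lines, so a coverage report cannot tell them apart; only the mutation run reveals that the weak test asserts too little.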

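### Observations

*   AI was very helpful in analyzing mutation hot spots and in creating a prioritized plan to improve test quality.
*   Stryker writes its results into one huge JSON file. To help with the analysis and avoid accidentally clogging up the context window, I generated a custom script that lets the agent query Stryker's results efficiently. This is just one of many examples of AI helping me help AI.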
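
```text
Query a Stryker mutation testing JSON report from the command line.

Usage:
  python query_stryker.py <report.json> <command> [options]

Commands:
  summary   Overall status totals, mutation score, thresholds.
  files     Per-file breakdown, sorted by mutation score ascending by default.
  hotspots  Lines with the most surviving / no-coverage mutants.
  tests     Test effectiveness: weak, unused, or top-killer tests.
```

Examples

```bash
# 1. Overall health: mutation score, status breakdown, threshold pass/fail
python ./query_stryker.py reports/mutation/mutation.json summary

# 2. Worst files first, with action hints (strengthen assertions or add tests)
python ./query_stryker.py reports/mutation/mutation.json files --top 10 -v

# 3. Same as above, but only for files you changed in Git (repository auto-detected)
python ./query_stryker.py reports/mutation/mutation.json files --changed -v

# 4. Zoom in on one file: per line (actionable counts, sample mutators)
python ./query_stryker.py reports/mutation/mutation.json hotspots --file server/services/ai-summaries.ts --top 30
```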
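
To give a feel for what a per-file query over such a report involves, here is a minimal sketch in TypeScript; it is not the Python script above. The `MutationReport` shape (a `files` map whose entries hold `mutants` with a `status`) and the score formula (killed plus timeouts, divided by those plus survived and not covered) follow my understanding of Stryker's report format and metrics, and should be treated as assumptions to check against an actual report. Reporting 100 for files without valid mutants is my own simplification.

```typescript
// Assumed subset of the mutation testing report format.
type MutantStatus =
  | "Killed" | "Survived" | "NoCoverage" | "Timeout"
  | "CompileError" | "RuntimeError" | "Ignored" | "Pending";

interface Mutant {
  id: string;
  mutatorName: string;
  status: MutantStatus;
}

interface MutationReport {
  files: Record<string, { mutants: Mutant[] }>;
}

interface FileSummary {
  file: string;
  survived: number;
  noCoverage: number;
  score: number; // percentage of valid mutants that were detected
}

// Summarizes each file and sorts by mutation score, worst files first.
function summarizeFiles(report: MutationReport): FileSummary[] {
  const rows = Object.entries(report.files).map(([file, { mutants }]) => {
    const count = (status: MutantStatus): number =>
      mutants.filter((m) => m.status === status).length;
    const detected = count("Killed") + count("Timeout");
    const survived = count("Survived");
    const noCoverage = count("NoCoverage");
    const valid = detected + survived + noCoverage;
    // Simplification: files without valid mutants are reported as 100.
    const score = valid === 0 ? 100 : (detected / valid) * 100;
    return { file, survived, noCoverage, score };
  });
  return rows.sort((a, b) => a.score - b.score);
}
```

Returning a few compact rows like these, rather than the raw JSON, is what keeps the agent's context window free for the actual analysis.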

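### Main takeaways

Right now there seems to be a trend towards more end-to-end style acceptance tests. As mentioned at the beginning, AI has become very good at generating tests, so developers often have AI generate lots of tests without much review. Reviewing unit tests in particular can be very tedious. I'm not saying that not looking at them at all is a good thing, but I acknowledge that it is unrealistic for humans to review all the tests, and that people won't really do it. So while we search for the appropriate test pyramid / ice cream cone / muffin shape for the future of AI coding, techniques like approved scenarios are becoming more popular. As shown above, acceptance tests increase coverage but are often not very assertion-heavy, which gives us a false sense of security about test effectiveness. Mutation testing helps us monitor that gap.

Of course, mutation testing also has practical limitations: it is quite resource-intensive. In my setup, I did not run it continuously like some of my other sensors; instead, I triggered incremental runs manually.

## Conclusions and open questions

Computational sensors impressed me most at the file and function level. Cross-file concerns like modularity and coupling are a different story: the raw data by itself is very noisy and not very useful without the semantic interpretation of a language model. But with good prompts I was able to get useful output and suggestions there, and I also appreciated the possibility of presenting this information in different ways to suit different levels of experience.

What I did not really see in my experiments, but suspect might become a problem, is _conflicts between sensors_. The `max-lines` and `max-lines-per-function` rules showed some signs of tension: refactoring into smaller and smaller functions pushed complexity into chains of component properties. More trade-offs like this may be lurking, and it will be interesting to see whether this becomes a problem over time.

I did not use any _guides_ in this application, purely to see the effect of the sensors. I am curious how the balance between guides and sensors will evolve. Which guides can we remove once we are confident in our set of sensors? Do sensors make it more realistic to use weaker models? How do we keep guides and sensors consistent, and can we find a way to bundle them together to make them easier to maintain?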

In the regression testing area, my eyes have really been opened to how crucial mutation testing becomes when we make the decision to leave most of the testing to AI... And I want to stress once more that there is a whole other conversation to be had about the correctness of tests!

While some of these sensors really do increase my trust in the quality of the outcomes, they are not a magical solution to take the human totally out of the loop. But I definitely experienced an improvement in my review experience and trust level with both computational and inferential sensors as my partners.
