Building Self-Improving Tax Agents with Codex
Source URL: https://openai.com/index/building-self-improving-tax-agents-with-codex
_How Thrive Holdings and OpenAI co-developed Tax AI for Crete by fusing practitioner expertise with a Codex-driven loop_
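Real-world systems behave differently in production than in the lab, and it is hard to predict before deployment what will go wrong. Teams typically discover these failures after launch, then spend weeks inspecting edge cases, adjusting prompts, and turning production feedback into durable product improvements. That feedback loop is manual and slow, and it only improves when engineers push it forward. But today, with carefully designed evaluation infrastructure, direct access to practitioners, and the frontier agentic capabilities of Codex, you can build agents that improve themselves.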
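In this post, we explain how we used Codex to build this kind of agent. Over the past six months, OpenAI forward-deployed engineers and researchers who worked alongside engineers from Thrive Holdings to co-develop Tax AI for the more than 30 accounting firms within Crete. Tax AI helps them prepare increasingly complex tax returns. Rather than relying on engineers to find and fix each error one at a time, Tax AI uses Codex to turn production use into structured signals that fuel autonomous improvement.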
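Crete practitioners handle hundreds of thousands of tax returns each year, which requires processing millions of underlying documents. For moderately to highly complex filings, data entry alone can take eight hours per return, and it often involves messy data sources, prior-year files, and manual extraction and calculation. Practitioners told us that tax preparation is a significant bottleneck at the peak of tax season.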
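To address this, Tax AI processed 7,000 tax returns during this year's tax season for the Crete firms participating in the pilot. The system automates much of the time-consuming work of preparing 1040 and 1041 returns. More impressive still, the system itself is markedly better than the version first deployed three months ago.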
## Measurable self-improvement
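In Tax AI, practitioners upload the source documents along with any client-specific notes. Tax AI then creates a tax engine submission that awaits review. It saves practitioners about a third of their tax preparation time, drafts returns with 97% accuracy, and increases throughput by roughly 50%, giving them more time to spend with clients.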
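We can quantify this improvement by measuring how accurately Tax AI completes a return without needing follow-up corrections. We measure accuracy by checking how many returns reach 75%, 90%, or 100% correctly populated fields. At launch, only a quarter of returns reached the 75% threshold, but within six weeks that share had risen to 86%. The shares of returns reaching the 90% and 100% thresholds also grew faster over that period. These thresholds give us a practical view of how much practitioner follow-up different returns still require.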
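The threshold metric described above can be sketched in a few lines. This is a minimal illustration, not Tax AI's actual scoring code: the function names are hypothetical, and it assumes a field counts as correct only when the drafted value exactly matches the value in the filed return.

```python
from typing import Dict, Iterable, Sequence


def correct_field_rate(predicted: Dict[str, str], filed: Dict[str, str]) -> float:
    """Fraction of filed-return fields whose drafted value matches exactly."""
    if not filed:
        return 1.0  # assumption: a return with no scored fields counts as complete
    correct = sum(1 for name, value in filed.items() if predicted.get(name) == value)
    return correct / len(filed)


def share_reaching(rates: Iterable[float],
                   thresholds: Sequence[float] = (0.75, 0.90, 1.00)) -> Dict[float, float]:
    """Share of returns whose correct-field rate meets each threshold."""
    rates = list(rates)
    if not rates:
        return {t: 0.0 for t in thresholds}
    return {t: sum(r >= t for r in rates) / len(rates) for t in thresholds}


# Four illustrative returns with made-up correct-field rates.
print(share_reaching([1.0, 0.8, 0.5, 0.95]))
```

Tracking the share at each threshold over time, rather than a single average, shows how many returns still need light, moderate, or no practitioner follow-up.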
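Early on, Tax AI handled simpler work such as W-2s and 1099s. As the season progressed, it moved on to more complex returns, including K-1s, schedules, and harder edge cases. Each new capability saved more time than the last, because it took on tasks that were more complex and more time-consuming. We continue to see progress today.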
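Next, we describe how our teams co-engineered Tax AI to be self-improving by relying on three key pillars: 1) expert practitioner feedback, 2) production traces (a structured history from input to final output), and 3) a Codex-driven iteration loop built on tailored evals, enabling continuous, faster product development. We hope our experience is useful to builders in other domains where practitioner expertise is essential.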
_As Tax AI expanded to more complex filings, the share of scored returns reaching 75%, 90%, and full completion continued to rise throughout tax season._
## The problem
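As we moved into harder parts of tax preparation (such as K-1s, rental real estate schedules, and returns that require reconciling values across multiple source documents), it became clear that the real challenge was whether the product could make complex production failures visible, understandable, and actionable.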
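In the product's early stages, most corrections were made manually. Practitioners could correct the system's errors, but the product did not capture the full context: a value changed before filing might reflect a true extraction miss, a mapping issue, missing product support, or expected workflow noise. Telling these cases apart still required follow-up from the engineering team. Engineers could use coding agents, but the system was not yet designed to make meaningful use of AI in the improvement loop. We did not have enough signal to determine which hill to climb.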
- Stay close to practitioners: The people doing the work need to steer what the product learns. Their intuition and understanding reveal which errors matter and help inform which parts of the workflow are worth focusing on next.
- Build the product so production creates evidence: The product has to capture more than just inputs and outputs; it needs to capture the full path from source material, to extracted fields and provenance, to downstream submission and expert correction.
- Create a Codex-driven improvement loop: Once production issues are visible and structured, they can become findings, tailored evals, and scoped engineering tasks. Codex can then help investigate, propose changes, validate them against targeted and regression evals, and move the product forward faster than a purely manual iteration cycle.
The rental properties example below shows how that loop works in practice, walking you through how a practitioner correction becomes a structured finding, then an eval target, and finally a Codex-scoped engineering task.
## Rental property example
Rental property income is reported on Schedule E of an individual tax return. From an engineering perspective, the task of extracting it is simple to describe but hard to do well. The system has to read messy source material (handwritten notes, emails, spreadsheets, and other client files), extract the rental-property fields the system can confidently map to the tax engine, and preserve enough evidence that a practitioner can approve or correct the result. The simplified example below shows what those source files and extracted outputs might look like.
_A rental property source package is normalized into cited fields before those are mapped to downstream tax engine concepts._
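One way to picture "cited fields" is a small record that keeps each extracted value together with its provenance, and a mapper that only forwards values it can defend. The sketch below is illustrative only: `CitedField`, the `ENGINE_FIELDS` identifiers, and the 0.8 confidence cutoff are all assumptions, not Tax AI's real schema or any tax engine's actual field names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class CitedField:
    name: str                       # normalized rental-property field name
    value: str                      # extracted value, kept as text
    source_document: Optional[str]  # file the value came from, if any
    locator: Optional[str]          # page, cell, or line within that file
    confidence: float               # extractor's confidence in [0, 1]


# Hypothetical mapping from normalized names to tax-engine concepts.
ENGINE_FIELDS = {
    "rents_received": "RENTAL.RENTS_RECEIVED",
    "fair_rental_days": "RENTAL.FAIR_RENTAL_DAYS",
    "property_taxes": "RENTAL.TAXES",
}


def map_to_engine(fields: Iterable[CitedField],
                  min_confidence: float = 0.8) -> Tuple[Dict[str, str], List[CitedField]]:
    """Map confidently extracted, cited fields; leave the rest for a practitioner."""
    mapped: Dict[str, str] = {}
    needs_review: List[CitedField] = []
    for field in fields:
        engine_id = ENGINE_FIELDS.get(field.name)
        has_evidence = field.source_document is not None
        if engine_id and has_evidence and field.confidence >= min_confidence:
            mapped[engine_id] = field.value
        else:
            needs_review.append(field)  # unmapped, uncited, or low confidence
    return mapped, needs_review
```

Keeping the citation on every field is what lets a practitioner approve or correct a value by looking at the exact place it came from.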
### 1. A practitioner correction reveals a failure
A difference between the agent-predicted value and the actual value from the filed tax return might reflect a true extraction miss, but it could also be a practitioner preference, a value carried forward from a prior-year return in the tax engine, or a value introduced or changed elsewhere in the filing workflow. Practitioners helped us discern those cases so we could identify which actions required a practitioner correction or blocked a submission.
Because we could see these corrections in detail, we transformed the review process from a terminal, post-failure step into a continuous learning cycle. We designed the workflow to capture expert actions as structured data. Now, every intervention feeds the product's improvement loop by recording exactly what Tax AI proposed, what the practitioner modified, and what ultimately went into the filed return.
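A structured record of a single intervention might look like the sketch below. The `FieldIntervention` type and its field names are hypothetical; the only thing taken from the text is the idea of recording what Tax AI proposed, what the practitioner changed, and what was ultimately filed.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldIntervention:
    return_id: str
    field_name: str
    proposed: Optional[str]            # what Tax AI drafted (None if left blank)
    practitioner_value: Optional[str]  # what the practitioner entered, if anything
    filed: Optional[str]               # what ended up in the filed return

    @property
    def was_modified(self) -> bool:
        """True when the practitioner entered a value different from the draft."""
        return (self.practitioner_value is not None
                and self.practitioner_value != self.proposed)


def to_log_line(event: FieldIntervention) -> str:
    """Serialize one intervention as a JSON line for downstream analysis."""
    return json.dumps(asdict(event), sort_keys=True)


event = FieldIntervention("return-001", "fair_rental_days", None, "365", "365")
print(to_log_line(event))
```

Storing all three values separately matters because the filed value can differ from both the draft and the practitioner's edit, for example when something changes elsewhere in the filing workflow.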
### 2. Product traces turn corrections into evals
For a complex workflow like rental properties, the system has to preserve what happens between the source files and the filed return. Along that path, documents are organized, split, and classified; rental-property fields are extracted with citations back to the source material; those values are mapped into the tax engine; and practitioners may still correct them before filing. Those product-level traces make it possible to investigate where a failure occurred. To turn practitioner corrections into useful evaluation targets, the system processes them in three steps:
- Capture the difference: Tax AI’s output is compared with the filed return to produce field-level review rows that capture the expected value, predicted value, and whether the difference appears actionable.
- Group related failures: Similar review rows are grouped to separate recurring product failures from expected workflow noise. For example, repeated practitioner corrections might show that Tax AI often misses fair-rental-day fields, mishandles “other expenses,” or confuses multiple rental properties across the same source package.
- Turn repeated patterns into eval targets: Once reviewed and measured, repeated findings become clear eval targets for Codex to improve.
_Rental property review rows separate recurring product failures from expected noise, then turn the actionable cases into evaluation targets that give Codex a hill to climb._
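The three steps above can be sketched as a small pipeline. Everything here is an assumption made for illustration: the `ReviewRow` shape, the `noise_fields` set standing in for practitioner-reviewed judgments about expected workflow noise, and the `min_returns` cutoff for what counts as a repeated pattern.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional, Set


@dataclass(frozen=True)
class ReviewRow:
    return_id: str
    field_name: str
    expected: Optional[str]   # value in the filed return
    predicted: Optional[str]  # value Tax AI drafted
    actionable: bool          # False when the difference is expected noise


def build_review_rows(return_id: str,
                      predicted: Dict[str, str],
                      filed: Dict[str, str],
                      noise_fields: FrozenSet[str] = frozenset()) -> List[ReviewRow]:
    """Step 1: capture field-level differences between draft and filed return."""
    rows = []
    for name in sorted(set(predicted) | set(filed)):
        expected, drafted = filed.get(name), predicted.get(name)
        if expected != drafted:
            rows.append(ReviewRow(return_id, name, expected, drafted,
                                  actionable=name not in noise_fields))
    return rows


def eval_targets(rows: Iterable[ReviewRow], min_returns: int = 3) -> List[str]:
    """Steps 2 and 3: group actionable rows by field, keep repeated patterns."""
    returns_by_field: Dict[str, Set[str]] = {}
    for row in rows:
        if row.actionable:
            returns_by_field.setdefault(row.field_name, set()).add(row.return_id)
    return sorted(name for name, ids in returns_by_field.items()
                  if len(ids) >= min_returns)
```

Counting distinct returns rather than raw rows keeps a single unusual filing from being mistaken for a recurring product failure.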
### 3. The finding becomes a hill to climb for Codex
The third pillar is creating an engineering loop capable of acting on these new evals. This is where Codex becomes central.
Suppose our eval pipeline flags that Tax AI consistently misses the "fair rental days" field, while practitioners reliably fill it in. Because this finding has already been packaged into a targeted eval set, with representative source packages and expected outputs, Codex can investigate the root cause directly within the product scaffold.
Codex isn’t working solely with a sub-par final output. It inspects the trace, eval, repo, and skills together:
- Investigate the pipeline: Inspect source packages, extraction schemas, mapper behavior, and code paths to determine whether the issue is an unsupported field, a missed extraction pattern, a source-selection problem, a mapper gap, or a grader issue.
- Implement targeted fixes: Extend the extraction schema, improve source selection for rental-property documents, update the tax-engine mapper, or refine the grader if expected workflow noise is being counted as a failure.
- Validate and propose: Rerun the targeted eval, run broader regression suites, and surface a candidate pull request for engineering review.
- Close the loop: Turn a recurring practitioner correction into a measurable engineering task. If the evidence is ambiguous or not safely automatable, the case routes back to the product team instead of being forced through the loop.
_The end-to-end self-improvement loop: production traces surface repeated field-level corrections, which become failure signals that Codex can inspect alongside the trace, evals, repo, and skills. Actionable patterns become bounded evals and candidate product changes; ambiguous cases route back to engineers for review. Each shipped improvement creates new production evidence for the next cycle._
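The validation and routing logic at the end of the loop can be expressed as a simple gate. This is a hedged sketch rather than the team's actual tooling: the `EvalResult` scores, the `Decision` names, and the zero-regression tolerance are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROPOSE_PULL_REQUEST = "propose_pull_request"
    KEEP_ITERATING = "keep_iterating"
    ROUTE_TO_TEAM = "route_to_team"


@dataclass(frozen=True)
class EvalResult:
    targeted_before: float    # targeted eval score before the change
    targeted_after: float     # targeted eval score after the change
    regression_before: float  # broader regression suite score before
    regression_after: float   # broader regression suite score after


def decide(result: EvalResult,
           evidence_is_ambiguous: bool,
           regression_tolerance: float = 0.0) -> Decision:
    """Gate a candidate change on targeted improvement and no regression."""
    if evidence_is_ambiguous:
        # Ambiguous or not safely automatable cases go back to people.
        return Decision.ROUTE_TO_TEAM
    improved = result.targeted_after > result.targeted_before
    no_regression = (result.regression_after
                     >= result.regression_before - regression_tolerance)
    if improved and no_regression:
        return Decision.PROPOSE_PULL_REQUEST  # still subject to engineering review
    return Decision.KEEP_ITERATING
```

Note that even the passing branch only proposes a pull request; the merge decision stays with engineers.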
## How to use Codex to build this loop
The rental property example is emblematic of a broader reusable pattern: using production artifacts and traces to improve an agent’s capabilities. Given reviewed findings from production data, source traces, expected tax-engine output, relevant code examples, and eval commands as a set of inputs, Codex can materially improve performance and accuracy over weeks and months. This builds on the principles described in our work on [harness engineering](https://openai.com/index/harness-engineering/) and [Symphony](https://openai.com/index/open-source-codex-orchestration-symphony/), which walk through how to make tasks legible to Codex, provide scoped context and tools, and keep validation and human review part of the environment.
That evidence does not become a Codex task automatically. A practitioner correction may reflect an extraction miss, a mapping issue, unsupported product behavior, tax judgment, or expected workflow noise. Only after repeated differences have been reviewed and grouped into an actionable finding does the system turn them into a bounded task with a clear success condition.
We apply this automation to a bounded layer of the product. This layer performs extraction and maps source documents into tax workflows. Engineers remain responsible for architecture, product decisions, and shipping. Practitioners steer the improvement loop through the work they already do: correcting extracted values, reviewing returns, and approving final filings.
For Codex, the result is not a vague alert but a scoped engineering task with evidence, editable product surfaces, and explicit validation gates. The context for a representative rental property task can be summarized as follows:
A bounded Codex task environment separates the writable worktree [1] from read-only production context [5]. The worktree contains the scoped product surface Codex can inspect or modify [2], the targeted and regression evals that define success [3], and reusable skills/docs that encode how to run the task and respect prior decisions [4]. The read-only context provides the production trace, source documents, Tax AI prediction, finalized return, and tax-engine field documentation, so Codex can investigate the failure without mutating the underlying evidence.
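The separation between a writable worktree and read-only production context can be illustrated with a small path guard. The directory names below are hypothetical and the sketch does not describe how Codex sandboxing is actually configured; it only shows the invariant that evidence can be read but never modified.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


def _is_within(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True when path is root itself or lies somewhere beneath it."""
    return path == root or root in path.parents


@dataclass(frozen=True)
class TaskEnvironment:
    worktree: PurePosixPath           # scoped product surface, evals, skills/docs
    read_only_context: PurePosixPath  # traces, source documents, filed return

    def can_read(self, path: str) -> bool:
        candidate = PurePosixPath(path)
        if not candidate.is_absolute() or ".." in candidate.parts:
            return False  # reject relative paths and parent-directory escapes
        return (_is_within(candidate, self.worktree)
                or _is_within(candidate, self.read_only_context))

    def can_write(self, path: str) -> bool:
        candidate = PurePosixPath(path)
        if not candidate.is_absolute() or ".." in candidate.parts:
            return False
        return _is_within(candidate, self.worktree)


# Hypothetical layout for one rental-property task.
env = TaskEnvironment(PurePosixPath("/task/worktree"), PurePosixPath("/task/context"))
print(env.can_write("/task/worktree/extraction/rental.py"))  # True
print(env.can_write("/task/context/trace.json"))             # False
```

Rejecting any path that contains `..` is a deliberately conservative choice here, since pure paths are not resolved against a real filesystem.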
## Expanding to new domains
The same loop applies beyond rental properties. Rental properties took about six weeks and substantial engineering oversight to reach 90% precision and recall, but that work produced reusable abstractions, review artifacts, eval conventions, and implementation patterns that made it easier to support similarly complex schedules such as Schedule C and Schedule A.
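For readers less familiar with the two metrics, field-level precision and recall can be computed as in the sketch below. The article does not say how these were scored for rental properties, so the exact-match comparison and the function name are assumptions.

```python
from typing import Dict, Tuple


def field_precision_recall(predicted: Dict[str, str],
                           expected: Dict[str, str]) -> Tuple[float, float]:
    """Precision: share of drafted fields that are right.
    Recall: share of expected fields that were drafted correctly."""
    true_positives = sum(1 for name, value in predicted.items()
                         if expected.get(name) == value)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall


drafted = {"rents_received": "24000", "property_taxes": "3100", "insurance": "900"}
filed = {"rents_received": "24000", "property_taxes": "3100",
         "fair_rental_days": "365", "other_expenses": "450"}
print(field_precision_recall(drafted, filed))
```

Reporting both numbers separates two different failure modes: drafting values that are wrong (low precision) and leaving expected fields blank (low recall).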
Tax AI proves a path to building self-improving agents. Practitioners generate high-value feedback signals by delivering the service. Product workflows preserve those signals as structured evidence. Eval-backed engineering systems validate improvements before they reach production, and an agent-powered loop keeps the system in a continuous self-improving flow.
Thrive Holdings’ structure allows us to replicate this environment in specific industries. Holdings is both an owner and operator, so our combined engineering teams are able to work directly with practitioners and production data from inside businesses like Crete, not as a vendor but as partners. This means the technology, the product, and the service all sit under one roof to help us move faster and build exceptional products.
One senior accountant who spent 180 hours on tax prep last year spent only 15 hours on it this year. She put part of the time she saved toward calling every one of her clients and walking them through their returns, a level of high-touch service that wasn’t possible a year ago. She used the rest to take on new clients and expand into new service offerings.
Together, our teams are now using the same three-part design from Tax AI as a blueprint for building workflows in other domains across [Thrive Holdings](https://www.thriveholdings.com/): accounting workflows such as bookkeeping and audit, and operational workflows such as IT help desk automation. Across domains and industries, the broader promise of self-improving agents holds. The best agents are steered by people and learn to become more capable, more trusted, and more valuable over time.
_To learn more about the OpenAI team that worked on this project, [get in touch](https://openai.com/contact-sales/)._