Using Agent Evaluation Framework to Manage AI Coding: A Case Study of 310K-Line Codebase Refactoring


TL;DR · AI Summary

Meituan applied the 'human alignment → machine alignment' framework from agent evaluation to manage AI-generated code, enabling incremental refactoring of a 310K-line system without disrupting delivery.

Outline

  1. With over 90% of code generated by AI and expanding team size, lack of unified standards leads to rapid system decay and complexity explosion.

  2. Legacy data models lack scalability, forcing 'siloed' development that hinders fast iteration needs.

  3. Deep code rot creates 'spaghetti code'; combined with diverse backgrounds using AI, new technical debt accumulates rapidly.

  4. Leverage AI to identify tech debt, define AI-friendly coding standards, and establish SOPs for gradual refactoring.

  5. Use AI to exhaustively scan high-risk boundaries, identifying 3 P0 and 2 P1 issues with minimal resource input.

  6. Transform engineering layering and modeling rules into always-enforced AI Rules, integrated into pre-CR validation.

Mindmap

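  • Engineering governance in the AI coding era
    • Core challenges
      • 90% of code generated by AI
      • System complexity exceeding 310,000 lines
      • Diverse team backgrounds, high collaboration risk
    • Solution framework
      • Human alignment: unify team consensus
      • Human-AI alignment: constrain output with AI Rules
      • Incremental refactoring: digest technical debt along with iterations
    • Key practices
      • AI-assisted identification of P0/P1 technical debt
      • Standards turned into AI Rules for pre-submission checks
      • Responsibility boundaries codified as Skills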

Highlights

  • Engineers quickly pinpointed 10 deeply buried, nearly undetectable performance issues using AI assistance.

  • The value of experience is shifting from 'seeing everything' to 'judging what matters most'—the irreplaceable human role.

  • Digesting technical debt incrementally by embedding it into routine feature work offers a third path beyond full rewrite or special projects.

#AI Coding#Engineering Governance#Technical Debt#Agent Evaluation#Refactoring
[Managing AI Coding with Agent-Based Evaluation — Practical Experience in Refactoring 310,000 Lines of Code](https://tech.meituan.com/2026/05/07/agent-ai-coding.html)

May 7, 2026 | Author: Business R&D Platform | 7,364 words | 15-minute read

When over 90% of a team’s codebase is generated by AI, and a complex business system continues to grow rapidly to 310,000 lines of code, you’ll encounter a counterintuitive truth: AI coding does not automatically reduce complexity. Without unified standards, different developers using AI produce wildly inconsistent code styles, accelerating system decay.

This article documents how we completed this refactoring without halting business delivery. Along the way we arrived at three key insights, and we hope they offer reusable strategies for other teams.

  • Insight 1: Use agent-based evaluation to manage AI coding. Our team leads the agent evaluation business and has developed a core alignment philosophy in practice: “Human Alignment → Human-AI Alignment.” We discovered that managing AI coding follows the same fundamental logic—first establish team-wide consensus (human alignment), then formalize it into executable constraints for AI (human-AI alignment). Essentially, the same methodology is being reused across two domains.
  • Insight 2: AI is redefining the value boundary of “experience.” With AI tools, engineers quickly identified 10 performance bottlenecks—something that previously required years of experience to perceive globally. Now, every team member can rapidly develop this global awareness. The value of experience is shifting from “seeing everything” to “judging what matters.”
  • Insight 3: Technical debt can be iteratively resolved like business requirements. Industry discussions on refactoring usually fall into two extremes: total rebuild or special project approval. We propose a third path: break down technical debt into “side actions” within regular business iterations, gradually digesting it through incremental progress.

I. Background

The Agent Evaluation System has long supported multiple core business scenarios, simultaneously handling data production, workflow orchestration, quality control, and multi-person collaboration—resulting in high business and engineering complexity. Specifically, the complexity manifests in three dimensions:

  • Business still in exploratory phase, leading to highly ambiguous requirements: The entire industry is exploring agent evaluation, and users themselves don’t know how to evaluate effectively. This context results in urgent yet vague demands—urgent because rapid experimentation is needed, vague because stakeholders aren’t certain whether this path holds real value.
  • Massive and frequent iteration volume: From under 50,000 lines in June 2025, the system expanded rapidly to 310,000 lines, maintaining a high load of 16 new features per month (80% business needs + 20% technical tasks).
  • “Cartesian product”-level scenario matrix: The system supports six multimodal data evaluation types at the base layer, builds multiple core task views and fine-grained business actions at the upper layer, and integrates over ten quality inspection mechanisms. These capabilities interweave various tagging systems and dynamic assignment strategies, meaning the system must handle hundreds or even thousands of distinct complex business flow combinations daily.

II. Why Refactor?

As the business enters a fast-paced iteration and experimentation phase, the growing scale of operations clashes sharply with the original underlying architecture, forcing us to initiate this large-scale refactoring. The core drivers stem from three critical pain points:

1. Business models urgently need upgrading—the old architecture cannot support exploratory business

With increasing richness and complexity in business interactions, the old data model lacks scalability, resulting in “siloed” feature development. Almost every new business format requires writing new code.

2. Severe code degradation dragging down iteration efficiency

For years, we’ve used a “package-by-demand” development pattern, lacking proper engineering layering. Controller code and other complex logic are mixed together in single packages, creating severe “spaghetti code.” At 310,000 lines, this deep technical debt makes daily development extremely fragile: one small change triggers a chain reaction, which frustrates frontline developers and severely bottlenecks delivery speed.

3. Collaboration risks amplified—lack of standardized AI coding accelerates system decay

Within about a year, team size tripled, with diverse technical backgrounds including high-concurrency systems, machine learning offline training, backend management, and interns—all with limited experience in developing complex business systems. In this environment of high turnover and cross-technical-stack collaboration, combined with over 90% of code being AI-assisted, failing to establish strict architectural standards would inevitably lead to uncontrollable system decay and new technical debt.

Thus, we needed not just engineering refactoring—but refactoring designed specifically for AI coding practices. Only with clear standards can we eliminate legacy technical debt and prevent new debt from accumulating.

III. Refactoring Timeline and Execution Path

Image 1

Phase One: Define Problems, Leverage AI to Map Technical Debt (Started February 2026)

Under intense demand pressure, tackling technical debt faces a stark reality: the volume is too large to fully review or comprehend.

With a codebase exceeding 310,000 lines, manually reading every line to build a reliable global understanding is unrealistic. Our codebase also exhibits typical high-risk characteristics: incomplete documentation, hidden logic, and historical compatibility branches buried in details. A seemingly simple interface may hide an extremely long call chain. The biggest challenge in mapping technical debt is therefore that human effort alone cannot quickly enumerate or penetrate these intricate dependencies: any single piece of code is readable, but no one can quickly trace every call chain across 310,000 lines.

We adopted a method better suited for complex systems: expert-guided targeting + AI-assisted scanning.

Instead of manual traversal, we had core developers define high-risk boundaries, then delegated the exhaustive search and scanning work to AI. Through this approach, we quickly identified P0/P1 level technical debts at the system’s core (e.g., business model flaws, database query performance issues, state management debt, index-related debt).

Our key takeaway here was: AI excels at helping us “see everything,” but determining which problems are most critical—and which should be prioritized—still requires human judgment. Specifically, humans define P0/P1 issues and priorities; AI performs exhaustive scans within those scopes—such as identifying business model issues, locating performance bottlenecks in large datasets, and auditing state management and indexing-level technical debt.
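The division of labor described above can be sketched in a few lines of Python. This is only an illustration: the article does not describe its tooling, so `ScanScope`, `Finding`, `build_scan_prompt`, and `triage` are hypothetical names, and the paths and concerns are invented sample data. The point it shows is that humans supply the scope and the priority, while the AI is asked to enumerate exhaustively inside that scope.

```python
from dataclasses import dataclass


@dataclass
class ScanScope:
    """A high-risk boundary chosen by a core developer (hypothetical structure)."""
    name: str
    priority: str   # "P0" or "P1", assigned by a human
    paths: list     # packages or directories the AI should traverse
    concerns: list  # what to look for inside that boundary


@dataclass
class Finding:
    """One issue reported back by the AI scan (hypothetical structure)."""
    scope: str
    location: str
    summary: str


def build_scan_prompt(scope):
    """Turn a human-defined scope into an exhaustive-scan instruction."""
    paths = ", ".join(scope.paths)
    concerns = "; ".join(scope.concerns)
    return (
        f"Trace every call chain that starts in: {paths}. "
        f"Report each occurrence of: {concerns}. "
        "List the file, method, and full call path for every hit."
    )


def triage(findings, scopes):
    """Order AI findings by the human-assigned priority of their scope."""
    rank = {s.name: (0 if s.priority == "P0" else 1) for s in scopes}
    return sorted(findings, key=lambda f: rank.get(f.scope, 2))


# Invented sample data, for illustration only.
scopes = [
    ScanScope("state-management", "P1", ["task/state"],
              ["status updated outside the state machine"]),
    ScanScope("query-performance", "P0", ["task/query"],
              ["queries inside loops", "missing index usage"]),
]
findings = [
    Finding("state-management", "TaskService.close", "status written directly"),
    Finding("query-performance", "TaskQuery.list", "query executed per row"),
]

for finding in triage(findings, scopes):
    print(finding.scope, finding.location)
```

The priority never comes from the scan itself. It is attached to the scope by a person and only used to order what the AI reports.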

The ROI from this step was exceptionally high. With minimal resource investment, we mapped three P0 and two P1 technical debts. But what surprised us most was this:

Using AI assistance, engineers quickly pinpointed 10 deeply hidden performance issues that were nearly impossible to detect by eye. They were buried deep within complex call chains that even seasoned engineers could not trace exhaustively by hand, so pure manual code review would almost certainly have missed them.

This result forced us to reconsider the definition of “experience.” Previously, “seeing everything” was the core advantage of senior engineers—you needed to spend three years immersed in the system to develop a holistic sense of call chains, implicit dependencies, and historical compatibility logic. But AI has lowered the barrier to “seeing everything” to near zero. The value of experience is shifting from “seeing everything” to “judging what matters”—this is where humans remain irreplaceable.

This insight profoundly influenced our subsequent steps: only when problem definitions are crystal clear can later standardization, layering, and migration avoid becoming aimless efforts.

Image 2

Phase Two: Research and Establish AI-Friendly Development Standards (Completed late February 2026)

Through technical debt mapping, we clarified *where* to refactor. Next came the question: *how should code be written?* Given that 90% of code relies on AI coding, the central challenge becomes: How to elevate the experience of one or two proficient AI users into high-quality, team-wide standards?

#### Why the value of standards has been amplified

In traditional development, coding standards primarily aid team collaboration, code reviews, and onboarding new members. But once AI becomes the primary coding engine, the role of standards undergoes a fundamental shift. Large language models strongly depend on current context and existing code patterns. If the codebase lacks consistency and team members interpret standards differently, AI won’t self-correct—it will amplify differences, leading to “a thousand people, a thousand styles” in collaborative output. Therefore, in the era of AI coding, development standards have evolved into infrastructure that constrains AI output and prevents further accumulation of technical debt—not merely a set of collaboration guidelines.

#### Managing AI Coding Using an Agent Evaluation Framework

But simply making AI follow standards isn’t enough—AI can only execute inputs, not replace team consensus. If team members haven’t first aligned on layering principles, modeling approaches, and dependency boundaries, the same standard will be interpreted differently by different people.

This problem reminded us of our own core business. Our team runs agent evaluation, and through long-term practice, we’ve developed a core philosophy:

  • Standard Alignment (Human Alignment): A strong, centralized figure must align evaluation standards across product, operations, algorithm, QA, and other roles—a “dictator” is better than ten “democrats.”
  • Human-AI Alignment: After standard alignment, optimize model selection and evaluation metrics to achieve human-AI alignment—only when agreement reaches a baseline threshold (e.g., 90%) can we trust machine-generated evaluations.
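As a rough illustration of the "baseline threshold" idea, the sketch below measures how often a machine judge agrees with human labels and only trusts the judge above a cutoff. It assumes plain percentage agreement over the same items. The article does not say which agreement metric the team uses, and the function names and sample labels are invented here.

```python
def agreement_rate(human_labels, machine_labels):
    """Fraction of items where the machine verdict equals the human verdict."""
    if len(human_labels) != len(machine_labels):
        raise ValueError("both label lists must cover the same items")
    if not human_labels:
        raise ValueError("at least one labeled item is required")
    matches = sum(h == m for h, m in zip(human_labels, machine_labels))
    return matches / len(human_labels)


def can_trust_machine(human_labels, machine_labels, threshold=0.9):
    """Trust machine-generated evaluations only at or above the threshold."""
    return agreement_rate(human_labels, machine_labels) >= threshold


# Invented sample: ten items labeled by humans and by a judge model.
human = ["pass"] * 10
machine = ["pass"] * 9 + ["fail"]

print(agreement_rate(human, machine))
print(can_trust_machine(human, machine))
```

Raw agreement is the simplest choice. A chance-corrected statistic could be substituted without changing the gating logic.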

We realized that managing AI coding shares the exact same foundational logic as evaluating agents. First, align team engineering standards via norms (human alignment); second, constrain LLM outputs using AI Rules and Skills (human-AI alignment). A team focused on agent evaluation uses evaluation thinking to solve engineering governance challenges.

The sequence is crucial: first human alignment, then human-AI alignment. Many teams think configuring AI Rules is sufficient—but the real bottleneck lies in people, not tools. Without team consensus, even the best AI Rules will be interpreted differently by different individuals. Human consensus is the prerequisite for effective AI constraint.

#### Transforming Standards into Executable AI Constraints

We first studied mature development standards from industry peers and adapted them to our internal workflows, distilling a set of AI-friendly engineering constraints—including architectural layering rules, business domain modeling conventions, and storage layer regulations. The critical step? We didn’t leave these standards as static documents. Instead, we implemented them as always-on AI Rules, embedded directly into the AI coding process and pre-integrated into the pre-CR (pre-code review) stage, enabling basic compliance checks before submission.

Meanwhile, for areas prone to disagreement—especially responsibility boundaries—we established team-wide consensus around the distinction between “orchestration-type” and “capability-type” components, and codified this understanding into progressively loaded Skills during coding.
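A pre-CR compliance check of the kind described above could look roughly like the following sketch. Python is used for brevity and simply scans Java-style `import` lines. The allowed-dependency table, the package names, and the "class name ends in `PO`" convention are all assumptions made up for illustration, not the team's actual AI Rules.

```python
import re

# Hypothetical rule table: which layers each layer may import from.
ALLOWED = {
    "starter": {"application", "common"},
    "application": {"infrastructure", "common"},
    "infrastructure": {"common"},
    "common": set(),
}

# Matches "import a.b.C;" and "import static a.b.C;" (wildcard imports are ignored).
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)\s*;", re.MULTILINE)
LAYER_RE = re.compile(r"\.(starter|application|infrastructure|common)\.")


def layer_of(qualified_name):
    """Return the layer segment found in a fully qualified name, if any."""
    match = LAYER_RE.search(qualified_name)
    return match.group(1) if match else None


def check_file(file_layer, java_source):
    """Return layering violations for one source file that lives in file_layer."""
    violations = []
    for imported in IMPORT_RE.findall(java_source):
        target = layer_of(imported)
        if target and target != file_layer and target not in ALLOWED[file_layer]:
            violations.append(f"{file_layer} must not import {target}: {imported}")
        if imported.endswith("PO") and file_layer != "infrastructure":
            violations.append(f"PO leaked outside infrastructure: {imported}")
    return violations


# Invented sample file from the Starter layer.
sample = (
    "package com.example.eval.starter.web;\n"
    "import com.example.eval.infrastructure.po.TaskPO;\n"
    "import com.example.eval.application.task.TaskService;\n"
)

for violation in check_file("starter", sample):
    print(violation)
```

A check like this only covers the mechanical part of a standard. Judgment calls, such as the boundary between orchestration-type and capability-type components, still rest on the team consensus described above.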

Image 3
Image 4

Phase Three: Establish SOPs—Gradual Refactoring Through Iterations (March–April 2026)

#### Action 1: 100% AI-Assisted Refactoring of Architecture and Decoupling

We migrated the legacy “package-by-demand” spaghetti code into a standardized four-layer architecture (Starter / Application / Infrastructure / Common) and a new structure organized by business domains. This was not just directory restructuring. It was a systematic effort to address long-standing deep coupling in the legacy code, especially low-level PO (Persistent Object) data leaking upward and propagating through the entire call chain.

To tackle this, we executed three steps:

  1. Complete the business object and data transformation layer, consolidating scattered conversion logic.
  2. Rebuild interface contracts in the Application layer, strictly blocking low-level data objects from leaking upward.
  3. Fix upstream parameter dependencies based on the new contract.
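The three steps above can be pictured with a small sketch. It is written in Python for brevity, and every class and field name in it (`TaskPO`, `Task`, `TaskRepository`, `TaskApplicationService`, the status codes) is hypothetical. The idea it shows is a single conversion function plus an Application-layer contract that only ever returns business objects.

```python
from dataclasses import dataclass


@dataclass
class TaskPO:
    """Persistence object: mirrors a database row (hypothetical columns)."""
    id: int
    status_code: int
    assignee_id: int
    ext_json: str


@dataclass(frozen=True)
class Task:
    """Business object exposed by the Application-layer contract."""
    task_id: int
    status: str
    assignee_id: int


STATUS_NAMES = {0: "pending", 1: "in_progress", 2: "done"}


def to_business_object(po):
    """Step 1: the single home for PO to business object conversion."""
    return Task(
        task_id=po.id,
        status=STATUS_NAMES.get(po.status_code, "unknown"),
        assignee_id=po.assignee_id,
    )


class TaskRepository:
    """Infrastructure layer: the only place that handles TaskPO directly."""

    def __init__(self, rows):
        self._rows = {po.id: po for po in rows}

    def find(self, task_id):
        return self._rows[task_id]


class TaskApplicationService:
    """Step 2: the contract returns Task, never TaskPO."""

    def __init__(self, repository):
        self._repository = repository

    def get_task(self, task_id):
        return to_business_object(self._repository.find(task_id))


# Step 3: upstream callers depend only on the new contract.
service = TaskApplicationService(TaskRepository([TaskPO(7, 1, 42, "{}")]))
task = service.get_task(7)
print(task)
```

Because storage-only fields such as `ext_json` never appear on `Task`, upstream code cannot start depending on them by accident.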
Image 5

This type of refactoring features well-defined rules but spans vast areas and involves intensive repetitive work. Our approach: first, let the lead refactoring engineer personally migrate the two most complex packages, during which they distilled a standardized, AI-executable SOP. With this SOP, refactoring no longer depended on individual expertise—other team members could simply guide AI to complete remaining package migrations, while focusing their own efforts on semantic validation and code reviews. Using the “lead engineer sets example → SOP distributed → full team parallel execution” model, we rapidly completed the structural migration of over ten core packages.

Image 6

#### Action 2: Zero-Dedicated-Time Refactoring—Gradually Refactor Business Models via Business Requirements

This was the core challenge of the refactoring. In the industry, refactoring typically follows two paths: either a complete rewrite or applying for dedicated time. We took a third route—breaking down technical debt into “side actions” within regular business iterations, digesting it incrementally without requesting a single day of dedicated refactoring time.


The specific approach is to break down technical debt into daily high-priority requirements. For example, during a core feature iteration, we seamlessly designed and implemented a new business model; during another feature upgrade, we introduced a brand-new quality inspection model, completing full migration by late March (successfully supporting multiple business pipelines, as well as complex cross-validation across various views and regions).

The difficulty lies in the precision of decomposition. Determining which business needs can “naturally absorb” which technical debts requires careful case-by-case evaluation: refactoring must not slow down business delivery, and business demands must not bypass existing technical debt and pile more on top. Ultimately, we achieved a smooth upgrade of the core data model without interrupting ongoing business delivery.
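The matching problem can be made concrete with a first-pass filter like the one below. This is a hypothetical heuristic, not the team's method: it assumes a debt item is a candidate "side action" when it touches modules the requirement already changes and its effort stays within an invented overhead budget (`max_overhead`). All names and numbers are illustrative, and the final case-by-case call stays with a person.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    name: str
    modules: set
    effort_days: float


@dataclass
class Requirement:
    name: str
    modules: set
    estimate_days: float


def absorbable(debt, requirement, max_overhead=0.2):
    """True if the debt overlaps the requirement's modules and fits the budget."""
    overlaps = bool(debt.modules & requirement.modules)
    fits = debt.effort_days <= requirement.estimate_days * max_overhead
    return overlaps and fits


def propose_plan(debts, requirements, max_overhead=0.2):
    """Map each debt item to the requirement with the largest module overlap."""
    plan = {}
    for debt in debts:
        candidates = [r for r in requirements if absorbable(debt, r, max_overhead)]
        if candidates:
            best = max(candidates, key=lambda r: len(debt.modules & r.modules))
            plan[debt.name] = best.name
    return plan


# Invented sample data.
debts = [
    DebtItem("new quality inspection model", {"inspection", "task"}, 2.0),
    DebtItem("rewrite reporting module", {"report"}, 15.0),
]
requirements = [
    Requirement("feature upgrade A", {"inspection", "task", "view"}, 12.0),
    Requirement("feature B", {"task"}, 3.0),
]

print(propose_plan(debts, requirements))
```

The sketch does not stop several debt items from stacking onto one requirement, which is one reason its output is a proposal to review rather than a plan.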

#### Action 3: Refactoring Quality Assurance

1. Building AI CR and Pre-PR Mechanisms

As AI coding efficiency saw exponential growth, we quickly hit the "bottleneck effect": Code Review (CR) became the most congested link in the entire pipeline. While AI drastically reduced coding time, pressure systematically shifted downstream to the CR stage. If CR efficiency doesn’t improve, the productivity gains from AI coding will be entirely eroded by the CR bottleneck.

Our team reached a consensus:

  • The value of human CR should shift from “Did you write it correctly?” to “Are we solving the right problem under the right constraints?”
  • Let AI handle rule-based issues and perform initial screening of business logic.
  • Humans focus on early-stage technical design reviews, ensuring final code implementation aligns with the technical plan and identifying any logical flaws.

Our practical experience:

  1. Introducing a Pre-PR (pre-review) mechanism:
     • Before submitting code, developers must use AI to conduct multiple rounds of self-checks, fixing all issues detectable by AI (including style, bugs, exception handling, consistency, extensibility, and performance).
     • After confirmation, submit a standardized PR document (highlighting key changes, impact scope, and critical business logic for review; AI generates this from the code changes using a template).
     • This ensures reviewers receive high-quality code that has already filtered out basic compliance errors, allowing them to focus solely on core business semantics, significantly reducing cognitive load.
  2. High-tier models reviewing low-tier models: Use high-performance models as Judge Models to evaluate outputs from lower-tier models.
  3. Cross-vendor model adversarial review: Use models from different vendors to audit each other’s code output. Leveraging diverse model capabilities creates complementary coverage, resulting in broader real-world CR coverage.
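To illustrate why cross-vendor review widens coverage, the sketch below merges findings from several reviewers and separates issues reported by more than one of them from issues only one caught. It assumes each finding has already been normalized to a `(file, line, rule)` tuple, which real free-text model output would not be without extra work. The reviewer names and findings are invented.

```python
from collections import defaultdict


def merge_reviews(reviews):
    """Map each distinct finding to the set of reviewers that reported it."""
    found_by = defaultdict(set)
    for reviewer, findings in reviews.items():
        for finding in findings:
            found_by[finding].add(reviewer)
    return dict(found_by)


def split_by_coverage(found_by):
    """Return (findings seen by several reviewers, findings seen by only one)."""
    shared = sorted(f for f, who in found_by.items() if len(who) > 1)
    unique = sorted(f for f, who in found_by.items() if len(who) == 1)
    return shared, unique


def ready_for_human_review(found_by):
    """Pre-PR gate: hand off to a human only when no AI finding remains open."""
    return len(found_by) == 0


# Invented sample: two reviewers from different vendors.
reviews = {
    "vendor_a_model": [("TaskService.java", 88, "missing-null-check"),
                       ("TaskQuery.java", 40, "query-in-loop")],
    "vendor_b_model": [("TaskService.java", 88, "missing-null-check"),
                       ("TaskMapper.java", 12, "po-leak")],
}

merged = merge_reviews(reviews)
shared, unique = split_by_coverage(merged)
print(len(shared), len(unique), ready_for_human_review(merged))
```

The findings in `unique` are the complementary coverage: each would have been missed had only the other reviewer been used.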
Image 7

2. Research and Benchmarking: Establishing AI-Assisted Test Case Generation Standards

Our team practices 100% RD-as-QA—developers also serve as testers. In exploring AI-assisted self-testing, two paths naturally emerged: Path A: Let AI fully generate test cases, with humans only doing final validation. Path B: Humans define testing scope and risk levels, while AI handles code scanning and fills in test step details.

After practice, Path A quickly revealed serious engineering issues—AI lacks holistic business understanding, heavily relies on PRD quality, often misses high-risk scenarios involving hidden dependencies, and generates large volumes of low-value edge cases, increasing review burden. After consulting professional QA teams, we confirmed Path B (human-led, AI-assisted) as the correct direction, and codified it into a Human-in-the-loop testing SOP:

| Step | Objective | What Humans Do | What AI Does | AI Efficiency Gain |
|------|-----------|----------------|--------------|--------------------|
| Step 1: Define Scope | Identify which interfaces to test | Review and confirm final test scope, eliminate false positives | Automatically collect affected interface lists via bidirectional scan of traffic monitoring + code changes | Saves time manually searching through code; prevents omissions |
| Step 2: Risk Classification | Determine depth of testing per interface | Assess risk level based on AI-provided info, decide testing depth | Analyze code and answer three questions: How much changed? Where are branches? Is old data compatible? | Reduces “code reading for risk assessment” from hours to minutes |
| Step 3: Design Grouping | Minimize number of test cases | Review grouping rationality, add special business scenarios | Apply decision table method (“split then merge”) to auto-generate minimal test case combinations | AI computes faster and with fewer errors than humans in combinatorial explosion scenarios |
| Step 4: Generate Steps | Write executable test steps | Verify steps match actual changes, add boundary conditions | Expand using “one action, multi-dimensional validation” template, generate steps matching change level | Batch-generates structured test cases; humans only need to review, not write from scratch |
| Step 5: Validate Coverage | Ensure no gaps or overcoverage | Final confirmation that coverage matrix has no blind spots | Auto-generate interface × dimension coverage matrix, flag uncovered items | Cross-verification via manual checks easily misses items; AI comparison is zero-cost |
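Step 5 in the table is the easiest to picture in code. The sketch below builds an interface × dimension coverage matrix from a list of test cases and flags the uncovered cells for the human's final confirmation. The interface names, dimension names, and the `(interface, dimension)` representation of a test case are assumptions made for illustration.

```python
def coverage_matrix(interfaces, dimensions, test_cases):
    """Build {interface: {dimension: covered?}} from (interface, dimension) pairs."""
    covered = set(test_cases)
    return {
        interface: {dim: (interface, dim) in covered for dim in dimensions}
        for interface in interfaces
    }


def uncovered(matrix):
    """List every (interface, dimension) cell that no test case touches."""
    return [
        (interface, dim)
        for interface, row in matrix.items()
        for dim, is_covered in row.items()
        if not is_covered
    ]


# Invented sample data.
interfaces = ["createTask", "submitResult"]
dimensions = ["happy_path", "branch", "old_data_compat"]
test_cases = [
    ("createTask", "happy_path"),
    ("createTask", "branch"),
    ("submitResult", "happy_path"),
    ("submitResult", "old_data_compat"),
]

matrix = coverage_matrix(interfaces, dimensions, test_cases)
for gap in uncovered(matrix):
    print("not covered:", gap)
```

Whether a flagged gap needs a new test case or is an acceptable omission remains the human decision that Step 5 describes.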

IV. Summary

Key Insights We’ve Learned

  1. Manage AI Coding Using an Agent Evaluation Framework: Our team leads the agent evaluation business. Through practice, we’ve distilled the core principle “Human Alignment → Human-AI Alignment.” The underlying logic for managing AI coding mirrors this exactly. First, align the team around a shared understanding (human alignment), then formalize that consensus into executable constraints for AI (human-AI alignment). Reverse the order, and even the best AI Rules become empty promises.
  2. AI Redefines the Value Boundary of “Experience”: With AI tools, engineers identified 10 performance risks within days, a kind of global awareness that previously took three years of experience to develop. Now everyone has access to global code awareness. The value of experience is shifting from “seeing everything” to “judging what matters,” which is where human judgment remains irreplaceable.
  3. Technical Debt Can Be Digested Iteratively Like Business Requirements: Refactoring doesn’t require dedicated time slots; it requires the ability to decompose. We didn’t request a single day for refactoring. Instead, the technical debt in a 310,000-line codebase was gradually digested during regular business delivery. The key is whether you can decompose technical debt into natural side effects of business tasks. This demands strong technical judgment, but once operational, refactoring ceases to be a zero-sum game against business priorities.
  4. The Engineer’s Role Has Evolved: When 90% of code is generated by AI, team members should shift focus from “writing code” to “designing and maintaining an engineering environment where AI reliably produces high-quality code.”
Image 8

Action Guide: If Your Team Wants to Implement This

  • Step 1: Audit your technical debt. Don’t try manual traversal—have core developers identify high-risk areas, then let AI perform exhaustive scans. Low investment, high payoff: establishes global situational awareness.
  • Step 2: Define standards and embed them into AI Rules and Skills. First, align the team on layering principles, modeling approaches, dependency boundaries, etc., then solidify these into AI Rules that are always loaded during coding. Without embedding standards into the AI toolchain, they remain empty guidelines.
  • Step 3: Have the lead engineer prototype and document a reusable migration SOP. Avoid letting everyone reinvent the wheel. Let one person run through a complete migration of one module, document the steps as an AI-executable SOP, then scale it across the team.
  • Step 4: Establish a Pre-PR mechanism. Require every submission to undergo AI self-check based on team standards before PR, filtering out basic issues so human CR focuses only on business semantics. After AI accelerates coding, CR becomes the new bottleneck—this step cannot be skipped.

Business Development Platform, Algorithms, AI Coding, Large Models, Agent

#Let's Discuss

If you spot errors or have questions about the content, follow Meituan Tech Team’s WeChat Official Account (meituantech) and leave us a message in the backend.
