5 seconds to edit a 3D scene: Peking University, HKUST & Shanghai AI Lab develop VGGT-Edit, up to 120 times faster!

2026-05-27 17:01:54 Source: Quantum Bit

No More Going Back to 2D

Contributed by the VGGT-Edit team

Quantum Bit | Official Account QbitAI

3D models can now "see" the world, but they still can't "change" it.

From NeRF to 3D Gaussian Splatting, and on to models like VGGT and π³, feedforward 3D reconstruction has advanced rapidly: a complete 3D scene can now be rebuilt from just a few images in a few seconds.

But this is exactly where the problem lies. These models can understand the 3D world, but they don't know how to modify it. You can ask one to rebuild a room, but it is hard to tell it:

Move the chair to the window, remove the middle chair, change the gray leather sofa to a white fluffy sofa.

Worse, existing methods often fail on complex edits: a chair disappears from some angles and reappears from others, or the background changes shape even though nothing in it was modified.

To address this challenge, a research team from institutions including Peking University, Hong Kong University of Science and Technology, Shanghai AI Lab, and Nanyang Technological University (NTU) proposed a native 3D editing framework called VGGT-Edit.

The core idea is simple:

Don't go back to 2D; edit directly in 3D space.

On the DeltaScene test set, VGGT-Edit outperformed existing methods in semantic consistency, multi-view stability, and inference speed. A single edit takes only about 5 seconds, a speedup of up to 120 times over existing methods.

**The Problem Has Always Been in 2D**

Currently, most methods for 3D editing essentially rely on "2D thinking"—breaking down the scene into multiple 2D images, editing them one by one, and then stitching them back together to form a 3D scene.

However, since each view is processed independently, several issues arise:

  • The chair is already removed in one view;
  • When viewed from another angle, the chair reappears;
  • The background drifts;
  • Shadows and flickering appear around object edges.

![Image 1](https://pic-out.zhimg.com/v2-63644016ebf2dbcb19c8fa3f89db6aa1~resize:1440:q75.png?animatedImageAutoPlay=false&animatedImagePlayCount=1&auth_key=1779872424-0-0-ef2f2c8f5c21ab7ce2d499c0d1de9623&bizSceneCode=article_draft&expiration=1779872424&incremental=false&mid=36f69162230003d316d0b8a6d8da20ba&overTime=60&precoder=false&protocol=v2&retryCount=3&sampling=false&sceneCode=editor_copy_outbound&source=bfcaadb1)

Comparison of 3D Editing Methods

Many results look more like images from different angles crudely pasted together than like a truly stable 3D space.

For applications like robotics, AR/VR, and spatial intelligence, this is almost a fatal issue—these scenarios need a consistent 3D world, not just one view looking correct.

**Native 3D Editing, Moving From Concept to Usability**

The core idea behind VGGT-Edit is straightforward: since the problem comes from 2D, don't go back to 2D.

The entire framework is built on top of a VGGT-like feedforward reconstruction model, inheriting its fast and efficient 3D representation capabilities. Rather than regenerating the entire scene, the team proposed a clever mechanism:

Residual Field Prediction (RFP).

Image 2

Simply put, the model retains the original scene's stable 3D structure while learning only where changes are needed, such as:

  • Moving a chair to the right;
  • Changing the material of the sofa;
  • Deleting an object;
  • Adding new furniture.

These changes are represented as: New Scene = Original Scene + Local Residual Changes

This design has a significant advantage: because most areas don't need to change, the model doesn't have to regenerate the entire world. It modifies only the local parts, so the unedited background stays very stable.

This is one of the main differences between VGGT-Edit and many existing methods.
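The composition rule above can be illustrated with a minimal NumPy sketch. It assumes the scene is stored as an array of 3D points and that the edit region is given as a boolean mask; the article does not say how VGGT-Edit represents either, and the names `apply_residual`, `points`, `residual`, and `edit_mask` are hypothetical.

```python
import numpy as np

def apply_residual(points, residual, edit_mask):
    """Compose an edited scene as original + local residual.

    points:    (N, 3) original 3D point positions (assumed representation)
    residual:  (N, 3) predicted per-point offsets
    edit_mask: (N,) boolean array, True where an edit is allowed
    """
    # Zero the residual outside the edit region so the background is untouched.
    local_residual = residual * edit_mask[:, None]
    return points + local_residual

# Toy scene: 5 points, of which only the last two belong to the "chair".
points = np.arange(15, dtype=np.float64).reshape(5, 3)
edit_mask = np.array([False, False, False, True, True])
# Hypothetical prediction: shift by 0.5 units along x.
residual = np.tile(np.array([0.5, 0.0, 0.0]), (5, 1))

edited = apply_residual(points, residual, edit_mask)
```

In this toy case the three background points come out identical to the input, while the two "chair" points move by the predicted offset.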

**Text Semantics Truly Aligned with 3D Space for the First Time**

The research team found that simply feeding a sentence into the model, with no further processing, often leaves the model knowing what to change but not where to change it.

To solve this problem, the team designed a key mechanism for VGGT-Edit:

Depth-Synchronized Text Injection (DSTI)

Essentially, it means keeping text semantics and 3D space features synchronized at the same depth level throughout the process.

Traditional methods usually inject text information only once at the beginning, but VGGT-Edit continuously fuses text semantics at multiple critical layers, ensuring the model always knows:

  • Which region should be modified;
  • What the modification target is;
  • Where the spatial position is located.
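The difference between injecting text once and fusing it at several layers can be sketched as follows. This is only an illustration: the article does not name the fusion operator, so cross-attention is an assumption here, and `inject_text`, `backbone`, and the `tanh` stand-in for a transformer block are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject_text(scene_tokens, text_tokens):
    """Cross-attention (assumed operator): scene tokens read from text tokens."""
    d = scene_tokens.shape[-1]
    attn = softmax(scene_tokens @ text_tokens.T / np.sqrt(d))  # (S, T)
    return scene_tokens + attn @ text_tokens                   # residual connection

def backbone(scene_tokens, text_tokens, layer_weights, inject_at):
    """Run a toy backbone, fusing text after every layer index in `inject_at`."""
    x = scene_tokens
    for i, w in enumerate(layer_weights):
        x = np.tanh(x @ w)  # stand-in for a transformer block
        if i in inject_at:
            x = inject_text(x, text_tokens)
    return x

rng = np.random.default_rng(0)
scene = rng.normal(size=(6, 8))  # 6 scene tokens, width 8
text = rng.normal(size=(3, 8))   # 3 text tokens, width 8
weights = [rng.normal(size=(8, 8)) * 0.3 for _ in range(4)]

once = backbone(scene, text, weights, inject_at={0})           # single early injection
multi = backbone(scene, text, weights, inject_at={0, 1, 2, 3}) # injection at every layer
```

With a single early injection, the text signal has to survive every later block; with repeated injection, each depth gets direct access to the instruction.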

In addition, the team designed a "view importance weighting" scheme, because not all views are equally reliable: some angles may be occluded, and some views may show only half of an object.

VGGT-Edit automatically determines which view is more trustworthy, ultimately making multi-view editing results more stable.
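One simple way to express this kind of weighting is a softmax over per-view reliability scores. The sketch below assumes such scores are already available; how VGGT-Edit actually computes them is not described in the article, and `fuse_views` and the example values are hypothetical.

```python
import numpy as np

def fuse_views(per_view_features, view_scores):
    """Weighted average of per-view features using softmax weights.

    per_view_features: (V, D) one feature vector per view
    view_scores:       (V,) higher means the view is judged more reliable
    """
    shifted = view_scores - view_scores.max()  # numerical stability
    weights = np.exp(shifted) / np.exp(shifted).sum()
    fused = weights @ per_view_features        # (D,)
    return fused, weights

features = np.array([[1.0, 0.0],   # clear view
                     [0.9, 0.1],   # clear view
                     [0.0, 1.0]])  # occluded view with unreliable evidence
scores = np.array([2.0, 2.0, -2.0])  # hypothetical reliability scores

fused, weights = fuse_views(features, scores)
```

Here the occluded view receives a much smaller weight than the two clear views, so the fused feature stays close to what the reliable views agree on.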

**A Dedicated Editing Head for 3D Editing Tasks**

Beyond the overall framework, VGGT-Edit has another crucial component: an editing head designed specifically for 3D editing tasks.

The research team found that in VGGT-like models, the original reconstruction head focuses on "how to recover the scene." 3D editing, however, requires modifying only local regions while keeping the rest of the scene stable.

VGGT-Edit therefore adds an extra editing branch that predicts local changes within the scene.

This editing head directly acts on the 3D representation space and outputs corresponding residual field changes. Essentially, it learns:

  • Which regions should remain unchanged;
  • Which regions require editing;
  • How to maintain multi-view consistency after editing.

Compared with regenerating the entire scene, this approach is more stable and efficient. It is a crucial step in giving VGGT-like feedforward reconstruction models editing capabilities.
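A toy version of such an editing branch might look like the sketch below. The gate-times-offset structure and the zero-initialized offset weights are assumptions chosen for illustration, not details from the paper; the class name `EditHead` and all shapes are hypothetical.

```python
import numpy as np

class EditHead:
    """Toy editing branch: per-point features -> gated 3D offset (residual field)."""

    def __init__(self, feat_dim, rng):
        self.w_gate = rng.normal(size=(feat_dim, 1)) * 0.1
        # Zero-initialized offset weights (an assumption): before any training,
        # the head predicts no change, so the scene starts out unedited.
        self.w_delta = np.zeros((feat_dim, 3))

    def __call__(self, features):
        gate = 1.0 / (1.0 + np.exp(-(features @ self.w_gate)))  # (N, 1), in (0, 1)
        delta = features @ self.w_delta                          # (N, 3)
        return gate * delta                                      # residual field

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 16))  # 4 points with 16-dim features
points = rng.normal(size=(4, 3))     # stand-in for the reconstruction head output

head = EditHead(feat_dim=16, rng=rng)
edited = points + head(features)
```

The gate plays the role of "which regions require editing," and the offset plays the role of "what the change is"; with the untrained head above, the edited scene equals the reconstructed one.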

**A 100,000-Sample Dataset Built Specifically for "3D Editing"**

To train VGGT-Edit, the team created a new 3D editing dataset called DeltaScene, with approximately 100,000 samples covering scenarios such as living rooms, offices, residential areas, and commercial spaces.

![Image 3](https://pic-out.zhimg.com/v2-7e5feb8d2a4831ddf258fc3cfaf0e0f2~resize:1440:q75.png?animatedImageAutoPlay=false&animatedImagePlayCount=1&auth_key=1779872424-0-0-4b6d8a8cec389e568ffd1c804414b27b&bizSceneCode=article_draft&expiration=1779872424&incremental=false&mid=36f69162230003d316d0b8a6d8da20ba&overTime=60&precoder=false&protocol=v2&retryCount=3&sampling=false&sceneCode=editor_copy_outbound&source=bfcaadb1)

Overview of the DeltaScene Dataset

Most importantly, the entire data generation process is highly automated.

The team used Qwen3.5-Plus, SAM3, and Qwen-Image-Editing-Max to automate instruction generation, target recognition, multi-view editing, and 3D consistency filtering, ultimately producing training data that is geometrically consistent across views.

![Image 4](https://pic-out.zhimg.com/v2-e30921ba2d686182efbc76aadafb0856~resize:1440:q75.png?animatedImageAutoPlay=false&animatedImagePlayCount=1&auth_key=1779872424-0-0-86a96cd8db1950ca2066549d86c0d804&bizSceneCode=article_draft&expiration=1779872424&incremental=false&mid=36f69162230003d316d0b8a6d8da20ba&overTime=60&precoder=false&protocol=v2&retryCount=3&sampling=false&sceneCode=editor_copy_outbound&source=bfcaadb1)

Process of Constructing the DeltaScene Dataset

For native 3D editing, this step is particularly important—the model needs to learn not just "image changes," but how the same edit maintains spatial consistency across different views.

**3D Editing, Finally Approaching Real-Time Interaction**

The results suggest that this approach works.

On the DeltaScene test set, VGGT-Edit outperformed existing methods in terms of semantic consistency, multi-view stability, and inference speed.

Especially in complex tasks such as adding furniture, adjusting positions, and changing materials, many traditional methods still show noticeable texture artifacts and geometric drift, whereas VGGT-Edit's results look much more like a real, stable 3D space.

![Image 5](https://pic-out.zhimg.com/v2-ed88f293d851af060b01945fa6e73ceb~resize:1440:q75.png?animatedImageAutoPlay=false&animatedImagePlayCount=1&auth_key=1779872424-0-0-feccc769c3b0dd452e5e301d3fa19ea6&bizSceneCode=article_draft&expiration=1779872424&incremental=false&mid=36f69162230003d316d0b8a6d8da20ba&overTime=60&precoder=false&protocol=v2&retryCount=3&sampling=false&sceneCode=editor_copy_outbound&source=bfcaadb1)

Qualitative Comparison of Different 3D Editing Tasks

Speed matters even more. According to the paper, a single edit with VGGT-Edit takes only about 5 seconds, up to 120 times faster than traditional methods, which often require lengthy optimization.

This means 3D editing is finally approaching real-time interaction.

For applications such as robotics, digital twins, and AR/VR, this change is crucial: only when editing is fast enough can the 3D world become truly interactive.

![Image 6](https://pic-out.zhimg.com/v2-da8145d00b486b590e14c3552dec072e~resize:1440:q75.png?animatedImageAutoPlay=false&animatedImagePlayCount=1&auth_key=1779872424-0-0-36bd2d662636ffa0e62d109885e32bd8&bizSceneCode=article_draft&expiration=1779872424&incremental=false&mid=36f69162230003d316d0b8a6d8da20ba&overTime=60&precoder=false&protocol=v2&retryCount=3&sampling=false&sceneCode=editor_copy_outbound&source=bfcaadb1)

Quantitative Results on the DeltaScene Dataset

**The Model Begins to Truly Understand "Spatial Changes"**

One interesting experiment in the paper involves a command the model had never seen during training: rotate the middle chair 90 degrees clockwise.

Surprisingly, the model successfully completed the edit.

![Image 7](https://pic-out.zhimg.com/v2-0c017a7fe3f06700c33d2eea2a9694c8~resize:1440:q75.png?animatedImageAutoPlay=false&animatedImagePlayCount=1&auth_key=1779872424-0-0-970246eb31c46cc49296e1fbddb6c8dc&bizSceneCode=article_draft&expiration=1779872424&incremental=false&mid=36f69162230003d316d0b8a6d8da20ba&overTime=60&precoder=false&protocol=v2&retryCount=3&sampling=false&sceneCode=editor_copy_outbound&source=bfcaadb1)

Generalizing to Unseen Instructions

This shows that VGGT-Edit has learned more than just fixed templates; it is truly beginning to understand how text semantics map to changes in 3D space.

And this may be even more important than "generating a 3D world" itself. For spatial intelligence, the key ability of the future might not be "generating a world" but being able to modify the world freely, stably, and in real time, as a human would.

VGGT-Edit takes a step in that direction.

_Paper link: https://arxiv.org/abs/2605.15186_

_Copyright © All rights reserved. Unauthorized reproduction or use in any form is strictly prohibited._
