Hugging Face Blog

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

> *   [1. The One Terabyte Problem](https://huggingface.co/blog/delta-weight-sync#1-the-one-terabyte-problem "1. The One Terabyte Problem")
> 
> *   [2. Why bf16 RL Weights Are Almost Always Sparse](https://huggingface.co/blog/delta-weight-sync#2-why-bf16-rl-weights-are-almost-always-sparse "2. Why bf16 RL Weights Are Almost Always Sparse")
> 
> *   [3. HF Buckets and the Architecture](https://huggingface.co/blog/delta-weight-sync#3-hf-buckets-and-the-architecture "3. HF Buckets and the Architecture")
>     *   [3.1 What is a Bucket?](https://huggingface.co/blog/delta-weight-sync#31-what-is-a-bucket "3.1 What is a Bucket?")
> 
>     *   [3.2 The Three Boxes](https://huggingface.co/blog/delta-weight-sync#32-the-three-boxes "3.2 The Three Boxes")
> 
> 
> *   [4. The Protocol](https://huggingface.co/blog/delta-weight-sync#4-the-protocol "4. The Protocol")
>     *   [4.1 Safetensors as the Wire Format](https://huggingface.co/blog/delta-weight-sync#41-safetensors-as-the-wire-format "4.1 Safetensors as the Wire Format")
> 
>     *   [4.2 The Trainer Side: a Boolean Mask From an Optimizer Hook](https://huggingface.co/blog/delta-weight-sync#42-the-trainer-side-a-boolean-mask-from-an-optimizer-hook "4.2 The Trainer Side: a Boolean Mask From an Optimizer Hook")
> 
>     *   [4.3 The vLLM Side: a 30 Line Extension](https://huggingface.co/blog/delta-weight-sync#43-the-vllm-side-a-30-line-extension "4.3 The vLLM Side: a 30 Line Extension")
> 
> 
> *   [5. Standing It Up on Spaces, For Real](https://huggingface.co/blog/delta-weight-sync#5-standing-it-up-on-spaces-for-real "5. Standing It Up on Spaces, For Real")
> 
> *   [6. So What Does This Actually Unlock?](https://huggingface.co/blog/delta-weight-sync#6-so-what-does-this-actually-unlock "6. So What Does This Actually Unlock?")
> 
> *   [7. What's Still on Our Plate](https://huggingface.co/blog/delta-weight-sync#7-whats-still-on-our-plate "7. What's Still on Our Plate")
> 
> *   [8. Try It](https://huggingface.co/blog/delta-weight-sync#8-try-it "8. Try It")
> 
> **TL;DR**, because you have models to train and we respect that: 
> *   Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine. For a 7B in bf16 that is 14 GB. For a frontier 1T model checkpoint that is on the order of a terabyte. Per step.
> *   It turns out you do not have to. Between two consecutive RL optimizer steps, **roughly 99% of bf16 weights are bit-identical** (and never less than 98% in the worst case). The actual delta is tiny.
> *   We landed [a TRL PR](https://github.com/huggingface/trl/pull/5417) that encodes just the changed elements as a **sparse safetensors file**, uploads it to a **Hugging Face Bucket**, and tells vLLM to fetch it. On Qwen3-0.6B, the per-step payload drops from 1.2 GB to **20 to 35 MB**.
> *   The cherry on top: we ran a full disaggregated training where the **trainer was on one box**, **vLLM lived in a Hugging Face Space**, the **Wordle environment lived in another Space**, and weights flowed through a single Hub bucket. No shared cluster, no RDMA, no VPN.
> 
> 
> Async RL just got a lot cheaper. Read on.

Two ways to ship the same weights. Red is wall-clock time during which no tokens are being generated.

* * *

## [](https://huggingface.co/blog/delta-weight-sync#1-the-one-terabyte-problem) 1. The One Terabyte Problem

If you read our previous post on [the landscape of async RL training](https://huggingface.co/blog/huggingface/async-rl-training-landscape), you already know the punchline. Every async RL library, regardless of how it spells "actor model" or which color its NCCL backend is painted, eventually trips over the same root: **weight synchronization**.

The inference engine speaks the policy of step N. The trainer just finished step N+1. The fresh weights have to get from one side to the other before the inference engine starts drifting hopelessly off-policy. This sits on the critical path whether you are running sync or async: during a blocking transfer the GPUs generate no tokens, which is _wasted idle compute_. With a sparse delta path you collapse that idle time into seconds, and the trainer does not even have to wait for the inference engine to be ready: it uploads the weights to the shared bucket the moment its optimizer step finishes and publishes "weights ready", while the inference engine fetches on its own time.

Fireworks put a very memorable number on this in their post [Frontier RL Is Cheaper Than You Think](https://fireworks.ai/blog/frontier-rl-is-cheaper-than-you-think): for a frontier 1T-parameter checkpoint at fp8 (their setting), a full snapshot is **1024 GiB**, and that is what conventional wisdom says you have to ship every time you update your rollout fleet. That is the kind of number that gets people to start drawing diagrams with mega-clusters, RDMA fabrics, and dedicated cross-region links. Their measured average delta between adjacent checkpoints lands at **20.3 GiB, or 1.98% of the full model**, and "more than 98% of weights in bf16 format remain bit-equivalent between consecutive checkpoints".

Cursor's [Composer 2 report](https://huggingface.co/papers/2603.24477) tells a parallel story. They run training and inference in different regions and stitch them together with a **shared S3 bucket** (their exact words), into which the trainer uploads compressed weight diffs _every training step_. Each cluster independently downloads and reconstructs from the shared delta chain, "requiring no direct connectivity to the training cluster". The two sides never speak to each other about parameters directly. The bucket is the wire.

Both papers agree on three things, and we want to repeat them slowly, because the rest of this post is essentially a faithful open source translation:
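  1. Most weights do not actually change between two adjacent RL steps.
  2. If you send only what changed, your bandwidth bill drops by roughly two orders of magnitude.
  3. If you route those tiny deltas through shared object storage, the trainer and the inference cluster no longer need to live in the same data center.

The only thing missing was a version you could `pip install`. So we wrote one.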

## [](https://huggingface.co/blog/delta-weight-sync#2-why-bf16-rl-weights-are-almost-always-sparse) 2. Why bf16 RL Weights Are Almost Always Sparse

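Before we wire anything up, it is worth understanding why this game is even winnable. "98% of weights do not change" sounds suspicious, like one of those numbers that works in the demo and falls apart in the real world. It is not. It falls out of how bf16 behaves at the learning rates RL uses.

A bf16 number has 7 mantissa bits. Between two consecutive powers of two there are exactly $2^7 = 128$ representable values, so the spacing between adjacent bf16 numbers around $|w|$ is roughly $|w| \cdot 2^{-7}$. When an update is smaller than half that spacing, i.e. when $|\Delta w| < |w| / 256$, it is absorbed by the cast to bf16. This is the "bf16 visibility threshold" in PULSE's figures.

Now look at what Adam does. At an RL learning rate of $3 \times 10^{-6}$, the update to a single weight is:

$$\Delta w = -\eta \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$

The normalized step $\hat{m} / (\sqrt{\hat{v}} + \epsilon)$ is of order one, so $|\Delta w| \approx \eta \approx 3 \times 10^{-6}$. For most weights, $|w|$ sits between $10^{-2}$ and $10^{-1}$ (PULSE reports a median of 0.019 for representative LLM weights). At that magnitude the threshold $|w| / 256$ is roughly $4 \times 10^{-5}$ to $4 \times 10^{-4}$, which is one to two orders of magnitude larger than the update itself.

In other words: the optimizer is whispering, and bf16 cannot hear it. The update is absorbed by rounding, the byte representation of $w$ does not change, and from the inference engine's point of view that weight did not move. Multiply by a few billion parameters and you get roughly 99% sparsity, for free, with no approximation.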
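To make the absorption argument concrete, here is a minimal sketch that emulates the bf16 cast with the standard library only. It assumes round-to-nearest-even and finite inputs; the helper names (`to_bf16_bits`, `from_bf16_bits`, `update_is_visible`) are illustrative and not part of TRL or PyTorch.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Round x to bfloat16 (round-to-nearest-even) and return the 16-bit pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # bf16 is the top 16 bits of float32; the bias turns truncation into rounding.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def from_bf16_bits(bits: int) -> float:
    """Expand a 16-bit bf16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return value


def update_is_visible(w: float, delta: float) -> bool:
    """True if adding delta to the bf16 value of w changes the stored bits."""
    before = to_bf16_bits(w)
    after = to_bf16_bits(from_bf16_bits(before) + delta)
    return before != after


w = 0.019    # median weight magnitude quoted from PULSE
eta = 3e-6   # RL learning rate used in the text

print(update_is_visible(w, -eta))        # an Adam-sized step
print(update_is_visible(w, -100 * eta))  # a step well above |w| / 256
```

With these inputs the first call should report that the stored bits are unchanged, while the second, much larger step does flip them. The sketch starts from the bf16-rounded value of `w`, so it models a single step on a bf16 weight rather than an fp32 master copy that accumulates updates.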

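This is the argument formalized in the PULSE paper (Mihai & Belilovsky, 2026). They define two bounds on the update. The absorption upper bound $10\eta$ is a conservative worst case for an Adam update, and the effective bound $\eta$ is where you actually live. The bf16 visibility threshold is $|w| / 256$. Whenever the update is below the visibility threshold, it is absorbed and the bf16 bytes do not change. Their Figure 3 compares both bounds against a set of representative LLM weights, and the conclusion is unambiguous: at $\eta = 3 \times 10^{-6}$, even the absorption upper bound is already below the visibility threshold for nearly every weight in the model. They measured this on Qwen2.5 (0.5B/1.5B/7B), Llama-3.2-3B, and Gemma-3-4B, and consistently found a mean per-step sparsity of about 99%, with a standard deviation of 0.2 to 0.4%, across 400 training steps. The worst single step still stays above 98%. So "less than 1% changes" is not a lucky measurement; it is a consequence of the arithmetic.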
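As a quick sanity check, plugging the median weight magnitude quoted from PULSE into these bounds gives the following. This is a worked instance for one representative weight, not a statement about every weight in a model:

$$
\frac{|w|}{256} = \frac{0.019}{256} \approx 7.4 \times 10^{-5}, \qquad \eta = 3 \times 10^{-6}, \qquad 10\,\eta = 3 \times 10^{-5}
$$

Both bounds sit below the visibility threshold: the effective bound $\eta$ by a factor of about 25, and the conservative bound $10\eta$ by a factor of about 2.5.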

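We do not need to predict any of this analytically (in fact, we tried to predict the change mask from Adam's $m$ and $v$ statistics and got only around 30% recall; more on that later). We just need to observe which bytes changed. That is one small boolean tensor per parameter, computed around the optimizer step.

Drag the learning rate down into RL territory and watch the marker, once rounded back to bf16, snap back to its original tick. The 256-element grid at the bottom left shows the aggregate effect on a small model.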

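## [](https://huggingface.co/blog/delta-weight-sync#3-hf-buckets-and-the-architecture) 3. HF Buckets and the Architecture

Here comes the second part of the story, where this post stops being a translation of Fireworks/Cursor and starts being a Hugging Face thing.

### [](https://huggingface.co/blog/delta-weight-sync#31-what-is-a-bucket) 3.1 What is a Bucket?

A Bucket is a repo type on the Hub built for high-frequency object storage. No commit ceremony, no PR workflow, no LFS quirks. You add files, list files, and download files. The Python surface is just two functions: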

```python
from huggingface_hub import batch_bucket_files, download_bucket_files

# Trainer side
batch_bucket_files("my-org/wordle-deltas", add=[(buffer, "deltas/step_000042.safetensors")])

# Inference side
download_bucket_files("my-org/wordle-deltas", files=[("deltas/step_000042.safetensors", local_path)])
```

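That is it. Two function calls and your weights are in flight.

Under the hood, buckets are backed by Xet, the Hub's content-defined chunked storage layer. Xet looks at every file you upload, splits it into chunks based on its actual content (rather than fixed offsets), and deduplicates against everything already in the bucket. The practical upshot, which is very satisfying in this context: even if we were too lazy to write the sparse encoding and uploaded a full anchor every step, Xet would still only transfer the changed chunks. Sparse encoding and Xet stack: we only pay for what moved, and we only pay once.

This is the open source equivalent of the "shared S3 bucket" that Fireworks and Cursor both reach for, except that the storage layer already knows about content hashes, your existing HF token already has access, and it integrates with the rest of the stack (Spaces, datasets, models).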

### [](https://huggingface.co/blog/delta-weight-sync#32-the-three-boxes) 3.2 The Three Boxes

The full architecture has exactly three boxes and one piece of shared infrastructure:

  • Trainer. Anywhere you prefer. One GPU, eight GPUs, a laptop with a USB-attached H100, no judgment. Owns the model weights, runs the optimizer, and emits sparse deltas.
  • HF Bucket. A single repository with two prefixes: anchors/ for occasional full snapshots and deltas/ for the sparse patches in between. This is the only agreement between both parties.
  • vLLM Rollout Server. Anywhere you prefer, and crucially, _not necessarily where the trainer is_. Pulls from the bucket, applies the delta, and serves rollouts.
  • Environment. Connects to the rollout server in the usual way (HTTP, function calls, etc.).

The key point to remember, as emphasized by Cursor's paper and reiterated here: the trainer and the rollout server never communicate about weights. They exchange a small POST request containing {"repo_id": ..., "filename": ...}, and that is the entire control plane. The actual data transfer occurs between each side and the bucket in parallel, without any shared network infrastructure.

Why this matters in practice:

  • The rollout server can be in another region, another cloud, or behind NAT inside a Hugging Face Space. It doesn't care.
  • Multiple inference replicas can pull the same delta from the same bucket, and Xet deduplicates the bytes across all of them.
  • The trainer never needs to know how many inference replicas exist, or where they are, or if one of them just crashed.

The trainer writes. Replicas read. The Hub handles the plumbing.

## [](https://huggingface.co/blog/delta-weight-sync#4-the-protocol) 4. The Protocol

Now we can peek under the hood. The protocol consists of four parts: a wire format, a bucket layout, a 30-line vLLM extension, and a trainer-side change detector. It's honestly less code than it sounds.

### [](https://huggingface.co/blog/delta-weight-sync#41-safetensors-as-the-wire-format) 4.1 Safetensors as the Wire Format

We chose safetensors for both the disk and wire formats. It's already the canonical checkpoint format on the Hub, every reasonable framework can read it, and the header carries arbitrary string metadata. That metadata field is where we hide the protocol.

There are two types of files in the bucket.

Anchors look like a normal checkpoint: one tensor per parameter, full bf16 weights, written every $N$ syncs (we default to $N = 10$).

```
anchors/step_000010.safetensors
  ├── model.layers.0.self_attn.q_proj.weight   (bf16, full)
  ├── model.layers.0.self_attn.k_proj.weight   (bf16, full)
  └── ...
metadata:
  sparse=False, model_version=10, sparsity=0.0
```

Deltas are the interesting part. For each parameter that actually changed, we store two entries: a flat int32 tensor of element indices, and a bf16 tensor of values at those indices.

```
deltas/step_000011.safetensors
  ├── model.layers.0.self_attn.q_proj.weight.indices   (int32, [num_changed])
  ├── model.layers.0.self_attn.q_proj.weight.values    (bf16,  [num_changed])
  ├── model.layers.0.mlp.gate_proj.weight.indices
  ├── model.layers.0.mlp.gate_proj.weight.values
  └── ...
metadata:
  sparse=True, model_version=11, sparsity=0.9938, changed_params=[...]
```

A few nice consequences of this choice:

  • A delta is a _file_. You can open it with safe_open(...) in Python and inspect every tensor in it. No proprietary framing, no length prefixes, no version handshake.
  • The metadata is self-describing. The receiver reads sparse=True/False and branches. There is no separate manifest.
  • It is zero-copy via mmap on the inference side, which matters when you're doing this every few seconds.
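The `(indices, values)` encoding described above can be sketched in a few lines. This is a toy version on plain Python lists standing in for one flattened parameter, with bf16 bit patterns held as ints; the function names are hypothetical, and the real PR works on tensors and writes safetensors files.

```python
from typing import List, Sequence, Tuple


def encode_delta(old: Sequence[int], new: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (indices, values) for every position whose stored value changed."""
    if len(old) != len(new):
        raise ValueError("old and new must have the same number of elements")
    indices = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    values = [new[i] for i in indices]
    return indices, values


def apply_delta(snapshot: List[int], indices: Sequence[int], values: Sequence[int]) -> None:
    """Patch the receiver's snapshot in place, mirroring snap[indices] = values."""
    for i, v in zip(indices, values):
        snapshot[i] = v


# Toy flattened parameter: only element 1 changes between two steps.
old = [0x3C9C, 0x3D00, 0xBC10, 0x3F80]
new = [0x3C9C, 0x3D01, 0xBC10, 0x3F80]

indices, values = encode_delta(old, new)
sparsity = 1 - len(indices) / len(old)

receiver = list(old)          # the inference side starts from the previous state
apply_delta(receiver, indices, values)
print(indices, values, sparsity, receiver == new)
```

Because the values are exact bit patterns rather than arithmetic differences, applying the patch reproduces the trainer's weights exactly, which is why the scheme involves no approximation.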

The cadence is straightforward: anchor every Nth step, delta in between. Both end up in the same bucket under anchors/ and deltas/ prefixes. Each new inference replica only needs to grab the most recent anchor and then replay the deltas since.

Ten training steps. Anchor (full snapshot) on step 1 and step 6, sparse delta on every other step. Files land in the bucket as you watch.
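The bootstrap rule for a new replica (latest anchor, then every delta after it) can be written down directly. The sketch below assumes you already have a list of file paths from the bucket and that names follow the `anchors/step_NNNNNN.safetensors` and `deltas/step_NNNNNN.safetensors` pattern shown above; `files_to_replay` is a hypothetical helper, not a TRL function.

```python
import re
from typing import List

_STEP = re.compile(r"^(anchors|deltas)/step_(\d+)\.safetensors$")


def files_to_replay(bucket_files: List[str]) -> List[str]:
    """Pick the newest anchor plus every delta with a later step, in step order."""
    anchors, deltas = [], []
    for path in bucket_files:
        m = _STEP.match(path)
        if m is None:
            continue  # ignore anything that does not follow the naming scheme
        target = anchors if m.group(1) == "anchors" else deltas
        target.append((int(m.group(2)), path))
    if not anchors:
        raise ValueError("no anchor found; a replica cannot bootstrap from deltas alone")
    anchor_step, anchor_path = max(anchors)
    later = sorted(d for d in deltas if d[0] > anchor_step)
    return [anchor_path] + [path for _, path in later]


listing = [
    "anchors/step_000000.safetensors",
    "deltas/step_000009.safetensors",
    "anchors/step_000010.safetensors",
    "deltas/step_000012.safetensors",
    "deltas/step_000011.safetensors",
]
print(files_to_replay(listing))
```

Deltas older than the newest anchor are skipped because the anchor already contains their effect.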

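### [](https://huggingface.co/blog/delta-weight-sync#42-the-trainer-side-a-boolean-mask-from-an-optimizer-hook) 4.2 The Trainer Side: a Boolean Mask From an Optimizer Hook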

The trainer needs to know which bf16 elements actually flipped. We achieve this with a tiny BF16ChangeDetector that registers a pre-step and post-step hook on the optimizer:

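```python
import torch


class BF16ChangeDetector:
    """Records, per parameter, which bf16 elements changed during optimizer.step()."""

    def __init__(self, model, optimizer):
        # (name, parameter) pairs. The PR also maps optimizer params back to
        # model params via data_ptr(); this simplified version skips that.
        self._params = list(model.named_parameters())
        self._pre_step_bf16: dict[str, torch.Tensor] = {}
        self._validated_masks: dict[str, torch.Tensor] = {}
        optimizer.register_step_pre_hook(self._pre_step_hook)
        optimizer.register_step_post_hook(self._post_step_hook)

    def _pre_step_hook(self, opt, args, kwargs):
        # Snapshot the bf16 view of every parameter on CPU before the step.
        for name, p in self._params:
            self._pre_step_bf16[name] = p.detach().to(torch.bfloat16).cpu().clone()

    def _post_step_hook(self, opt, args, kwargs):
        # Elementwise diff against the snapshot: True where the bf16 value changed.
        for name, p in self._params:
            self._validated_masks[name] = (
                p.detach().to(torch.bfloat16).cpu() != self._pre_step_bf16[name]
            )
```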

The actual code in the PR includes some additional plumbing (matching optimizer parameter objects to model parameters via data_ptr(), because Accelerate wraps them as different Python objects), but the concept fits on a napkin: snapshot, step, diff.

This is ground truth. We _tried_ the more elegant approach of predicting the mask from Adam's $m$ and $v$ statistics, using the bf16 ULP threshold directly. It works in theory. In practice, recall was around 30%, meaning we would have shipped a delta missing roughly 70% of the actual updates. Adam's normalization is messy enough that the analytical threshold is not tight. So we simply compare bytes. It costs one bf16 CPU snapshot of the model on the trainer side, which we are willing to pay.

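The four phases of the new `_sync_weight` flow are:

  1. Upload while inference keeps running. The trainer encodes the masked elements into a safetensors buffer and pushes it to the bucket. vLLM happily keeps serving the old policy the whole time.
  2. Pause vLLM. A single short HTTP call, a few hundred milliseconds.
  3. Send the `/update_weights` signal. It carries the bucket coordinates. vLLM downloads, applies, and returns.
  4. Resume. vLLM is back online.

The log lines tell the story: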

```
Delta: 1234567/200000000 elements changed (sparsity=99.38%)
[delta_engine] Uploaded user/wordle-deltas/deltas/step_000042.safetensors (27.4 MB, ...)
Weight sync: done. Total 9.4s (inference paused 1.1s)
```

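The key is what is in the parentheses. Inference was paused for 1.1 seconds. The rest of the 9.4 seconds went to the upload, which happened while the rollout server was still generating tokens. With NCCL, we paid the full sync time as pause time. Now we pay it as background time.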
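The ordering of the four phases, and the split between total time and paused time, can be sketched as a small orchestration function. This is an illustration only: `sync_weights` and the four callables are hypothetical stand-ins for the bucket upload and the HTTP calls, not the actual code in the PR.

```python
import time
from typing import Callable, Dict


def sync_weights(
    upload_delta: Callable[[], Dict[str, str]],
    pause: Callable[[], None],
    update_weights: Callable[[Dict[str, str]], None],
    resume: Callable[[], None],
) -> Dict[str, float]:
    """Run the four phases and report total vs. paused wall-clock seconds."""
    start = time.perf_counter()
    coords = upload_delta()        # phase 1: the server keeps generating meanwhile
    paused_at = time.perf_counter()
    pause()                        # phase 2
    try:
        update_weights(coords)     # phase 3: server downloads and applies
    finally:
        resume()                   # phase 4: resume even if the update failed
    end = time.perf_counter()
    return {"total_s": end - start, "paused_s": end - paused_at}


# Stub callables that only record the order in which they were invoked.
events = []


def fake_upload() -> Dict[str, str]:
    events.append("upload")
    return {"repo_id": "user/wordle-deltas", "filename": "deltas/step_000042.safetensors"}


timings = sync_weights(
    fake_upload,
    lambda: events.append("pause"),
    lambda coords: events.append("update:" + coords["filename"]),
    lambda: events.append("resume"),
)
print(events, timings)
```

The point of the structure is that the expensive step (the upload) sits outside the paused window, so only the download-and-apply step counts against generation time.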

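A single sync, end to end. Toggle between bucket-based delta transfer and NCCL broadcast, and try the replica-count toggle to see the fan-out story.

### [](https://huggingface.co/blog/delta-weight-sync#43-the-vllm-side-a-30-line-extension) 4.3 The vLLM Side: a 30 Line Extension

vLLM has a clean abstraction for this, called `WeightTransferEngine`. We implemented a `DeltaWeightTransferEngine` whose `receive_weights` method, in spirit, looks like this: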

```python
import json

from huggingface_hub import download_bucket_files
from safetensors import safe_open


# Method of DeltaWeightTransferEngine. `local_path` and `PatchMetadata` come
# from the surrounding module and are not shown here.
def receive_weights(self, update_info, load_weights):
    download_bucket_files(update_info.repo_id, files=[(update_info.filename, local_path)])
    with safe_open(local_path, framework="pt", device="cpu") as f:
        meta = PatchMetadata.from_metadata_dict(f.metadata())
        if not meta.sparse:
            # Anchor: load every tensor and snapshot it for future deltas
            for name in f.keys():
                tensor = f.get_tensor(name)
                self._bf16_snapshot[name] = tensor.clone()
                load_weights([(name, tensor)])
        else:
            # Delta: apply (indices, values) to the snapshot and hand vLLM the full tensor
            for name in json.loads(meta.changed_params):
                indices = f.get_tensor(f"{name}.indices").long()
                values = f.get_tensor(f"{name}.values")
                snap = self._bf16_snapshot[name].flatten()
                snap[indices] = values
                self._bf16_snapshot[name] = snap.reshape(self._bf16_snapshot[name].shape)
                load_weights([(name, self._bf16_snapshot[name])])
```

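We register it through vLLM's `--worker-extension-cls` flag, which means no vLLM fork is required. You install TRL into the same image as vLLM, point at our class, and you are done.

Worth mentioning: vLLM itself has an ongoing effort to support sparse weight transfer natively, vllm-project/vllm#40096. It adds `receive_sparse_weights()` and `trainer_send_sparse_weights()` directly on the `WeightTransferEngine` base class, encodes patches as `(indices, values)`, and applies them in place via `index_copy_()`, eliminating the GPU/CPU round trip entirely. That PR reports a sparse patch on Qwen3-1.7B moving 0.16 MB in 0.40 ms, versus 942 MB in 192 ms for a full dense transfer.

One honest disclaimer on the inference side: we keep a CPU bf16 snapshot of the model so we can rebuild full tensors from the sparse `(indices, values)` patches, because vLLM expects full tensors today. Once #40096 (or its successor) lands and exposes an in-place sparse `load_weights` path, we can apply the indices directly on the GPU and drop the snapshot.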

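## [](https://huggingface.co/blog/delta-weight-sync#5-standing-it-up-on-spaces-for-real) 5. Standing It Up on Spaces, For Real

This is the part we are proud of. Everything described so far works on your laptop, but the whole point of routing weights through a Hub bucket is that the trainer and the rollout server do not have to be anywhere near each other. So we ran a fully disaggregated training run across three machines, none of which share a network:

  • A single-GPU box running the trainer.
  • A Hugging Face Space (Docker SDK, L4 GPU) running vLLM with our extension class.
  • A second Hugging Face Space (CPU) running the Wordle environment server, with capacity for 256 concurrent sessions.
  • A Hub bucket in the middle.

Setting this up takes only a few `hf` CLI calls. The vLLM Space's Dockerfile is roughly the upstream vLLM image plus `pip install trl@...` plus an entrypoint: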

```dockerfile
FROM vllm/vllm-openai:latest
RUN pip install "trl @ git+https://github.com/huggingface/trl.git@delta-weight-sync"
ENV VLLM_SERVER_DEV_MODE=1
EXPOSE 7860
ENTRYPOINT ["vllm", "serve", "Qwen/Qwen3-1.7B", \
    "--host", "0.0.0.0", "--port", "7860", \
    "--worker-extension-cls", "trl.experimental.async_grpo.delta_engine.DeltaWorkerExtension", \
    "--weight-transfer-config", "{\"backend\":\"nccl\"}", \
    "--max-model-len", "32768", \
    "--gpu-memory-utilization", "0.8"]
```

Deploy it as a Space:

```bash
hf repos create $USER/vllm-wordle-inference \
    --type space --space-sdk docker --flavor l4x1 \
    --secrets HF_TOKEN=$HF_TOKEN
hf upload $USER/vllm-wordle-inference examples/scripts/openenv/vllm_space/ --type space
```

Then launch training from anywhere on Earth that can speak HTTPS:

```bash
python examples/scripts/openenv/async_wordle.py \
    --vllm-server-url https://$USER-vllm-wordle-inference.hf.space \
    --env-url https://openenv-wordle.hf.space \
    --delta-sync-repo-id $USER/wordle-deltas \
    --model Qwen/Qwen3-1.7B
```

The trainer never opens a port. The Space never sees the trainer's IP. The Wordle environment does not know either of them exists. They all talk to the Hub. Training converged on the immediate-EOS sanity check, then on real Wordle rollouts: reward went up, delta payloads stayed in the 20 to 35 MB band, and the inference-paused window per sync stayed around a second. The full run logs are linked in the companion PR.

## [](https://huggingface.co/blog/delta-weight-sync#6-so-what-does-this-actually-unlock) 6. So What Does This Actually Unlock?

A few things, and we think they are big.

**Async RL training without a cluster.** If you have one GPU and a Hugging Face account, you can now do real disaggregated training. Your trainer is on the GPU; your rollout fleet lives in Spaces; your environment lives in another Space; weights move through a bucket. This used to require either a colocated setup (with all the throughput compromises that brings) or a real cluster with shared networking. It does not anymore.

**Multi-replica inference, for free.** Stand up two vLLM Spaces, or ten. They all pull from the same bucket. Xet is content-addressed storage, so consecutive anchors share chunks at rest (which keeps your bucket from blowing up), and the Hub's edge cache makes repeated downloads of the same file cheap to serve. Want a globally distributed rollout fleet? That is now a small DevOps exercise, not a research project.

**A wire format you can debug with your existing tools.** A delta is a safetensors file. You can `safe_open` it from a notebook, list its keys, inspect the indices, compute the sparsity yourself. We have spent enough hours in tcpdump on opaque NCCL streams to appreciate this.

**A path to frontier scale.** The 20 to 35 MB number is for Qwen3-0.6B. The interesting question is what the curve looks like once you turn the dial up. Let us do the napkin math.

Take Llama-3.1-405B. In bf16 that is **810 GB** on disk. PULSE measures ~99% mean per-step sparsity at RL learning rates, so the actual delta sits around 1% of the parameters. Their deployment-measured encoding hits **108 MB on a 7B model**, which is the **~130×** reduction PULSE reports. Scaled linearly to 405B, the delta lands at roughly **6 GB per step**.

What does that buy you in wall-clock? NCCL is fast inside a cluster, sure. Assume a generous 100 GB/s aggregate broadcast bandwidth (multi-node, RDMA, the works). A full sync is `810 GB / 100 GB/s ≈ 8 seconds` of inference pause, every step. With the delta path, the trainer streams 6 GB to a bucket _in the background_ while generation keeps running, and the rollout server's actual paused window is just the apply step, which on this scale lands at a couple of seconds. So even before we leave the cluster, delta cuts the visible pause by 4× and the bytes on the wire by ~130×.

Now leave the cluster. NCCL straight up does not work across clouds. Once you want a rollout fleet in `us-east`, another in `eu-west`, maybe one in a Hugging Face Space, the bucket-based path is the _only_ path. At 1 GB/s of usable internet bandwidth, a single full broadcast would take 13 minutes; the delta does it in 6 seconds.
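The napkin math above is easy to reproduce. The short script below recomputes it from the figures quoted in the text (405B parameters, 2 bytes per bf16 weight, PULSE's 108 MB delta on a 7B model, 100 GB/s in-cluster and 1 GB/s over the internet); the linear scaling of delta size with parameter count is the same assumption the text makes, not a measured result.

```python
GB = 1e9
MB = 1e6

params = 405e9
full_bytes = params * 2                  # bf16 is 2 bytes per parameter
delta_bytes = 108 * MB * (405 / 7)       # PULSE's 7B delta, scaled linearly


def transfer_seconds(n_bytes: float, bytes_per_second: float) -> float:
    """Idealized transfer time: payload size divided by sustained bandwidth."""
    return n_bytes / bytes_per_second


in_cluster_full_s = transfer_seconds(full_bytes, 100 * GB)
internet_full_s = transfer_seconds(full_bytes, 1 * GB)
internet_delta_s = transfer_seconds(delta_bytes, 1 * GB)

print(f"full checkpoint:      {full_bytes / GB:.0f} GB")
print(f"estimated delta:      {delta_bytes / GB:.2f} GB")
print(f"in-cluster full sync: {in_cluster_full_s:.1f} s")
print(f"internet full sync:   {internet_full_s / 60:.1f} min")
print(f"internet delta sync:  {internet_delta_s:.1f} s")
```

These are bandwidth-only estimates; they ignore latency, protocol overhead, and the apply step on the rollout server.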

For a 1 TB-class model in the Fireworks framing, their own measured numbers show **20.3 GiB deltas vs the 1024 GiB full snapshot**, a ~50× reduction. PULSE's tighter, sparse encoding would push that further (extrapolating ~15 GB per delta, closer to ~65×). Either way, you are in a regime where shipping weights through commodity object storage stops being a hack and starts being the only sensible architecture.

## [](https://huggingface.co/blog/delta-weight-sync#7-whats-still-on-our-plate) 7. What's Still on Our Plate

We are not pretending this is finished. Here is the honest list.

*   **Two CPU bf16 snapshots, one too many.** The trainer keeps one (for the change detector) and the rollout server keeps one (to reconstruct full tensors for vLLM's `load_weights`). The first one we are stuck with until someone finds a tight analytical mask, which is harder than it looks. The second one goes away when vLLM gains a sparse `load_weights` API. PR forthcoming.
*   **Fixed anchor cadence.** We currently dump a full anchor every $N$ steps. An adaptive policy ("anchor when cumulative drift exceeds X") would cut anchor cost on long runs.
*   **Multi-node FSDP2 trainers.** The `BF16ChangeDetector` is built around per-process optimizer hooks. It should generalize cleanly to FSDP2, but we have not measured it at multi-node scale yet. There is a `TODO` in the PR with our name on it.
*   **Hooking into the optimizer.** Our attempt at predicting the mask from $(m, v)$ alone gave low recall, which means the analytical bf16 threshold is doing something more subtle than the textbook formula suggests. We would love to hear from anyone who has cracked this.
*   **Stacking with on-the-wire compression.** Sparse safetensors and per-chunk gzip are orthogonal. We have not tried combining them yet, although we do not expect huge compression gains.
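For the adaptive anchor cadence mentioned in the list above, one possible shape is sketched here. It is purely illustrative: it interprets "cumulative drift" as the total bytes of deltas a new replica would have to replay, and both the class name and the `max_replay_fraction` knob are assumptions, not anything in the PR.

```python
class AdaptiveAnchorPolicy:
    """Emit a full anchor once the deltas since the last anchor exceed a byte budget."""

    def __init__(self, anchor_bytes: int, max_replay_fraction: float = 0.5):
        # Budget: how many delta bytes a fresh replica may have to replay.
        self.budget = anchor_bytes * max_replay_fraction
        self.replay_bytes = 0

    def should_anchor(self, next_delta_bytes: int) -> bool:
        """Decide, before writing the next sync, whether it should be an anchor."""
        if self.replay_bytes + next_delta_bytes > self.budget:
            self.replay_bytes = 0      # a fresh anchor resets the replay chain
            return True
        self.replay_bytes += next_delta_bytes
        return False


# 1.2 GB anchor and 30 MB deltas, in line with the Qwen3-0.6B figures in the text.
policy = AdaptiveAnchorPolicy(anchor_bytes=1_200_000_000, max_replay_fraction=0.5)
decisions = [policy.should_anchor(30_000_000) for _ in range(21)]
print(decisions.index(True))
```

With these numbers, twenty 30 MB deltas fit inside the 600 MB budget and the twenty-first sync becomes an anchor, compared with every tenth sync under the fixed default.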

## [](https://huggingface.co/blog/delta-weight-sync#8-try-it) 8. Try It

*   The PR: [huggingface/trl#5417](https://github.com/huggingface/trl/pull/5417). Branch is `delta-weight-sync`.
*   The full Wordle example: `examples/scripts/openenv/async_wordle.py`.
*   The Spaces Dockerfiles: `examples/scripts/openenv/vllm_space/` and `examples/scripts/openenv/wordle_space/`.
*   Background reading: our [async RL landscape post](https://huggingface.co/blog/huggingface/async-rl-training-landscape), the [Fireworks 1 TB post](https://fireworks.ai/blog/frontier-rl-is-cheaper-than-you-think), the [Cursor Composer 2 report](https://huggingface.co/papers/2603.24477).
