A Guide to AI Cold Starts on Cloud Run

I saw a developer asking on Reddit if there was any “sane way” to manage Cloud Run cold starts for AI across multiple regions. They were experiencing startup latencies of up to 20 seconds, a frustrating gap where the infrastructure is spinning up while the user waits for a response.
The discussion was full of developers who had almost given up on serverless GPUs, with some even migrating back to GKE just to escape the latency. I decided it was time to dive deep into the Mechanics of AI Cold Starts and see if we could find that "sane way."
During my research into hosting models like Gemma 4 on Cloud Run, I had the privilege of co-presenting at Google Cloud Next '26 with Oded Shahar (Senior Engineering Manager for Cloud Run) and our guest speaker Ajay Nair (Global VP of Platform at Elastic).
In our session, "Build AI Architectures with Custom Models on Cloud Run," Ajay shared the production-hardened strategies that allow Elastic to serve millions of daily requests across 17+ model variants, all while maintaining the 'scale-to-zero' efficiency of Cloud Run.

Ajay showed us that the secret isn't just in the model, but in treating GPUs as fungible compute rather than infrastructure to manage.
I realized then that minimizing cold start latency isn't just about the model, it's about the infrastructure patterns and architectural decisions that keep it fast, scalable, and secure.
The Anatomy of an AI Cold Start
As the official Google Cloud GPU best practices explain, an AI cold start is a different problem from a standard web microservice cold start. You aren't just booting code; you're moving gigabytes of weights into a specialized physical accelerator.
Think of it as a four-phase race. If you don't optimize each step, you're going to lose your users.
Phase 1: Infrastructure Provisioning (~5s)
Cloud Run allocates the physical GPU and injects pre-installed NVIDIA drivers. Since Google manages the drivers for you, you don't have to bloat your Dockerfile.
Phase 2: Block-Level Container Image Streaming (1-2s)
Cloud Run uses "image streaming," meaning it pulls only the blocks needed to boot. Your 15GB CUDA image can actually start as fast as a tiny Node.js app!
Phase 3: Engine Initialization (5-15s)
This is where your inference engine (vLLM, Ollama) warms up. This is a massive CPU-heavy task, and it's where most people get throttled without realizing it.
Phase 4: Model Loading & VRAM Transfer
This is the final hurdle - moving those model weights from storage into the GPU memory. Unlike standard web apps where CPU is king, GPU memory is your primary constraint here. If your model’s weights don’t fit entirely within the GPU memory, performance degrades significantly as it swaps to slower system RAM.
Best Practices for Handling AI Cold Starts
To build a "sane" production environment, here are a few crucial levers you can pull, informed by the official Google Cloud documentation on AI inference with GPUs.
Optimize Phase 4
Choose the Right Deployment Option
Phase 4 is the "final hurdle" where you move gigabytes of weights from storage into GPU memory. Your choice of storage determines how fast this transfer happens:
- Cloud Storage (Concurrent Download) - Fastest: Using the Google Cloud CLI (gcloud storage cp) allows you to download model files in parallel. This is the recommended method for massive weights because it maximizes network throughput and drastically reduces transfer time.
- Cloud Storage (FUSE) - Easiest: This provides "zero-code" changes by mounting a bucket as a local file system. However, because it does not parallelize the initial download, it is significantly slower for large model weights.
- Container Image - Best for <10GB: Baking weights into your image is efficient for smaller models thanks to Cloud Run's Image Streaming. For models over 10GB, however, the import and streaming overhead can become a bottleneck.
- Internet: Avoid this. It is the slowest and least predictable path for production inference.
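To illustrate why the concurrent path is faster, here is a minimal Python sketch of fetching model shards in parallel. It is not how gcloud storage cp is implemented. The `fetch_one` callable is a hypothetical stand-in for whatever storage client you use to download a single object, and `max_workers=8` is an arbitrary starting point rather than a recommended value.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def download_in_parallel(
    object_names: Iterable[str],
    fetch_one: Callable[[str], int],
    max_workers: int = 8,
) -> int:
    """Fetch every object concurrently and return the total bytes transferred.

    `fetch_one` is a hypothetical callable that downloads one object and
    returns its size in bytes; in practice it would wrap a storage client.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs fetch_one on several objects at once. Summing the
        # results re-raises the exception of any download that failed.
        return sum(pool.map(fetch_one, object_names))
```

Because the work is network-bound, threads are enough here; the shards download side by side instead of one after another, which is the same idea behind the parallel CLI download.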
Model Format & Size
Optimizing your model's format and size is a direct "hack" to shorten Phase 4 (Model Loading & VRAM Transfer). Because this phase is constrained by how fast you can move gigabytes of data into VRAM, smaller and more efficient files are critical.
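- 4-bit Quantization: This is the ultimate cold start hack. Smaller weights mean fewer gigabytes to pull from storage, which directly accelerates the download and transfer portion of Phase 4.
- Fast Formats: Choose a model format that loads quickly, such as GGUF, to minimize startup time. For best performance, stay away from Python "pickle" files and use Safetensors for zero-copy loading.
- Ensure VRAM Fit: Use quantized models so that the weights fit entirely in GPU memory. If the model exceeds VRAM, Phase 4 stalls as the system falls back to significantly slower RAM.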
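The size argument can be checked with back-of-the-envelope arithmetic. The sketch below estimates raw weight size from parameter count and bits per weight, then checks it against a VRAM budget. The 20% headroom for KV cache and activations is an illustrative assumption, not a documented figure, and real checkpoints carry some extra overhead on top of the raw weights.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_in_vram(
    num_params: float,
    bits_per_weight: int,
    vram_gb: float,
    headroom_fraction: float = 0.2,  # assumed reserve for KV cache and activations
) -> bool:
    """True if the weights fit in VRAM while leaving the assumed headroom free."""
    usable_gb = vram_gb * (1 - headroom_fraction)
    return weight_size_gb(num_params, bits_per_weight) <= usable_gb
```

For a hypothetical 8-billion-parameter model, `weight_size_gb(8e9, 16)` gives roughly 16 GB while `weight_size_gb(8e9, 4)` gives roughly 4 GB, which is why 4-bit quantization cuts both the download and the VRAM footprint to about a quarter.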
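Optimize Phases 3 & 4: Infrastructure and Network Levers
These infrastructure settings provide the resources needed to accelerate the most demanding parts of the startup process.
#### **Startup CPU Boost (Accelerates Phase 3)**
This feature temporarily doubles your CPU power during startup. A 1 vCPU instance is boosted to 2 vCPUs during startup and for the first 10 seconds of serving. This is critical for Phase 3 because engine initialization is a massive CPU-heavy task.
#### **Direct VPC Egress and PGA (Accelerates Phase 4)**
Using Direct VPC egress with Private Google Access (PGA) keeps your model weight traffic on Google's internal high-speed backbone. This optimizes the network path and shortens the time it takes to move billions of weights into VRAM.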
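#### Concurrency Tuning (Avoids Cold Starts)
In Cloud Run, "concurrency" is the maximum number of requests a single instance can handle before the platform scales out to start a new instance. For AI workloads, you must tune this setting together with your model engine's internal parallelism flags (for example, --max-num-seqs for vLLM or OLLAMA_NUM_PARALLEL for Ollama).
Use the official Google Cloud formula to find your ideal Cloud Run concurrency:
(number of model instances × parallel queries per model instance) + (number of model instances × ideal batch size)
Example: If your instance loads 3 model instances onto the GPU, and each model instance can handle 4 parallel queries with an ideal batch size of 4, set your Cloud Run maximum concurrent requests to 24: (3×4)+(3×4).
The math: The goal is to keep the GPU fully saturated while making sure users don't get stuck in long queues. In this example, the 24 concurrent requests split into two functional groups:
- Active processing (12 requests): Calculated as (3 instances × 4 queries), this is the number of requests the GPU can actively work on at any given moment.
- "Next batch" buffer (12 requests): Calculated as (3 instances × 4 batch size), these are the requests waiting inside the container to "board." As soon as the GPU finishes the first batch, it immediately picks up these waiting requests.
By tuning this value as high as your VRAM allows (typically 10-20 users), a single warm instance can handle many requests without triggering a new scaling event and its accompanying cold start.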
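The concurrency formula translates directly into a small helper. This is a sketch of the arithmetic only; the function and parameter names are made up for illustration.

```python
def cloud_run_concurrency(
    model_instances: int,
    parallel_queries_per_instance: int,
    ideal_batch_size: int,
) -> int:
    """Apply the formula: active requests plus the "next batch" buffer."""
    # Requests the GPU can work on at the same moment.
    active = model_instances * parallel_queries_per_instance
    # Requests queued inside the container, ready for the next batch.
    buffered = model_instances * ideal_batch_size
    return active + buffered


# The worked example from the text: 3 model instances, 4 parallel queries, batch size 4.
assert cloud_run_concurrency(3, 4, 4) == 24
```

Keeping the two terms separate makes it easier to see which half moves when you change the engine's parallelism flag versus its batch size.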
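#### Scaling Controls (Adjusting the Threshold)
While the formula above defines your maximum capacity, you can also adjust when Cloud Run decides to start the next instance. Cloud Run's autoscaler typically targets 60% utilization, but for long-running AI cold starts you can raise that threshold to 80% or 90% through scaling controls.
- Concurrency target: Increasing this value lets you pack more requests into a warm instance before triggering a scale-out.
- CPU target: Increasing the CPU target prevents the platform from starting new instances merely because CPU utilization spikes during initialization or heavy inference.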
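Scaling and Reliability Strategies

Sometimes the best way to handle a cold start is to avoid it entirely or to manage it proactively.
The Single-Region "Always On" Trade-Off
If you deploy globally, the cost of setting minimum instances to 1 in every region adds up. Instead, consider serving "always on" from a single region. 100 milliseconds of global network latency is a better user experience than a 20-second local cold start.
The 15-Minute Grace Period: A common question is "How long does my instance stay warm after a request?" Cloud Run generally keeps instances alive for 15 minutes after they go idle (processing zero requests). If your traffic is predictable and arrives every 10-12 minutes, you may not need an "always on" service at all; the platform's default shutdown policy keeps a warm instance ready for your next user for free.
The "Wake-Up Call" Strategy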
Sometimes the best way to handle a cold start is to proactively mask it. If your UI can predict an upcoming request, for example, when a user clicks "New Chat" or begins hovering over a text area, you can send a lightweight health check to your service immediately. By the time the user finishes typing their prompt, the first two phases of the cold start (Infrastructure Provisioning and Container Image Streaming) are already finished in the background.
Pro-Tip: Use Non-Inference Endpoints
To make this "wake-up call" as fast as possible, always use a non-inference endpoint rather than sending a dummy prompt like "hi".
- Why it's faster: Non-inference endpoints (like /v1/models for vLLM or /api/tags for Ollama) are handled by the container's web server the moment it starts. They don't have to wait for the slow "Phase 4" model loading and VRAM transfer to complete before sending a success response.
- No Chat Pollution: Because these endpoints don't trigger the model's completion logic, they won't interfere with the user's actual chat history or accidentally trigger session creation in your backend.
Recommended Endpoints:
- vLLM: GET /health or GET /v1/models
- Ollama: GET /api/tags or GET /api/version
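A wake-up call can be as small as one GET request. The sketch below uses only the Python standard library and the endpoint paths listed above; the helper names, the service URL in the usage note, and the 30-second default timeout are assumptions for illustration. In a browser UI the same idea would be a fire-and-forget fetch.

```python
import urllib.request

# Non-inference endpoints named in the article, keyed by engine.
WARMUP_PATHS = {"vllm": "/v1/models", "ollama": "/api/tags"}


def warmup_url(base_url: str, engine: str) -> str:
    """Build the wake-up URL for the given engine ("vllm" or "ollama")."""
    return base_url.rstrip("/") + WARMUP_PATHS[engine]


def send_wakeup(base_url: str, engine: str, timeout_s: float = 30.0) -> bool:
    """Send a lightweight GET to nudge the service awake.

    Returns True on a 2xx response and False on any network or HTTP error.
    The result is informational only; callers should not block the UI on it.
    """
    try:
        with urllib.request.urlopen(warmup_url(base_url, engine), timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```

For example, `send_wakeup("https://my-service.example.run.app", "vllm")` (a hypothetical URL) could be triggered when the user clicks "New Chat", ideally from a background thread so the caller never waits on the cold start.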
Tune Startup Probes for VRAM
AI models take significant time to move gigabytes of weights from storage into GPU memory (Phase 4). If your startup check fails too many times, Cloud Run will assume your container is broken and kill it.
To prevent this:
- Increase the Failure Threshold: Use a high failureThreshold (e.g., 60 or more). Since the total allowed startup time is failureThreshold × periodSeconds, a threshold of 60 with a 5-second period gives your model a healthy 5-minute window to load.
- Utilize the 30-Minute Maximum: While standard services are limited to 4 minutes, Cloud Run supports a total startup time of up to 30 minutes (1,800 seconds) for intensive workloads.
- Avoid False Positives (The Ollama Fix): Be careful with engines like Ollama, which may open a TCP port as soon as the service starts, but before the model is actually in VRAM. Preload models in the container's entrypoint script so that the startup probe only passes once the model is truly ready for inference.
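The probe arithmetic above is easy to sanity-check before deploying. The following sketch computes the startup window and builds a dictionary using the startupProbe field names from the Cloud Run service YAML; the helper names and the example path and port are hypothetical, and the 1,800-second ceiling is simply the figure quoted in this article.

```python
MAX_STARTUP_SECONDS = 1800  # the 30-minute ceiling quoted in the text


def startup_window_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Total time the container has to pass its startup probe."""
    return failure_threshold * period_seconds


def startup_probe(path: str, port: int, failure_threshold: int, period_seconds: int) -> dict:
    """Return startupProbe settings, rejecting windows above the quoted ceiling."""
    window = startup_window_seconds(failure_threshold, period_seconds)
    if window > MAX_STARTUP_SECONDS:
        raise ValueError(f"startup window of {window}s exceeds {MAX_STARTUP_SECONDS}s")
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


# The example from the text: 60 failures at a 5-second period is a 5-minute window.
assert startup_window_seconds(60, 5) == 300
```

Calling `startup_probe("/health", 8000, 60, 5)` (path and port are placeholders for your engine's readiness endpoint) yields a structure you could serialize into the service definition.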
Lessons from Elastic’s Strategy
In our Next '26 session, Ajay Nair highlighted three architectural decisions that allowed Elastic to treat GPUs as fungible compute, rather than infrastructure to manage:
- Bypass the Compilation Tax: By setting enforce_eager=True in vLLM, they traded a tiny bit of throughput for cold starts that finish in less than a minute rather than multiple minutes.
- Standalone Checkpoints: They avoided the latency of runtime adapter-switching by pre-merging each LoRA variant into a standalone checkpoint.
- One Workload, One Service: Each independently-scalable workload — defined by model, task adapter, and traffic shape — is deployed as its own Cloud Run service. This produces 30+ services across ~15 model families, with some models split by task (e.g., v5 retrieval vs. clustering) or by query/passage role.
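As a rough illustration of the first two points, the sketch below assembles a vLLM server command line for a container entrypoint. This is not Elastic's actual configuration: the helper name and the checkpoint path are hypothetical, and it assumes the `--enforce-eager` CLI flag is the command-line counterpart of `enforce_eager=True`, alongside the `--max-num-seqs` flag mentioned earlier.

```python
def vllm_serve_args(model: str, max_num_seqs: int, enforce_eager: bool = True) -> list[str]:
    """Build an argv list for launching the vLLM server.

    `model` would point at a pre-merged standalone checkpoint, so no
    adapter switching is needed at runtime.
    """
    args = ["vllm", "serve", model, "--max-num-seqs", str(max_num_seqs)]
    if enforce_eager:
        # Eager mode skips graph capture at startup, trading some
        # throughput for a shorter engine initialization.
        args.append("--enforce-eager")
    return args


# Hypothetical checkpoint path, for illustration only.
command = vllm_serve_args("/models/merged-checkpoint", max_num_seqs=4)
```

The resulting list can be passed to `subprocess.run(command)` from an entrypoint script, which keeps the startup-sensitive flags in one reviewable place per service.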
Ready to get started?
Optimizing the cold start process is the difference between a hobby project and a production-ready application. The best part? Cloud Run handles the NVIDIA driver and CUDA installation for you, starting the instance in about 5 seconds.
For a deeper dive, the official documentation is your best friend:
- Best practices: AI inference on Cloud Run with GPUs
- Configure GPU for Cloud Run services
- Startup CPU boost for Cloud Run
For the full technical breakdown, I highly recommend watching the recording of the session from Google Cloud Next '26. It provides the most comprehensive blueprint for hosting high-performance open models on serverless infrastructure.
Happy building!
_Special thanks to Sara Ford and Shane Ouchi from the Cloud Run team and to Zac Li from Elastic for the helpful review and feedback on this article._