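Rethinking Distributed Systems for Serverless Performance and Reliability

Q: Core design principles

Decouple compute, storage, and metadata so that each can scale independently.

Q: Elasticity and fault tolerance

Autoscaling and failure recovery keep the system highly available.

Q: Performance optimization in practice

Caching, pre-warming, and intelligent scheduling reduce latency.

Q: Future outlook

Evolve toward fully autonomous distributed systems.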

Databricks

Databricks, May 6, 2026

Rethinking Distributed Systems for Serverless Performance and Reliability

Score: 7.8

TL;DR · AI Summary

Databricks proposes re-architecting distributed systems for serverless environments by decoupling compute, storage, and metadata to improve performance and reliability.

Key Takeaways

Traditional distributed systems must be rethought for serverless; decoupling is
A unified metadata layer enables cross-service consistency and zero-copy sharing
Auto-scaling and self-healing mechanisms greatly enhance system reliability.

Outline

Jump quickly between sections.

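§ Introduction: the new challenges of serverless
Introduces how serverless architecture disrupts traditional distributed systems.
· Core design principles
Decouple compute, storage, and metadata so that each can scale independently.
· The role of a unified metadata layer
Unity Catalog provides consistent data governance and sharing.
· Elasticity and fault tolerance
Autoscaling and failure recovery keep the system highly available.
· Performance optimization in practice
Caching, pre-warming, and intelligent scheduling reduce latency.
§ Future outlook
Evolve toward fully autonomous distributed systems.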

Mindmap

See how the topics connect at a glance.

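Rearchitecting distributed systems for serverless
- Architectural decoupling
  - Separation of compute and storage
  - Independent metadata management
- Core components
  - Unity Catalog
  - Delta Lake
  - Serverless Compute
- Key capabilities
  - Automatic elasticity
  - Self-healing on failure
  - Low-latency scheduling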

Highlights

Key sentences worth saving and sharing.

To achieve true serverless reliability, we must decouple not just compute from storage, but also metadata management.
— Section 2
The unified metadata plane enables zero-copy sharing and cross-workload consistency at scale.
— Section 3
Automatic scaling is not enough — intelligent backpressure and failure isolation are critical for stability.
— Section 4
We re-architected our distributed coordination protocols to handle millions of ephemeral serverless nodes.
— Section 5
Performance is no longer just about throughput — it’s about predictability and tail latency.
— Section 6
The future lies in self-healing, self-optimizing systems powered by AI-driven observability.
— Conclusion

#Databricks #Serverless #Distributed Systems #Lakehouse #Metadata Management

Product · May 6, 2026

Rethinking Distributed Systems for Serverless Performance and Reliability

by Aaron Davidson, Roland Fäustlin and Zach Williams

Summary

Building truly serverless compute required rethinking core assumptions in distributed systems to eliminate user-managed infrastructure and improve stability.
Separating applications from compute, intelligently routing workloads, and dynamically scaling resources addresses instability and unpredictable performance in traditional clusters.
These architectural innovations deliver more stable, predictable, and cost-efficient performance by automatically optimizing infrastructure without user intervention.

Building truly serverless compute for Apache Spark required solving fundamental architectural challenges that have existed since Spark’s inception. The complexity goes far beyond simply creating warm pools of machines or implementing basic autoscaling. It required rethinking core assumptions about how distributed computing systems should operate.

Traditional Spark deployments expose infrastructure directly to users, creating tight coupling between applications and compute. Workloads compete for shared resources, small inefficiencies can cascade into failures, and users are forced to manually balance performance, cost, and reliability. As demand changes, systems struggle to maintain both high utilization and predictable performance.

Serverless compute takes a different approach by fully managing the infrastructure so that the user can focus on the data and insights. Stability becomes a system property rather than a user responsibility, enabled by architectures that isolate workloads, intelligently place them, and dynamically adapt resources.

Serverless compute is designed to improve stability, performance, and operational simplicity. Three core systems make this possible:

Spark Connect, which separates user applications from compute infrastructure
The Serverless Gateway, which intelligently routes workloads across compute resources
An adaptive autoscaler, which continuously optimizes cluster size for performance and cost

Together, these systems enable a model where performance is achieved by first ensuring stability across the system.

Image 4: Versionless – How Does It Work?

Spark Connect: Stability Through Isolation

Spark Connect represents the most significant architectural transformation in Spark's history, a complete departure from the monolithic design that has defined distributed computing for over a decade. In traditional architectures, user applications run directly on the same machine as the Spark driver, creating tight coupling that introduces critical limitations. When multiple applications compete for resources on the same cluster or when user code consumes excessive memory or CPU, the system becomes unstable, leading to failures that can cascade across workloads.

Spark Connect introduces a client-server architecture in which applications communicate with the Spark driver over gRPC, and the driver executes queries on behalf of the client rather than running user processes directly. This shifts the unit of execution from application processes to queries and enables a clean separation between user applications and infrastructure.

This decoupling significantly improves reliability and allows the platform to manage drivers independently of user workloads. By isolating applications from compute, Spark Connect creates the foundation required for stable multi-tenant execution and enables more advanced resource management across the system.
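The client-server split described above can be illustrated with a small, self-contained sketch. It is not Spark Connect's actual protocol: the real system exchanges plans over gRPC, whereas this example uses a JSON string as a stand-in for the wire message, and the names `PlanBuilder` and `Driver` are hypothetical. The point it shows is that the client only describes a query as data, while a separate driver owns the data and executes the query on the client's behalf.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PlanBuilder:
    """Client side (hypothetical): records operations as data, runs nothing."""
    table: str
    steps: list = field(default_factory=list)

    def filter(self, column, op, value):
        self.steps.append({"op": "filter", "column": column, "cmp": op, "value": value})
        return self

    def select(self, *columns):
        self.steps.append({"op": "select", "columns": list(columns)})
        return self

    def to_message(self) -> str:
        # Stand-in for the serialized plan a real client would send over gRPC.
        return json.dumps({"table": self.table, "steps": self.steps})


class Driver:
    """Server side (hypothetical): owns the data and executes received plans."""
    COMPARATORS = {">": lambda a, b: a > b, "==": lambda a, b: a == b}

    def __init__(self, tables):
        self.tables = tables

    def execute(self, message: str):
        plan = json.loads(message)
        rows = list(self.tables[plan["table"]])
        for step in plan["steps"]:
            if step["op"] == "filter":
                compare = self.COMPARATORS[step["cmp"]]
                rows = [r for r in rows if compare(r[step["column"]], step["value"])]
            elif step["op"] == "select":
                rows = [{c: r[c] for c in step["columns"]} for r in rows]
            else:
                raise ValueError(f"unknown step: {step['op']}")
        return rows


driver = Driver({"events": [
    {"user": "a", "bytes": 120},
    {"user": "b", "bytes": 900},
    {"user": "c", "bytes": 450},
]})

# The client never touches the rows; it only ships a description of the query.
message = PlanBuilder("events").filter("bytes", ">", 400).select("user").to_message()
result = driver.execute(message)
print(result)  # [{'user': 'b'}, {'user': 'c'}]
```

Because the only thing crossing the boundary is a query description, a crash or memory spike in client code cannot take down the driver, and the driver can be upgraded or replaced without changing the client.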

This architecture enables Databricks to deliver more than 25 major Spark runtime upgrades per year with a 99.998% success rate across more than 2 billion workloads, with no user action required.¹

The Gateway: Balancing Efficiency and Predictability

Distributed systems have long faced a fundamental tension between efficiency and predictability. Maximizing utilization often leads to resource contention, while isolating workloads can result in underutilized capacity. Traditional cluster models force users to navigate this tradeoff manually, often resulting in unpredictable performance or unreliable execution as workloads change.

Consider what happens when dozens of queries land simultaneously: some small exploratory scans running against sample data, others large production ETL jobs processing hundreds of gigabytes. A naive router treats them identically, forcing large jobs to wait behind small ones or letting workloads compete for the same cluster, leading to unpredictable performance degradation. This dynamic makes it difficult to deliver both high utilization and consistent performance in shared environments.

The Databricks gateway routes each workload by evaluating three real-time signals: estimated query size (derived from the logical plan), current utilization across the cluster pool, and latency profile, meaning whether a session is interactive and latency-sensitive or a batch job optimized for throughput. A small exploratory query is routed to a lightly loaded cluster that can respond in seconds; a heavy ETL job is directed to a cluster with enough headroom for its data volume, or the autoscaler is signaled to provision one. When conditions shift (a cluster fills up, a long-running job finishes, a new cluster comes online), the gateway re-evaluates placements and corrects routing without user intervention. As a result, workloads are insulated from each other: a runaway query on one cluster doesn't delay queries on another, and the system maintains high utilization without sacrificing predictability.
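A routing decision based on these three signals might look roughly like the sketch below. This is an illustration under stated assumptions, not the gateway's real algorithm: the `Cluster` and `Workload` types, the gigabyte-based capacity model, and the tie-breaking policies (least utilized cluster for interactive work, tightest fit for batch work) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    name: str
    capacity_gb: float    # hypothetical: data volume the cluster can handle at once
    used_gb: float = 0.0  # volume already committed to running workloads

    @property
    def utilization(self) -> float:
        return self.used_gb / self.capacity_gb

    @property
    def headroom_gb(self) -> float:
        return self.capacity_gb - self.used_gb


@dataclass
class Workload:
    name: str
    estimated_gb: float  # signal 1: size estimate, e.g. from the logical plan
    interactive: bool    # signal 3: latency profile


def route(workload: Workload, pool: List[Cluster]) -> Optional[Cluster]:
    """Pick a cluster, or return None to signal that new capacity is needed."""
    # Signal 2: only clusters with enough free headroom are candidates.
    candidates = [c for c in pool if c.headroom_gb >= workload.estimated_gb]
    if not candidates:
        return None  # a caller would ask the autoscaler to provision a cluster
    if workload.interactive:
        # Assumed policy: latency-sensitive work goes to the least loaded cluster.
        chosen = min(candidates, key=lambda c: c.utilization)
    else:
        # Assumed policy: batch work takes the tightest fit, which keeps
        # lightly loaded clusters free for interactive sessions.
        chosen = min(candidates, key=lambda c: c.headroom_gb)
    chosen.used_gb += workload.estimated_gb
    return chosen


pool = [Cluster("c1", 100, 80), Cluster("c2", 500, 100), Cluster("c3", 200, 20)]

small = route(Workload("explore", estimated_gb=5, interactive=True), pool)
etl = route(Workload("nightly-etl", estimated_gb=300, interactive=False), pool)
huge = route(Workload("backfill", estimated_gb=1000, interactive=False), pool)

print(small.name, etl.name, huge)  # c3 c2 None
```

The exploratory query lands on the least utilized cluster, the ETL job lands on the only cluster with 300 GB of headroom, and the oversized job returns `None`, which corresponds to the case where the autoscaler is asked for more capacity.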

Autoscaling: Optimizing the Cost-Performance Curve

Dynamic cluster sizing is the primary mechanism for optimizing price-performance in distributed systems, but determining the optimal amount of compute is inherently complex. The optimal configuration depends on workload characteristics, data size, and the relative importance of latency versus cost, with no single configuration working across all scenarios. Databricks serverless offers two modes to fit different needs: Standard, which uses less compute to reduce costs, and Performance-Optimized, which delivers faster startup and execution for time-sensitive workloads.

Startup is a priority for us, and serverless Notebooks and Workflows have made a huge difference. Serverless compute for notebooks makes it easy with just a single click. — Chiranjeevi Katta, Data Engineer at Airbus

Databricks helped us move to serverless compute, while eliminating redundant workflows. These efficiencies put us in position to lower operational costs by 25%. Pipelines on our legacy infrastructure previously took hours to process. Now, they run 2 to 5 times faster. — Evan Cherney, Senior Data Science Manager at Unilever

Traditional autoscaling approaches rely on static rules and reactive thresholds, which often fail to capture these nuances. As a result, clusters are frequently under- or over-provisioned, leading to inefficiency, instability, or both.

Serverless autoscaling takes a more adaptive approach. By continuously analyzing workload patterns and system-wide signals, the autoscaler positions each workload on the optimal cost-performance curve. Most manually configured clusters fall short of that curve, delivering worse performance at higher cost, because distributed systems are difficult to size correctly. The autoscaler dynamically adjusts compute capacity by scaling horizontally and vertically as needed, preventing out-of-memory failures and maintaining stability as workloads grow. When a task encounters an out-of-memory error, the autoscaler detects it, restarts the task on a larger VM, and continues the job without manual intervention and without failing the job.
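The out-of-memory recovery path can be sketched as a retry loop that moves a failed task to the next larger machine. This is a simplified simulation, not the autoscaler's implementation: the memory tiers in `VM_SIZES_GB`, the `TaskOutOfMemory` exception, and the `run_task` stand-in are all hypothetical, and a real system would detect the failure from executor signals rather than from a Python exception.

```python
class TaskOutOfMemory(Exception):
    """Raised by the simulated task when the VM is too small."""


VM_SIZES_GB = [8, 16, 32, 64]  # hypothetical memory tiers, smallest first


def run_task(required_gb: int, vm_gb: int) -> str:
    # Stand-in for real task execution: fails if the VM lacks memory.
    if vm_gb < required_gb:
        raise TaskOutOfMemory(f"needs {required_gb} GB, VM has {vm_gb} GB")
    return f"ok on {vm_gb} GB"


def run_with_vertical_scaling(required_gb: int):
    """Retry the task on progressively larger VMs until it fits."""
    attempts = []
    for vm_gb in VM_SIZES_GB:
        attempts.append(vm_gb)
        try:
            return run_task(required_gb, vm_gb), attempts
        except TaskOutOfMemory:
            continue  # restart the same task on the next larger VM
    raise RuntimeError(f"task needs more than {VM_SIZES_GB[-1]} GB")


status, attempts = run_with_vertical_scaling(required_gb=20)
print(status, attempts)  # ok on 32 GB [8, 16, 32]
```

From the job's point of view only the final status is visible; the two failed attempts are absorbed by the retry loop, which mirrors the behavior of continuing the job with no manual intervention.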

The impact is measurable. CKDelta reported jobs completing in 20 minutes that previously ran for 4–5 hours. Unilever saw pipelines running 2–5x faster with operational costs down 25%. HP realized cloud savings of over 32% and decreased combined job runtime by 36%.

Together, Spark Connect, the gateway, and the autoscaler enable a fundamentally different operating model for Spark. Workloads are isolated, intelligently placed, and dynamically resourced without user intervention. By addressing stability at the architectural level, serverless compute can deliver strong performance while maintaining reliability, allowing users to focus on building data and AI workloads rather than managing infrastructure.

¹ Justin Breese et al., "Blink Twice: Automatic Workload Pinning and Regression Detection for Versionless Apache Spark using Retries," SIGMOD/PODS '25, pp. 103–106. https://doi.org/10.1145/3722212.3725084
