Rethinking Distributed Systems for Serverless Performance and Reliability


TL;DR · AI Summary

Databricks proposes re-architecting distributed systems for serverless environments by decoupling compute, storage, and metadata to improve performance and reliability.

Key Takeaways

  • Traditional distributed systems must be rethought for serverless; decoupling is
  • A unified metadata layer enables cross-service consistency and zero-copy sharing
  • Auto-scaling and self-healing mechanisms greatly enhance system reliability.

Outline

Jump quickly between sections.

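  1. Introduces the impact of serverless architecture on traditional distributed systems.

  2. Decouples compute, storage, and metadata so that each can scale independently.

  3. Unity Catalog provides consistent data governance and sharing.

  4. Autoscaling and failure recovery ensure high availability.

  5. Caching, pre-warming, and intelligent scheduling reduce latency.

  6. Evolving toward fully autonomous distributed systems.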

Mindmap

See how the topics connect at a glance.

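  • Re-architecting distributed systems for serverless
    • Architectural decoupling
      • Separation of compute and storage
      • Independent metadata management
    • Core components
      • Unity Catalog
      • Delta Lake
      • Serverless Compute
    • Key capabilities
      • Automatic elasticity
      • Self-healing from failures
      • Low-latency scheduling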

Highlights

Key sentences worth saving and sharing.

  • To achieve true serverless reliability, we must decouple not just compute from storage, but also metadata management.

    Section 2

  • The unified metadata plane enables zero-copy sharing and cross-workload consistency at scale.

    Section 3

  • Automatic scaling is not enough — intelligent backpressure and failure isolation are critical for stability.

    Section 4

  • We re-architected our distributed coordination protocols to handle millions of ephemeral serverless nodes.

    Section 5

  • Performance is no longer just about throughput — it’s about predictability and tail latency.

    Section 6

  • The future lies in self-healing, self-optimizing systems powered by AI-driven observability.

    Conclusion

#Databricks#Serverless#Distributed Systems#Lakehouse#Metadata Management

Product · May 6, 2026

Rethinking Distributed Systems for Serverless Performance and Reliability

by Aaron Davidson, Roland Fäustlin and Zach Williams

Summary

  • Building truly serverless compute required rethinking core assumptions in distributed systems to eliminate user-managed infrastructure and improve stability.
  • Separating applications from compute, intelligently routing workloads, and dynamically scaling resources addresses instability and unpredictable performance in traditional clusters.
  • These architectural innovations deliver more stable, predictable, and cost-efficient performance by automatically optimizing infrastructure without user intervention.

Building truly serverless compute for Apache Spark required solving fundamental architectural challenges that have existed since Spark’s inception. The complexity goes far beyond simply creating warm pools of machines or implementing basic autoscaling. It required rethinking core assumptions about how distributed computing systems should operate.

Traditional Spark deployments expose infrastructure directly to users, creating tight coupling between applications and compute. Workloads compete for shared resources, small inefficiencies can cascade into failures, and users are forced to manually balance performance, cost, and reliability. As demand changes, systems struggle to maintain both high utilization and predictable performance.

Serverless compute takes a different approach by fully managing the infrastructure so that users can focus on data and insights. Stability becomes a system property rather than a user responsibility, enabled by architectures that isolate workloads, intelligently place them, and dynamically adapt resources.

Serverless compute is designed to improve stability, performance, and operational simplicity. Three core systems make this possible:

  1. Spark Connect, which separates user applications from compute infrastructure
  2. The Serverless Gateway, which intelligently routes workloads across compute resources
  3. An adaptive autoscaler, which continuously optimizes cluster size for performance and cost

Together, these systems enable a model where performance is achieved by first ensuring stability across the system.

Image 4: Versionless – How Does It Work?


**Spark Connect: Stability Through Isolation**

Spark Connect represents the most significant architectural transformation in Spark's history, a complete departure from the monolithic design that has defined distributed computing for over a decade. In traditional architectures, user applications run directly on the same machine as the Spark driver, creating tight coupling that introduces critical limitations. When multiple applications compete for resources on the same cluster or when user code consumes excessive memory or CPU, the system becomes unstable, leading to failures that can cascade across workloads.

Spark Connect introduces a client-server architecture in which applications communicate with the Spark driver over gRPC, and the driver executes queries on behalf of the client rather than running user processes directly. This shifts the unit of execution from application processes to queries and enables a clean separation between user applications and infrastructure.

This decoupling significantly improves reliability and allows the platform to manage drivers independently of user workloads. By isolating applications from compute, Spark Connect creates the foundation required for stable multi-tenant execution and enables more advanced resource management across the system.
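The client-server split described above can be illustrated with a toy model. This is only a sketch in plain Python, not Spark Connect's actual protocol: the real system sends protobuf-encoded plans over gRPC, while here the "wire" is a plain function call. The names `Plan`, `ToyDataFrame`, and `ToyDriver` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """A plain-data description of a query. It carries no user code objects."""
    op: str
    args: tuple = ()
    child: Optional["Plan"] = None


class ToyDataFrame:
    """Lives in the user's process. It only builds plans and never touches data."""

    def __init__(self, plan, send):
        self.plan = plan
        self._send = send  # stands in for the gRPC channel to the driver

    def filter_even(self):
        return ToyDataFrame(Plan("filter_even", child=self.plan), self._send)

    def limit(self, n):
        return ToyDataFrame(Plan("limit", (n,), self.plan), self._send)

    def collect(self):
        # Only at this point does anything cross the client-server boundary.
        return self._send(self.plan)


class ToyDriver:
    """Lives on the platform side. It executes queries on behalf of clients."""

    def execute(self, plan):
        if plan.op == "range":
            return list(range(plan.args[0]))
        rows = self.execute(plan.child)
        if plan.op == "filter_even":
            return [r for r in rows if r % 2 == 0]
        if plan.op == "limit":
            return rows[: plan.args[0]]
        raise ValueError(f"unknown operator: {plan.op}")


driver = ToyDriver()
df = ToyDataFrame(Plan("range", (10,)), send=driver.execute)
result = df.filter_even().limit(3).collect()
print(result)
```

Because the client holds only a plan, a crash or memory spike in user code cannot take down the driver, and the driver can be upgraded or replaced without touching the client process. In PySpark 3.4 and later, a real Spark Connect session is created with `SparkSession.builder.remote("sc://<host>:<port>")`.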

This architecture enables Databricks to deliver more than 25 major Spark runtime upgrades per year with a 99.998% success rate across more than 2 billion workloads, with no user action required.¹

**The Gateway: Balancing Efficiency and Predictability**

Distributed systems have long faced a fundamental tension between efficiency and predictability. Maximizing utilization often leads to resource contention, while isolating workloads can result in underutilized capacity. Traditional cluster models force users to navigate this tradeoff manually, often resulting in unpredictable performance or unreliable execution as workloads change.

Consider what happens when dozens of queries land simultaneously: some small exploratory scans running against sample data, others large production ETL jobs processing hundreds of gigabytes. A naive router treats them identically, forcing large jobs to wait behind small ones or letting workloads compete for the same cluster, leading to unpredictable performance degradation. This dynamic makes it difficult to deliver both high utilization and consistent performance in shared environments.

The Databricks gateway routes each workload by evaluating three real-time signals: estimated query size (derived from the logical plan), current utilization across the cluster pool, and latency profile (whether a session is interactive and latency-sensitive or a batch job optimized for throughput). A small exploratory query is routed to a lightly loaded cluster that can respond in seconds. A heavy ETL job is directed to a cluster with enough headroom for its data volume, or the autoscaler is signaled to provision one. When conditions shift (a cluster fills up, a long-running job finishes, a new cluster comes online), the gateway re-evaluates placements and corrects routing without user intervention. The result is that workloads are insulated from each other: a runaway query on one cluster doesn't delay queries on another, and the system maintains high utilization without sacrificing predictability.
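A routing decision based on these three signals might look roughly like the sketch below. It is an illustration under stated assumptions, not the gateway's actual algorithm: the `Cluster` and `Workload` types, the gigabyte-based capacity model, and the tightest-fit rule for batch jobs are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    name: str
    capacity_gb: float  # assumed capacity model: data volume the cluster can take on
    used_gb: float      # volume already committed to running work

    @property
    def headroom_gb(self) -> float:
        return self.capacity_gb - self.used_gb

    @property
    def utilization(self) -> float:
        return self.used_gb / self.capacity_gb


@dataclass
class Workload:
    estimated_gb: float  # signal 1: size estimated from the logical plan
    interactive: bool    # signal 3: latency-sensitive session or batch job


def route(workload: Workload, pool: List[Cluster]) -> Optional[Cluster]:
    """Pick a cluster, or return None to signal that a new one should be provisioned."""
    # Signal 2: only clusters with enough headroom are candidates.
    candidates = [c for c in pool if c.headroom_gb >= workload.estimated_gb]
    if not candidates:
        return None  # the caller would ask the autoscaler for capacity
    if workload.interactive:
        # Latency-sensitive work goes to the least loaded cluster.
        return min(candidates, key=lambda c: c.utilization)
    # Assumption: batch work takes the tightest fit, keeping utilization high
    # and leaving lightly loaded clusters free for interactive queries.
    return min(candidates, key=lambda c: c.headroom_gb)


pool = [
    Cluster("a", capacity_gb=100, used_gb=80),
    Cluster("b", capacity_gb=100, used_gb=10),
    Cluster("c", capacity_gb=500, used_gb=200),
]
small_query = route(Workload(estimated_gb=1, interactive=True), pool)
etl_job = route(Workload(estimated_gb=250, interactive=False), pool)
too_large = route(Workload(estimated_gb=1000, interactive=False), pool)
```

Here the small interactive query lands on cluster `b`, the ETL job on cluster `c`, and the oversized job returns `None`. A production router would also re-run this decision as utilization changes, which the sketch omits.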

Image 5: Flow Diagram


**Autoscaling: Optimizing the Cost-Performance Curve**

Dynamic cluster sizing is the primary mechanism for optimizing price-performance in distributed systems, but determining the optimal amount of compute is inherently complex. The optimal configuration depends on workload characteristics, data size, and the relative importance of latency versus cost, with no single configuration working across all scenarios. Databricks serverless offers two modes to fit different needs: Standard, which uses less compute to reduce costs, and Performance-Optimized, which delivers faster startup and execution for time-sensitive workloads.

Startup is a priority for us, and serverless Notebooks and Workflows have made a huge difference. Serverless compute for notebooks makes it easy with just a single click. — Chiranjeevi Katta, Data Engineer at Airbus

Databricks helped us move to serverless compute, while eliminating redundant workflows. These efficiencies put us in position to lower operational costs by 25%. Pipelines on our legacy infrastructure previously took hours to process. Now, they run 2 to 5 times faster. — Evan Cherney, Senior Data Science Manager at Unilever

Traditional autoscaling approaches rely on static rules and reactive thresholds, which often fail to capture these nuances. As a result, clusters are frequently under- or over-provisioned, leading to inefficiency, instability, or both.

Serverless autoscaling takes a more adaptive approach. By continuously analyzing workload patterns and system-wide signals, the autoscaler positions each workload on the optimal cost-performance curve. Most manually configured clusters fall short of that curve, delivering worse performance at higher cost, because distributed systems are difficult to size correctly. The autoscaler dynamically adjusts compute capacity by scaling horizontally and vertically as needed, preventing out-of-memory failures and maintaining stability as workloads grow. When a task encounters an out-of-memory error, the autoscaler detects it, restarts the task on a larger VM, and continues the job with no manual intervention and no job failure.
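The out-of-memory recovery path can be sketched as a retry loop that moves a failed task up a ladder of VM sizes. This is a minimal illustration, not the autoscaler's implementation: the size ladder, the `TaskOutOfMemory` exception, and the helper names are hypothetical, and a real system would detect the failure from executor signals rather than a Python exception.

```python
class TaskOutOfMemory(Exception):
    """Raised by a task that does not fit in the memory it was given."""


# Hypothetical ladder of VM memory sizes, in GB.
VM_SIZES_GB = [8, 16, 32, 64]


def run_with_vertical_scaling(task, sizes=VM_SIZES_GB):
    """Run `task`, retrying on the next larger VM size after each OOM.

    Returns (result, memory_gb_used). Only the failed task is retried,
    so the surrounding job keeps going instead of failing.
    """
    for memory_gb in sizes:
        try:
            return task(memory_gb), memory_gb
        except TaskOutOfMemory:
            continue  # escalate to the next size in the ladder
    raise TaskOutOfMemory(f"task did not fit in the largest size ({sizes[-1]} GB)")


def make_task(required_gb):
    """Build a stand-in task that fails unless it gets at least `required_gb`."""
    def task(memory_gb):
        if memory_gb < required_gb:
            raise TaskOutOfMemory(f"needs {required_gb} GB, got {memory_gb} GB")
        return "done"
    return task


result, memory_used = run_with_vertical_scaling(make_task(required_gb=20))
print(result, memory_used)
```

In this run the task fails at 8 GB and 16 GB, then succeeds at 32 GB, and the caller sees only the final result.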

The impact is measurable. CKDelta reported jobs completing in 20 minutes that previously ran for 4–5 hours. Unilever saw pipelines running 2–5x faster with operational costs down 25%. HP realized cloud savings of over 32% and decreased combined job runtime by 36%.

Together, Spark Connect, the gateway, and the autoscaler enable a fundamentally different operating model for Spark. Workloads are isolated, intelligently placed, and dynamically resourced without user intervention. By addressing stability at the architectural level, serverless compute can deliver strong performance while maintaining reliability, allowing users to focus on building data and AI workloads rather than managing infrastructure.

¹ Justin Breese et al., "Blink Twice: Automatic Workload Pinning and Regression Detection for Versionless Apache Spark using Retries," SIGMOD/PODS '25, pp. 103–106. https://doi.org/10.1145/3722212.3725084
