Discord Rebuilds Database Operations Around Automation to Manage ScyllaDB at Massive Scale

TL;DR · AI Summary
Discord rebuilt its database operations around automation to manage large-scale ScyllaDB clusters. Its internally developed operations platform provides zero-downtime upgrades, automatic failure recovery, and elastic scaling for the real-time communication services used by millions of users.
Key Takeaways
- Discord built an automated operation platform based on ScyllaDB supporting zero-
- Self-developed tools enable elastic scaling of database clusters to handle traff
- Automated operation system significantly reduces manual intervention needs and i
Outline
Discord manages large-scale ScyllaDB clusters, where traditional manual operations can no longer keep up with business growth.
Discord redesigned its database operations processes around automation as the core principle for managing ScyllaDB clusters.
Automated tooling makes it possible to manage and maintain ScyllaDB efficiently in large-scale deployments.
Mindmap
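- Discord ScyllaDB automated operations
- Automated operations system
- Zero-downtime upgrades
- Automatic failure recovery
- ScyllaDB cluster management
- Elastic scaling
- Large-scale deployment
- Improved operational efficiency
- Reduced manual intervention
- System stability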
May 22, 2026 · 3 min read
by Craig Risi
Software Architect | Game Designer | Writer | Speaker
Discord has detailed how it rebuilt its database operations around a new internal orchestration framework called the Scylla Control Plane (SCP), enabling its small infrastructure team to automate large-scale ScyllaDB cluster management tasks that previously took days of manual work. The platform now automates complex operations such as rolling upgrades, cluster expansion, shadow cluster provisioning, and node recovery across hundreds of database nodes, dramatically reducing operational overhead and risk.
The move reflects the growing challenge faced by hyperscale platforms: operating increasingly complex distributed databases with relatively small engineering teams. Discord's Persistence Infrastructure team manages dozens of ScyllaDB clusters containing hundreds of nodes that store core platform data, including messages, channels, and servers. Historically, these operations relied on fragile Python and shell scripts that required deep institutional knowledge and constant manual supervision. According to Discord, the operational burden had become unsustainable as infrastructure scale and complexity increased.
To solve this, Discord developed SCP as a generalized orchestration and automation framework built around reusable tasks, workflows, and resumable jobs. The system allows engineers to declaratively define cluster-wide operations in YAML while enforcing safety checks, retries, dependency validation, concurrency controls, and rollback protections automatically.
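As a rough illustration of dependency validation in a task-based workflow, the Python sketch below orders a set of tasks so that each runs only after the tasks it needs. The article does not show SCP's YAML schema or internals, so the `Task` class, the `execution_order` function, and the task names are all hypothetical stand-ins for what a parsed workflow definition might contain.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class Task:
    """One reusable step; `needs` lists tasks that must finish first (hypothetical)."""
    name: str
    needs: tuple[str, ...] = ()
    max_retries: int = 0


def execution_order(tasks: list[Task]) -> list[str]:
    """Validate dependencies and return an order in which tasks can safely run."""
    known = {t.name for t in tasks}
    for t in tasks:
        missing = set(t.needs) - known
        if missing:
            raise ValueError(f"{t.name} depends on unknown tasks: {sorted(missing)}")
    graph = {t.name: set(t.needs) for t in tasks}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


# Hypothetical rolling-upgrade workflow, as it might look once parsed from YAML.
ROLLING_UPGRADE = [
    Task("drain_node"),
    Task("upgrade_package", needs=("drain_node",), max_retries=2),
    Task("restart_scylla", needs=("upgrade_package",)),
    Task("verify_health", needs=("restart_scylla",), max_retries=5),
]

print(execution_order(ROLLING_UPGRADE))
```

Rejecting unknown or circular dependencies before anything runs is one way to rule out an unsafe execution order up front rather than discovering it mid-operation.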
The framework was designed specifically to address three major weaknesses in the company's earlier tooling: unsafe execution order, inability to recover from interruptions, and difficulty extending automation to new operational scenarios. SCP introduces explicit preconditions, state persistence through SQLite, error classification, webhook-driven alerting, and configurable parallelism, ensuring that operations can safely resume even after failures or interruptions.
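The idea of resuming after an interruption through persisted state can be sketched in a few lines of Python using the standard `sqlite3` module. This is a minimal illustration, not SCP's implementation: the `completed_steps` table, the `run_job` function, and the step names are assumptions made for the example.

```python
import sqlite3
from typing import Callable


def run_job(db: sqlite3.Connection, job_id: str,
            steps: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run steps in order, skipping any already recorded as done.

    Returns the names of the steps executed in this invocation.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS completed_steps ("
        "job_id TEXT NOT NULL, step TEXT NOT NULL, PRIMARY KEY (job_id, step))"
    )
    done = {row[0] for row in db.execute(
        "SELECT step FROM completed_steps WHERE job_id = ?", (job_id,))}
    executed = []
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier run; do not repeat it
        action()  # may raise; nothing is recorded for a failed step
        with db:  # commit the checkpoint before moving to the next step
            db.execute("INSERT INTO completed_steps VALUES (?, ?)", (job_id, name))
        executed.append(name)
    return executed


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a file path would survive process restarts
    plan = [("drain", lambda: None), ("upgrade", lambda: None)]
    print(run_job(conn, "demo-job", plan))  # first run executes both steps
    print(run_job(conn, "demo-job", plan))  # second run finds nothing left to do
```

Because a step is checkpointed only after it succeeds, a rerun of the same job picks up at the first step that has no record, which is the property that makes an interrupted operation safe to restart.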
One of the most significant improvements involves Discord’s use of shadow clusters: temporary, full-production replicas that receive real traffic in order to validate ScyllaDB upgrades and infrastructure changes before they affect live systems. Previously, provisioning these environments required extensive manual coordination, including node configuration, replication setup, validation, and teardown. SCP now automates much of this process, reducing operations that once consumed more than a day of engineer attention to workflows that can largely run unattended.
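The provision, replicate, validate, and teardown sequence described above maps naturally onto a scoped resource. The Python sketch below only records each stage in a log instead of touching real infrastructure; `shadow_cluster`, `validate_change`, and the log messages are hypothetical names chosen for illustration, not part of SCP.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def shadow_cluster(name: str, log: list[str]) -> Iterator[str]:
    """Provision a temporary cluster and always tear it down afterwards."""
    log.append(f"provision {name}")
    try:
        log.append(f"start replication to {name}")
        yield name
    finally:
        # Runs on success and on failure, so no replica is left behind.
        log.append(f"teardown {name}")


def validate_change(name: str, check: Callable[[str], bool], log: list[str]) -> bool:
    """Run one validation check against a shadow cluster and report the result."""
    with shadow_cluster(name, log) as cluster:
        ok = check(cluster)
        log.append(f"validation {'passed' if ok else 'failed'} on {cluster}")
        return ok
```

Tying teardown to the end of the scope means a failed validation cannot leave a full-size replica running, which is one of the manual coordination steps that is easy to forget.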
The automation is particularly important because Discord regularly encounters edge cases that only emerge under the platform's scale and traffic patterns. According to the company, some upgrade-related issues only surface once every node in a cluster has been updated, making realistic production simulation essential before rolling changes into live environments.
A key focus of the system is ensuring operational safety in distributed environments where mistakes can cascade across clusters. SCP uses configurable concurrency controls that allow engineers to define rules such as "never restart nodes across multiple availability zones simultaneously," protecting cluster quorum and availability during maintenance operations. The framework also enforces idempotency for tasks, ensuring that interrupted jobs can be retried safely without corrupting state or duplicating actions.
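A rule such as "never restart nodes across multiple availability zones simultaneously" can be expressed as a batching constraint, and idempotency as a check-before-act step. The Python sketch below shows both under simplifying assumptions: `restart_batches`, `ensure_version`, the node and zone names, and the version labels are invented for the example and do not reflect how SCP encodes its rules.

```python
from collections import defaultdict
from typing import Iterable


def restart_batches(nodes: Iterable[tuple[str, str]],
                    max_parallel: int) -> list[list[str]]:
    """Group (node, zone) pairs into restart batches that never span zones."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    by_zone: dict[str, list[str]] = defaultdict(list)
    for node, zone in nodes:
        by_zone[zone].append(node)
    batches = []
    for zone in sorted(by_zone):
        members = by_zone[zone]
        # Each batch holds at most max_parallel nodes from a single zone.
        for i in range(0, len(members), max_parallel):
            batches.append(members[i:i + max_parallel])
    return batches


def ensure_version(installed: dict[str, str], node: str, target: str) -> bool:
    """Idempotent upgrade step: act only if the node is not already at target.

    Returns True if a change was made, False if the node was already up to date.
    """
    if installed.get(node) == target:
        return False
    installed[node] = target
    return True


cluster = [("n1", "az-a"), ("n2", "az-b"), ("n3", "az-a"), ("n4", "az-a")]
print(restart_batches(cluster, max_parallel=2))
```

Since every batch draws from one zone only, replicas in the other zones stay up while a batch restarts, and because `ensure_version` is a no-op on nodes already at the target, retrying an interrupted job does not repeat work.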
Discord emphasized that the system's biggest benefit is not just speed, but reduced cognitive load. Engineers no longer need to manually supervise long-running maintenance procedures step by step; instead, workflows execute automatically while surfacing issues only when human intervention is required.
Discord's work reflects a larger trend among hyperscale organizations toward building internal control planes and orchestration systems for stateful infrastructure. Companies operating large distributed databases increasingly recognize that ad hoc scripts and manual runbooks become operational liabilities as systems scale. Similar efforts can be seen across companies managing Cassandra- and ScyllaDB-based infrastructure, where orchestration, automation, and fault recovery are becoming central engineering priorities.
The broader Cassandra and ScyllaDB communities have long debated the operational complexity of managing distributed NoSQL systems at scale. Discussions in engineering communities on Reddit frequently point to challenges around repairs, compactions, quorum safety, and rolling upgrades, particularly in environments with hundreds or thousands of nodes. Discord's SCP initiative demonstrates how platform teams are increasingly responding by abstracting operational complexity behind policy-driven automation layers rather than relying on individual expertise and procedural discipline.
Ultimately, Discord’s Scylla Control Plane highlights a wider evolution in infrastructure engineering: moving from script-driven operations to declarative, resilient orchestration systems. As distributed databases become foundational to modern platforms, the ability to automate upgrades, recovery, scaling, and validation safely is becoming just as important as the databases themselves.
For Discord, the result is a significant operational shift. Tasks that once required sustained human attention for more than a day can now be launched, monitored, and safely resumed with minimal intervention, turning database operations from fragile manual processes into repeatable, trusted workflows.
About the Author

#### Craig Risi
Craig Risi is a man of many talents but has no sense of how to use them. He could be out changing the world but prefers to make software instead. He possesses a passion for software design, but more importantly software quality and designing systems in a technically diverse and constantly evolving tech world. Craig is also the writer of the book, Quality By Design: Designing Quality Software Systems, and writes regular articles on his blog sites and various other tech sites around the world. When not playing with software, he can often be found writing, designing board games, or running long distances for no apparent reason.
InfoQ.com and all content copyright © 2006-2026 C4Media Inc.