Sharded vs. Distributed: The Math Behind Resilience and High Availability
Probability is the branch of mathematics that deals with uncertainty. It helps us understand the likelihood of different outcomes occurring. Below, we consider two alternative architecture options for scaling a database horizontally and employ probability theory to assess the resilience of each architecture to potential failure.
Vertical vs Horizontal Database Architecture Options
Vertical scaling involves increasing the resources of a single server to enhance its handling capacity. This means adding more CPU, memory, or storage resources to an existing server. This scaling approach is limited by the physical constraints of a single server and has well-documented limits in terms of connections, transactions per second, and storage.
Horizontal scaling involves spreading workloads across multiple servers. This approach allows additional servers to be added to a system, providing a scalable path beyond what a single server can achieve.
Horizontal Database Scaling Architecture Options
The two horizontal scaling architectures we consider in this blog are application-level sharding and distributed SQL.
Application-Level Sharding
Application-level sharding is a horizontal scaling strategy that leverages domain-specific knowledge to partition the data into multiple database instances running on multiple servers. Each database instance is isolated, enabling workloads to be scaled. This architecture requires custom logic for routing, rebalancing, and handling cross-shard operations.
Distributed SQL
A distributed SQL database, like YugabyteDB, provides a single logical database that horizontally scales across multiple servers with built-in replication and quorum-based logic to implement global ACID transactions. Additional servers can be added and integrated into the system, enabling workloads to be scaled. Automatic routing, rebalancing, and handling of cross-shard operations simplifies development and speeds up time-to-market.
But what about high availability and resilience? How do the uptime characteristics of these two horizontally scalable architectures compare?
For this comparison, we assume both architectures are running in Google Cloud Platform using VMs hosted as part of the Compute Engine Service. Google Cloud Platform provides a monthly uptime Service Level Objective of 99.9% for a single VM/instance. We will use this SLO in our system availability calculations, as documented here: https://cloud.google.com/compute/sla?hl=en
Architecture 1 – Application-Level Sharding
What Is Application-Level Sharding?
An application-sharded system partitions data across multiple servers that then operate semi-independently.
- Data is manually partitioned across servers — e.g., customers A–F on server 1, G–L on server 2, etc.
- Each server is responsible for only its slice of the data.
- The application must route queries to the correct server.
- If a server fails, its data becomes unavailable, even if the other servers are healthy.
This architecture comprises multiple independent database servers running in parallel. Each server retains the same profile of compute resource requirements as the underlying monolithic architecture.
Availability of a 6-Node Application-Level Sharded System
Suppose we have 6 database nodes, each running on its own virtual machine instance in GCP. GCP offers a Service Level Objective for each VM of 99.9%.
We know that:
- The probability of a node being available, P(node is available) = 0.999
- Nodes are independent of each other
- The system needs all 6 nodes to be available
In probability theory, independent events are events whose outcomes do not affect each other. For example, when throwing four dice, the number displayed on each die is independent of the other three dice.
Similarly, the availability of each server in a 6-node application-sharded cluster is independent of the others. This means that each server has an individual probability of being available or unavailable, and the failure of one server neither affects nor is affected by the failure of any other server in the cluster.
In reality, there may be shared resources or shared infrastructure that links the availability of one server to another. In mathematical terms, this means that the events are dependent. However, we consider the probability of these types of failures to be low, and therefore, we do not take them into account in this analysis.
Mathematically, if two events A and B are independent, then the probability of both A and B happening together is the product of their individual probabilities:
P(A and B) = P(A)*P(B)
Using the example of the dice:
The probability of throwing a 6 with one die is: 1/6 = 0.16667.
The probability of throwing six 6s together is: (1/6)^6 = 0.00002
Returning to our 6-node database cluster:
P(all 6 nodes available) = P(1 node available)^6 = 0.999^6 = 0.99401
The 6-node sharded architecture, therefore, supports a Service Level Objective of 99.4%, which is notably lower than the SLO of the underlying VMs.
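As a sanity check, here is a minimal Python sketch of this calculation, assuming independent node failures and the 99.9% per-VM SLO used above:

```python
# Availability of an application-sharded cluster: every node must be up.
# Assumes independent node failures and 99.9% availability per VM.

def sharded_availability(node_availability: float, nodes: int) -> float:
    """All shards (nodes) must be available for the system to be available."""
    return node_availability ** nodes

# Dice analogy: probability that six independent dice all show a 6.
print((1 / 6) ** 6)                      # ~0.00002

# 6-node application-sharded cluster with 99.9% per-node availability.
print(sharded_availability(0.999, 6))    # ~0.99401 -> ~99.4% SLO
```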
Architecture 2 – Distributed SQL
What Is a Distributed SQL Cluster?
A distributed SQL database automatically shards the data of a single logical database across multiple servers. Additionally, for resilience, it maintains replicas for each shard and typically uses a quorum-based algorithm to coordinate updates, ensuring strong consistency for reads and writes.
- Each shard of data is replicated across multiple nodes, with one replica designated as the leader.
- A quorum (majority) of replicas is required to write data, e.g., 2 of 3 if the replication factor is 3 (see the sketch after this list).
- Quorum is also required for reads; in practice this is achieved by routing the read to the shard leader, which avoids issuing the read to all 3 replicas and waiting for a majority to respond.
- Data is not tied to a single node.
- The system can tolerate node failures and still serve requests.
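To make the quorum rule concrete, here is a minimal Python sketch of the majority check a quorum-based system applies before acknowledging a write. It is an illustration of the concept only, not YugabyteDB's actual implementation:

```python
# Minimal illustration of a quorum (majority) check -- a conceptual sketch,
# not YugabyteDB's actual implementation.

def quorum_size(replication_factor: int) -> int:
    """Majority of replicas, e.g. 2 of 3 for RF3, 3 of 5 for RF5."""
    return replication_factor // 2 + 1

def can_serve(replicas_up: int, replication_factor: int) -> bool:
    """A shard can accept writes (and quorum reads) while a majority is up."""
    return replicas_up >= quorum_size(replication_factor)

print(quorum_size(3), can_serve(2, 3))   # 2 True -> RF3 tolerates 1 replica down
print(quorum_size(5), can_serve(3, 5))   # 3 True -> RF5 tolerates 2 replicas down
```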
Availability of a 6-Node RF3 Distributed SQL Cluster
Suppose we have 6 nodes, each running on its own virtual machine instance in GCP. GCP offers a Service Level Objective for each VM of 99.9%.
Each node manages one or more shards of data. Each shard is in a quorum group, with its data replicated on two other nodes (assuming replication factor (RF) is 3). To protect against availability zone (AZ) outages and individual node failures, the cluster is typically distributed across three availability zones, and the data distribution algorithm ensures that replicas of a shard are always placed in different availability zones.
In probability theory, the binomial distribution models the number of successes in a series of independent trials. For example, when throwing dice, the binomial distribution can be used to calculate the probability of getting exactly two 6s when throwing three dice.
We know that the probability of rolling a 6 is: 1/6 = 0.16667.
We know that the probability of not rolling a 6 is: 5/6 = 0.83333.
The probability of rolling 2 sixes followed by ‘not a six’ is therefore:
0.16667 * 0.16667 * 0.83333 = 0.02315 = 2.315%
The player may roll a pair of sixes in any of the following combinations:
- Roll 2 sixes followed by not a six
- Roll a six, not a six, and then a six
- Roll not a six followed by two sixes.
3 combinations of rolls result in a pair of sixes. So, the probability of rolling a pair of sixes is:
3 * 2.315% = 6.944%
The formula for calculating the Binomial Distribution, assuming p is the probability of success in 1 trial, is as follows:
P(k successes out of n trials) = n/k .p^k*(1-p)^(n-k)
where n/k= the combinations of k in n=n!/(k!.(n-k)!)
Note: Combinations of k in n’ is British terminology; students of ‘math’ in America will recognize this as ‘n choose k.’
So, to calculate the probability of rolling two 6s when throwing 3 dice:
P(two 6s out of 3 dice) = (3 choose 2) * p^2 * (1-p) = 3 * p^2 * (1-p) = 3 * 0.16667^2 * 0.83333 = 0.06944
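The same binomial calculation can be checked in a few lines of Python, using math.comb for the (n choose k) coefficient:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, each succeeding with probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of rolling exactly two 6s with three dice.
print(binomial_pmf(2, 3, 1 / 6))   # ~0.06944
```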
Returning to our 6-node database cluster, we can use the binomial distribution to calculate the probability of k servers being available in a cluster of n servers. The calculation is as follows:
P(k servers available out of n) = (n choose k) * p^k * (1-p)^(n-k), where (n choose k) = n!/(k! * (n-k)!)
We know that:
- P(node is available) = 0.999
- Nodes are independent of each other
- The nodes are evenly placed in 3 availability zones
- There are many quorum groups spread across the servers
- The Raft groups are organized such that replicas are always in separate availability zones
- If 1 node is lost, only 1 copy of the data is impacted, so the cluster remains available
- If 2 nodes are lost, so long as they are in the same AZ, only 1 copy of the data is impacted, so the cluster remains available.
- If 3 or more nodes are lost, 2 or more copies of the data are impacted, and the cluster would be unavailable.
In other words, the 6-node system is available if:
- All 6-nodes are up
- Exactly 5-nodes are up
- Exactly 4 nodes are up, but the two down nodes are in the same AZ.
P(quorum) = P(6 up) + P(5 up) + P(4 up, with the 2 down nodes in the same AZ)
Choosing 4 available nodes from the 6-node cluster, with the added constraint that the 2 unavailable nodes must come from a single availability zone, is an example of what is referred to as a “constrained combinatorial set.” This is where items are chosen from a larger group, but with certain rules or limitations that restrict the possible combinations.
These constraints can be based on relationships between elements, resource limitations, or other factors, which reduce the number of valid combinations. In our case, the non-selected (down) nodes must all come from 1 availability zone.
Working through the unconstrained case of choosing 4 nodes from a 6-node cluster, we have:
(6 choose 4) = 6!/(4! * (6-4)!) = 720/(24 * 2) = 15
Calculating the number of combinations of 4 available nodes in a 6-node cluster, with the added constraint that the 2 down nodes must be in a single availability zone, is more involved. Intuitively, however, the 2 down nodes must both be in AZ1, AZ2, or AZ3, and with 2 nodes per AZ each zone contributes exactly one such pair, so there are three valid combinations. So we have:
(6 constrained choose 4) = 3
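This constrained count can also be verified by brute-force enumeration. The sketch below assumes a hypothetical labelling of the 6 nodes (n1–n6), placed two per availability zone:

```python
from itertools import combinations

# 6 nodes placed two per availability zone (hypothetical labelling).
node_az = {"n1": "AZ1", "n2": "AZ1", "n3": "AZ2", "n4": "AZ2", "n5": "AZ3", "n6": "AZ3"}

# Count the ways to choose 2 down nodes such that both are in the same AZ.
constrained = sum(
    1
    for down in combinations(node_az, 2)
    if len({node_az[n] for n in down}) == 1
)
print(constrained)   # 3 -> (6 constrained choose 4) = 3
```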
We will use the following notation to describe a constrained combinatorial set with the constraint that the non-selected items are from 1 AZ:
(n constrained choose k), meaning choose k items from n with the (n – k) non-selected items confined to 1 AZ in the RF3 configuration, or to 1 or 2 AZs in the RF5 configuration.
Returning to the calculations:
P(6 up) = (6 constrained choose 6) * p^6 = 0.999^6 = 0.9940149800
P(5 up) = (6 constrained choose 5) * p^5 * (1-p) = 6 * 0.999^5 * 0.001 = 0.0059700599
P(4 up) = (6 constrained choose 4) * p^4 * (1-p)^2 = 3 * 0.999^4 * 0.001^2 = 0.0000029880
P(quorum) = 0.9940149800 + 0.0059700599 + 0.0000029880 = 0.9999880279
The 6-node RF3 quorum-based architecture, therefore, supports a Service Level Objective of 99.998%, which is notably higher than the SLO of the underlying VMs.
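Here is a short Python sketch that reproduces this result, using the constrained coefficients (1, 6, and 3) derived above:

```python
p = 0.999  # per-VM availability (99.9% SLO)

# Constrained coefficients for a 6-node, 3-AZ, RF3 cluster (derived above):
# 1 way for all 6 up, 6 ways for exactly 5 up, and 3 ways for exactly 4 up
# with both down nodes in the same AZ.
p_quorum = (
    1 * p**6
    + 6 * p**5 * (1 - p)
    + 3 * p**4 * (1 - p)**2
)
print(p_quorum)   # ~0.99998803 -> about 99.9988% availability
```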
Availability of a 10-Node RF5 Distributed SQL Cluster
Suppose we have 10 nodes, each running on its own virtual machine instance in GCP. GCP offers a Service Level Objective for each VM of 99.9%.
Each node manages one or more shards of data. Each shard is in a quorum group, with its data replicated on four other nodes (assuming replication factor (RF) is 5).
To protect against availability zone outages and individual node failures, the cluster is typically distributed across five availability zones. The data distribution algorithm ensures that replicas of a shard are always placed in different availability zones.
We know that:
- P(node is available) = 0.999
- Nodes are independent of each other
- The nodes are evenly placed in 5 availability zones
- There are many quorum groups spread across the servers
- The Raft groups are organized such that replicas are always in separate availability zones
- If 1 node is lost, only 1 copy of the data is impacted, so the cluster remains available
- If 2 nodes are lost, only 2 copies of the data are impacted, so the cluster remains available
- If 3 nodes are lost, so long as they are in 2 or fewer AZs, only 2 copies of the data are impacted, so the cluster remains available.
- If 4 nodes are lost, so long as they are in 2 or fewer AZs, only 2 copies of the data are impacted, so the cluster remains available.
- If 5 or more nodes are lost, 3 or more copies of the data are impacted, and the cluster would be unavailable.
P(quorum) = P(10 up) + P(9 up) + P(8 up) + P(7 up) + P(6 up)
All of the combinations below are constrained combinatorial sets with the constraint that the non-selected items are from two or fewer availability zones.
P(10 up) = (10 constrained choose 10) * p^10 * (1-p)^0 = 1 * 0.999^10 = 0.9900448802
P(9 up) = (10 constrained choose 9) * p^9 * (1-p)^1 = 10 * 0.999^9 * 0.001 = 0.0099103592
P(8 up) = (10 constrained choose 8) * p^8 * (1-p)^2 = 45 * 0.999^8 * 0.001^2 = 0.0000446413
P(7 up) = (10 constrained choose 7) * p^7 * (1-p)^3 = 40 * 0.999^7 * 0.001^3 = 0.0000000397
P(6 up) = (10 constrained choose 6) * p^6 * (1-p)^4 = 10 * 0.999^6 * 0.001^4 = 0.0000000000
P(quorum) = 0.9900448802 + 0.0099103592 + 0.0000446413 + 0.0000000397 + 0.0000000000 = 0.9999999204
The 10-node RF5 quorum-based architecture, therefore, supports a Service Level Objective of 99.999992%, which is significantly higher than the SLO of the RF3 cluster.
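The RF5 case can be reproduced the same way. The sketch below assumes a hypothetical labelling of the 10 nodes, two per availability zone, enumerates the constrained coefficients (down nodes confined to at most 2 AZs), and sums the terms:

```python
from itertools import combinations

p = 0.999  # per-VM availability (99.9% SLO)

azs = ["AZ1", "AZ2", "AZ3", "AZ4", "AZ5"]
# 10 nodes placed two per availability zone (hypothetical labelling).
node_az = {f"n{i}": azs[i // 2] for i in range(10)}

def constrained_coefficient(nodes_down: int) -> int:
    """Ways to lose `nodes_down` nodes while touching at most 2 AZs."""
    return sum(
        1
        for down in combinations(node_az, nodes_down)
        if len({node_az[n] for n in down}) <= 2
    )

# 0 to 4 nodes down keeps quorum (at most 2 of the 5 replicas impacted).
p_quorum = sum(
    constrained_coefficient(down) * p**(10 - down) * (1 - p)**down
    for down in range(0, 5)
)
print([constrained_coefficient(d) for d in range(5)])   # [1, 10, 45, 40, 10]
print(p_quorum)                                         # ~0.9999999204
```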
Summary
Architectural Impact on Availability
Traditional architectures are limited by single-node failure risk. Application-level sharding compounds this problem: if a node goes down, its shard, and therefore the system as a whole, becomes unavailable.
In contrast, distributed databases with quorum-based consensus, like YugabyteDB, provide fault tolerance and scalability, enabling higher resilience and improved availability.
Direct Comparison
| Architecture | Service Level Objective | Nines |
|---|---|---|
| Single Node | 99.9% | Three 9s |
| 6-Node Application-Level Sharding | 99.4% | Two 9s |
| 6-Node RF3 Distributed SQL Cluster | 99.998% | Four 9s |
| 10-Node RF5 Distributed SQL Cluster | 99.999992% | Seven 9s |
The Business Impact of Downtime
Mathematical probability can be a hard concept to navigate. For example, if the weather forecast model predicts that there is a 50% chance of rain on Wednesday, it does not mean that it will rain for half the day. However, if the forecast says that there is a 75% chance of rain on Thursday, the model is predicting that it is twice as likely to be dry on Wednesday as on Thursday. We calculate this as follows:
P(Dry on Wednesday) = 1 – P(Rain on Wednesday) = 1 – 0.5 = 0.5
P(Dry on Thursday) = 1 – P(Rain on Thursday) = 1 – 0.75 = 0.25
Likelihood of dry on Wednesday compared with Thursday = P(Dry on Wednesday) / P(Dry on Thursday) = 0.5/0.25 = 2
The summary table above shows there is a much greater likelihood of failure when using a 6-node application-level sharding architecture compared with a 10-node RF5 distributed SQL cluster. Specifically:
Likelihood of failure of the 6-node application-sharded cluster compared with the 10-node RF5 cluster =
P(6-node app-sharded unavailable) / P(10-node RF5 unavailable) = (100 – 99.4)/(100 – 99.999992) = 75,000
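The same ratio, expressed as a quick Python calculation using the SLO figures from the table above:

```python
# Unavailability ratio: 6-node application-sharded vs. 10-node RF5 distributed SQL.
sharded_unavailability = 100 - 99.4        # percentage points of downtime
rf5_unavailability = 100 - 99.999992

# ~75,000: the sharded cluster is roughly 75,000x more likely to be unavailable.
print(sharded_unavailability / rf5_unavailability)
```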
Does Resilience Matter?
Enterprises that deliver high-throughput, real-time transaction services, such as payment processors and anti-money laundering (AML) platforms, are critically dependent on the resilience of their infrastructure.
Every minute of downtime is lost revenue. It erodes trust and potentially causes churn. For example, a platform handling 10,000 transactions per second at $50 each with a 2% fee would lose revenue of $600,000 per minute, just in fees.
Mark Watson, CTPO of Comply Advantage, a platform that monitors transactions in real time to detect fraud and AML violations, says, “an outage could allow illicit activity to go undetected, creating regulatory exposure for our clients and possible shared liability running into hundreds of thousands of dollars. We operate under strict contractual uptime guarantees, as an outage could trigger penalties and immediate executive escalations.”
So yes, resilience matters. This is why operational resilience has progressed beyond well-documented processes and runbooks activated during a failure scenario and is now addressed by resilient, self-healing architectures like distributed SQL.
This is the purpose of DORA (the Digital Operational Resilience Act), which aims to strengthen the digital resilience of the financial sector in the EU by ensuring that firms can withstand, respond to, and recover from all types of technology disruptions and threats.
Conclusion
Traditional architectures, particularly those using single-node or application-level sharding, are prone to failure and offer limited availability. In contrast, distributed SQL databases with quorum-based replication—such as YugabyteDB—provide significantly higher availability, fault tolerance, and resilience.
The difference is not just technical but business-critical: downtime can result in substantial revenue loss, reputational damage, and regulatory risk. As operational demands and regulatory expectations increase, adopting resilient, self-healing architectures becomes essential for any enterprise relying on high-throughput, real-time services.
Read our new white paper ‘Architecting Apps for Ultra-Resilience with YugabyteDB’ to find out more about ultra-resilience, why it’s crucial for modern applications, and how YugabyteDB can help you achieve it.