YugabyteDB Resiliency vs. PostgreSQL High Availability Solutions

Franck Pachot

High Availability has traditionally been achieved in databases through enhanced and automated disaster recovery solutions, such as physical replication with synchronous commit and fast failover. Although these measures reduce the impact of failures, they still result in complete downtime for the application during the failover process, with all ongoing transactions being canceled. In this blog, we will compare this approach with cloud-native solutions for high availability, focusing on resiliency built directly into the database.

Overcoming the Constraints of Monolithic PostgreSQL High Availability Based On Disaster Recovery

Traditional databases are not resilient to failures because they are monolithic, with several components being a single point of failure. These monolithic components run in a single instance, the server you connect to. They include:

  • The Data Manipulation Language (DML) parsing and execution
  • The Write Ahead Log (WAL) stream (that records the changes)
  • The shared buffer pool (with the current state of data pages)
  • The transaction table (to coordinate the ACID consistency)

If any of those components fail, crash, or need to be restarted for planned maintenance, all application sessions are disconnected and their ongoing transactions are canceled. Users have to wait for recovery, then re-connect, identify what was committed (or not), and re-run their transactions.
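The reconnect-and-retry burden this pushes onto the application can be sketched in a few lines. This is a hypothetical, minimal retry wrapper, not from any specific driver; the `connect` and `transaction` callables are illustrative placeholders:

```python
import time

class ConnectionLost(Exception):
    """Raised when the database connection drops mid-transaction."""

def run_with_retry(connect, transaction, max_attempts=5, backoff_s=1.0):
    """Re-connect and re-run a transaction until it commits.

    After a failover, the application cannot know whether its last
    commit was applied, so `transaction` must be idempotent (e.g. keyed
    on a client-generated UUID) to be safe to re-run.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            conn = connect()
            return transaction(conn)  # commits on success
        except ConnectionLost:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)  # wait out the failover

# Simulated driver: the first two attempts hit the failover window.
attempts = {"n": 0}
def connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionLost("server closed the connection unexpectedly")
    return "conn"

result = run_with_retry(connect, lambda conn: "committed", backoff_s=0)
print(result)  # prints "committed" (after two failed connection attempts)
```

Note that the idempotency requirement is the hard part: the retry loop itself is trivial, but deciding whether the previous attempt committed is not.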

Monolithic databases are, by definition, not resilient to failure. However, they can recover from failures using restored backups or standby servers, as well as WAL recovery.

When the recovery process is automated and involves no data loss (Recovery Point Objective – RPO=0) and a short duration (Recovery Time Objective – RTO in minutes), it may still meet the requirements for high availability, as long as the total downtime (due to server failures and maintenance) aligns with your SLA (Service Level Agreement). However, it does not fulfill the criteria for resiliency, as each recovery results in all users being disconnected, encountering errors, and having to wait, reconnect, check the state, and restart their transactions.
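To make the SLA arithmetic concrete, an availability target translates into a yearly downtime budget that every failover and maintenance window must fit into. A quick back-of-the-envelope calculation (not from the original post):

```python
def yearly_downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return (1 - availability) * 365.25 * 24 * 60

# Three nines allow roughly 526 minutes per year; four nines only ~53;
# five nines barely cover a single multi-minute failover.
for nines in (0.999, 0.9999, 0.99999):
    print(f"{nines} -> {yearly_downtime_budget_minutes(nines):.1f} min/year")
```

A few minutes of downtime per failover is easily absorbed by a 99.9% target, but already consumes a large fraction of a 99.99% budget.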

To reduce the RTO to minutes, recovery in traditional databases relies on full database replication, below the database itself, with block storage replicating the writes, or on top of it, by shipping the WAL and continuously applying it on a previously restored backup. To reduce RPO to zero, this replication must be synchronous.
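On the WAL-shipping side, PostgreSQL exposes this synchronous replication through a couple of primary-side settings. A hedged sketch, with placeholder standby names:

```
# postgresql.conf on the primary (illustrative values; standby names are placeholders)
synchronous_commit = on
# commit waits until at least one of the listed standbys has flushed the WAL
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
```

With `ANY 1`, either standby can acknowledge the commit, which keeps the primary writable when one standby is down.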

To be resilient without needing recovery, an SQL database must be designed with horizontal scalability to ensure no component is a single point of failure. This cannot be achieved in PostgreSQL with the original software, even with extensions and external orchestrations, because of the monolithic core functions.

A fork of PostgreSQL (like YugabyteDB) can replace those core functions to add built-in resilience, while still running the PostgreSQL code for SQL processing. Note that sharding on top of PostgreSQL (like Citus) may reduce the impact of a failure, but doesn’t solve the original problem of resiliency as each shard is still monolithic PostgreSQL.

Resilience vs Failover and Recovery for Synchronous Replication

YugabyteDB high availability (HA) is based on resilience, unlike PostgreSQL HA, which is based on failover and recovery techniques. Both use synchronous replication and allow no data loss, but in different ways.

Both solutions are often described as zero RPO (Recovery Point Objective), even when, thanks to resiliency, no recovery is involved at all. Let's compare the recovery time (Recovery Time Objective, when recovery is engaged) and resiliency (application continuity).

Below I examine what can be deployed on AWS in order to compare the different solutions with their recovery and resiliency characteristics. Options include:

  • Amazon RDS Multi-AZ DB instance for block storage replication
  • Amazon RDS Multi-AZ DB cluster or self-managed PostgreSQL with Patroni for WAL physical replication
  • YugabyteDB self-managed or managed by Yugabyte on AWS for horizontal scalability.

Similar configurations are available in other clouds and (except Aurora) on-premises.

Here are the principal characteristics of these PostgreSQL-compatible database solutions:

For each deployment below, the list shows what is replicated, when a session waits for synchronization, the write quorum over the total number of replicas, and the behavior in two typical failure scenarios with Availability Zones (AZ): one AZ down (🔄 failover vs ✅ resilience for applications in the other AZs), and, with one AZ permanently down (disaster), one more compute or storage failure in the remaining AZs.

RDS Multi-AZ DB instance (similar to on-premises SAN replication)

  • What is replicated: data pages (blocks) and WAL files
  • Waits for sync on: each write I/O (wait event: WALSync)
  • Write quorum / total replicas: 2 / 2, in 2 single-region AZs
  • One AZ down: 🔄 failover. ⏳ Wait for the 30-second timeout; ❌ all connections fail; ❌ all transactions rolled back; ⌛ wait a few minutes; 🔄 the application must reconnect and retry.
  • One more failure after an AZ disaster: 🚫 not protected, no additional failure is possible.

RDS Multi-AZ DB cluster (similar to on-premises Patroni or Kubernetes operators)

  • What is replicated: PostgreSQL WAL
  • Waits for sync on: each commit (wait event: SyncRep)
  • Write quorum / total replicas: 2 / 3, in 3 single-region AZs
  • One AZ down: 🔄 failover. ⏳ Wait for the 15-second timeout; ❌ all connections fail; ❌ all transactions rolled back; ⌛ wait a few minutes; 🔄 the application must reconnect and retry.
  • One more failure after an AZ disaster: ⚠️ it can run but is not protected.

RDS Aurora PostgreSQL (similar to other disaggregated storage like AlloyDB or Neon)

  • What is replicated: PostgreSQL WAL + storage metadata
  • Waits for sync on: each commit (wait event: XactSync)
  • Write quorum / total replicas: 4 / 6, in 3 single-region AZs
  • One AZ down: 🔄 failover. ⏳ Wait for the 15-second timeout; ❌ all connections fail; ❌ all transactions rolled back; ⌛ wait 30 seconds; 🔄 the application must reconnect and retry.
  • One more failure after an AZ disaster: ✅ resilient to a storage failure after repair (6 replicas with a write quorum of 4 and a read quorum of 3); ❌ failover if the instance fails (monolithic PostgreSQL + storage bookkeeping state).

YugabyteDB with Replication Factor 3 over 3 Availability Zones

  • What is replicated: distributed WAL (Raft log)
  • Waits for sync on: each write request (batch of write operations)
  • Write quorum / total replicas: 2 / 3, in 3 AZs, single or multi-region
  • One AZ down: ✅ resilience. Connections from the other AZs remain; ⌛ wait for the 15-second TCP timeout; ✅ transactions continue with no errors.
  • One more failure after an AZ disaster: ❌ with RF3, another failure stops a subset (the tablets out of quorum) to avoid inconsistency.

YugabyteDB with Replication Factor 5 over 2 regions

  • What is replicated: distributed WAL (Raft log)
  • Waits for sync on: each write request (batch of write operations)
  • Write quorum / total replicas: 3 / 5, in 3 AZs, single or multi-region
  • One AZ down: ✅ resilience. Connections from the other AZs remain; ⌛ wait for the 15-second TCP timeout; ✅ transactions continue with no errors.
  • One more failure after an AZ disaster: ✅ RF5 is resilient to one more failure, compute or storage; only the connections to the failing node are canceled.

These are the two most common situations in a public cloud running on commodity hardware, where failures are the norm:

  • One failure: an Availability Zone (AZ) inaccessible for a short or long time
  • With an AZ down for a long time, another failure happening in the remaining AZs

It is easy to test resilience by simulating a failure in a test environment. Connect to one node and run the following from psql:

create extension if not exists pgcrypto;  -- provides gen_random_uuid()
create table demo (
  id uuid primary key default gen_random_uuid(),
  value int,
  ts timestamptz default now()
);
\timing on
begin transaction;
insert into demo (value) select generate_series(1,1000);
\watch count=10
commit;

The \watch loop repeats the insert ten times; stop the node while it is running. On PostgreSQL or Aurora, the session gets an error:

INSERT 0 1000
Time: 3.433 ms
INSERT 0 1000
Time: 3.545 ms
INSERT 0 1000
Time: 3.480 ms
INSERT 0 1000
Time: 3.502 ms
FATAL:  57P01: terminating connection due to administrator command
LOCATION:  ProcessInterrupts, postgres.c:3300
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 0.612 ms

!?> commit;
You are currently not connected to a database.
!?>
!?>

In YugabyteDB, you can connect to any node. Typically, the application in one AZ connects to the nodes in the same AZ. To simulate a failure, stop a node (not the one you are connected to). You see higher latency at the time of failure, at most 15 seconds since that is the TCP timeout, but the transaction continues and can be committed:

INSERT 0 1000
Time: 32.165 ms
INSERT 0 1000
Time: 13647.341 ms (00:13.647)
INSERT 0 1000
Time: 36.886 ms
INSERT 0 1000
Time: 31.400 ms
COMMIT
Time: 4.758 ms

YugabyteDB is resilient to such failures thanks to horizontal scaling. The other solutions can process their consistent reads and writes on only one server, and on failure they have to roll back, fail over, and recover.

Let’s look at the details of each solution.

PostgreSQL with Patroni or Other Fast-Start Failover Automation

Patroni is a template for setting up a primary/standby configuration for PostgreSQL, with the necessary automation to avoid the most common pitfalls.

PostgreSQL provides settings for each instance to define its role (primary or standby) and how the WAL is shipped between the instances. Patroni maintains a configuration state, stored in distributed coordination datastores (ZooKeeper, Etcd, or Consul), and agents check and set the configuration. They also initiate safe failovers or switchovers when the primary server is down.

Patroni should maintain at least two standby databases to configure an automated failover for no data loss. One is the target for synchronous commit to guarantee a no-data-loss failover, and the others are read replicas.

Suppose the Sync Standby is down or isolated from the others. In that case, Patroni can elect another replica to become the new Sync Standby, so that transactions on the primary database can continue. If the primary database is down or isolated from the others, the Sync Standby can be activated to become the new primary.
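The relevant Patroni settings look roughly like the following. This is a minimal, illustrative sketch of the dynamic (DCS-stored) configuration, not a complete setup; consult the Patroni documentation for the full set of options:

```
# Patroni dynamic configuration (stored in the DCS) -- illustrative sketch
synchronous_mode: true        # commit waits for the synchronous standby
synchronous_node_count: 1     # one standby acts as the synchronous target
ttl: 30                       # leader lease; failover starts when it expires
loop_wait: 10                 # seconds between agent checks
retry_timeout: 10
postgresql:
  parameters:
    synchronous_commit: "on"
```

With `synchronous_mode` enabled, Patroni manages `synchronous_standby_names` itself and re-targets the synchronous commit when the Sync Standby fails.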

A write-split-brain situation is avoided by relying on a quorum: the primary server isolated from the others is stopped before the Sync Standby is activated. A read-split-brain is still possible when reading from the replicas, which may show different stale states because they have their own timeline.

A failover, even automated, is not a straightforward operation:

  • First, it means application downtime, because all connections to the primary database fail.
  • Second, the expected performance is not immediately available once the standby database is activated, because it starts with a cold cache.
  • Third, when the failure is resolved, the old primary must be reinstated as a standby to restore the same level of protection.

There are also additional operational considerations, like adapting monitoring and backups to the new situation. For these reasons, it is not recommended to initiate a failover immediately on a transient network failure, and this waiting period is one during which the application may time out.

Regarding availability, waiting 30 seconds on a network issue is probably better than the downtime associated with a failover. Practically, this means that in case of an instance crash or network failure, even with the best automation, the application will be down for a few minutes. This can still meet a High Availability requirement, but there are more sources of downtime with this configuration.

In addition to infrastructure failures, there are various reasons to stop servers, such as changing instance parameters that require a restart, adjusting the instance size to scale up during high workloads, or scaling down to reduce costs.

You can use the Patroni configuration to perform maintenance on a replica first and then switch over to it. This still adds downtime, since a switchover takes about as long as a failover. Some maintenance needs more downtime, like upgrading PostgreSQL itself: because the replication is physical, the primary and standby cannot run different PostgreSQL versions.

The time to upgrade a PostgreSQL database depends on many factors, including its size, because all tables must be analyzed again with the new version.

When you add up all the downtime from unplanned and planned outages, PostgreSQL primary/standby configurations, automated by Patroni or others, hardly fit into High Availability SLAs. Because PostgreSQL can run only one read-write instance, it is not resilient to an instance stop or network partition. Its protection is fast, automated recovery, initiated once a failure has lasted long enough.

Another important consideration is that even though the commit is synchronous, it is not atomic. If there's a failure during a transaction commit, the transaction may be effectively committed on the standby before it is on the primary. Its final state then depends on whether the primary comes back or the cluster fails over to the standby.

Amazon RDS Multi-AZ or Data Synchronization

Amazon RDS has two flavors of PostgreSQL-compatible databases. One is very close to open source PostgreSQL, running it as a managed service. The other is Aurora with PostgreSQL compatibility.

For both, the high-availability deployment is called Multi-AZ. It simply means you can run the compute instances in more than one Availability Zone – even if only one at a time can be opened for your application for consistent reads and writes.

For database storage, Amazon RDS uses three technologies:

  • Block storage replication (Multi-AZ DB instance) where the datafiles are synchronized between the zones. This has a latency overhead on all writes, but doesn’t require a standby instance to replicate them.
  • Physical standby (Multi-AZ DB cluster) sends the WAL (Write-Ahead Log) to a standby database that applies it to its local storage. This has a latency overhead on all commits, but not on other writes.
  • Distributed storage (Aurora) synchronizes the WAL applied to the storage servers. This has a latency overhead on all commits, but allows more than one mirror without significant overhead. Six copies of each block are distributed to three Availability Zones.

All those solutions offer similar resilience: the storage is resilient to failure, but not the compute instance. The storage is synchronized across multiple Availability Zones, and, in case of failure of the primary instance, a standby can be activated, reading the database state from the storage with no data loss (RPO zero). The RTO depends on the deployment choices, such as whether a standby database is ready to be activated or has to be started. In the case of Aurora, this standby can also be used as a timeline-consistent read replica to offload some reporting activity.

Those solutions increase availability, but they rely on a single compute instance that accepts all of the application's consistent reads and writes. If this instance crashes or is not accessible, all application connections to the database fail with an error, and their transactions are rolled back. The application can only continue once the failure of the primary is confirmed and the standby has been activated.

Because of the monolithic primary database, any maintenance that requires a restart is downtime, such as changing parameters, patching the OS, scaling up or down, or upgrading the database.

YugabyteDB Horizontal Scalability for PostgreSQL Applications

A distributed database is designed with no single point of failure. The application can connect to any node.

Data is distributed across those nodes with a replication factor of 3 to tolerate one zone failure, or a higher factor to tolerate more failures. It is not restricted to 3 Availability Zones and can be multi-region. Reads and writes are not constrained to a single node, so if one node is stopped or inaccessible because of planned maintenance or infrastructure failure, the application continues.

Of course, if the node the application is connected to fails, the application gets an error. For an Availability Zone failure this doesn't matter, because each application server connects to the nodes within its zone, and if the zone is isolated, the application in that zone is down too. Even if the database is still running in the isolated AZ, it doesn't have the quorum and therefore doesn't accept reads and writes, to avoid split-brain.

For a single-node failure, disconnection is limited to the sessions on the node that fails, so increasing the number of nodes per zone reduces the share of connections affected. When the failing node or zone is another one in the cluster, the application doesn't get an error. After the TCP timeout, typically set to 15 seconds, the transaction continues as if nothing happened, because a new Raft leader was elected 3 seconds after the one holding the lease failed to communicate with the others.

We often qualify this as RPO=0, with no data loss guaranteed by synchronizing writes to the quorum, and RTO between 3 and 15 seconds, because of the leader election and TCP timeout. However, it is not recovery per se, because it consists only of a Raft leader election. As long as the majority of Raft peers are available, the latest state is available without the need to restore or recover anything. No error was raised to the application, and it didn't have to fail over to another endpoint. The transaction table itself is replicated, which is why transactions can continue without needing to recover and roll back. From the application's point of view, the database was resilient to the infrastructure failure.
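The quorum arithmetic behind this can be sketched in a few lines (an illustrative model, not YugabyteDB code): with a replication factor RF, each tablet's Raft group keeps accepting reads and writes as long as a majority of its RF peers are reachable.

```python
def majority(rf: int) -> int:
    """Write quorum for a Raft group with `rf` replicas."""
    return rf // 2 + 1

def tolerated_failures(rf: int) -> int:
    """How many replicas can be lost while a quorum survives."""
    return rf - majority(rf)

def is_available(rf: int, failed: int) -> bool:
    """True if the surviving replicas still form a write quorum."""
    return rf - failed >= majority(rf)

print(tolerated_failures(3), tolerated_failures(5))  # -> 1 2
print(is_available(3, 2))  # -> False: tablets out of quorum stop
print(is_available(5, 2))  # -> True: RF5 survives one more failure
```

This is why RF3 handles one zone failure but stops the affected tablets on a second failure, while RF5 keeps going through a zone failure plus one more compute or storage failure.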

In a YugabyteDB cluster, the nodes can run with two software versions during a rolling upgrade. Hence, the resilience extends to minor and major upgrades that are transparent to the application.

Disaster Recovery with Asynchronous Replication

To provide fault tolerance to availability zone or region failures with YugabyteDB resilience, it is necessary to have at least three zones or close regions to achieve a quorum with reasonable latency.

In the case of two data centers or distant regions, asynchronous replication is used, which may result in a replication lag at the time of failure. The recovery on these replicas may be incomplete and could lead to some data loss (Recovery Point Objective typically in seconds), but it is timeline-consistent.

The recovery process involves downtime (Recovery Time Objective typically in minutes). It may require a manual or automated decision based on the estimated data loss and the cause of the disaster. For instance, if the primary data center is destroyed, the priority may be to restore the service as quickly as possible, even if it means accepting some known data loss. In the case of a power outage, the downtime may be extended until power is restored to prevent data loss.

In all scenarios, the disaster recovery solution involves having a secondary cluster in a remote region with asynchronous replication. Like the primary one, this secondary cluster is designed to tolerate common failures for High Availability, with local synchronous replication as described above. The secondary cluster can provide stale but timeline-consistent reads to handle reporting and backups.

In Amazon RDS, this disaster recovery setup is called a “Cross-Region read replica” (or “Global Database” for Aurora). In YugabyteDB, it is known as xCluster with transactional replication, where one cluster is active and the other is standby. The decision to act can be based on the data loss determined from the xCluster safe time.

Conclusion

YugabyteDB provides a typical cloud-native database configuration, ensuring high availability by withstanding common failures. This eliminates the need for failover and recovery, and provides additional protection for disaster recovery. This approach is particularly valuable in public clouds operating on commodity hardware, where failures are the norm.

Traditional databases (except Oracle RAC) do not distinguish high availability from disaster recovery. They simply reduce the time to recover, rather than being resilient to failures. It is also essential to account for database upgrades, which are a major source of downtime. Here YugabyteDB uses its built-in resilience, where replication can run cross-version, to allow rolling upgrades. In monolithic databases, the dictionary (or catalog) cannot be upgraded while the application is online; for major upgrades, this can mean a couple of hours of downtime.

Want to learn more?

This recent YugabyteDB Friday Tech Talk discusses how High Availability and Continuous Availability are implemented in YugabyteDB.
