What Happens to a Downed Node When It Comes Back Into the Cluster?
There are several reasons a node could go down, but these events primarily fall into two categories: planned and unplanned downtime. Either way, YugabyteDB's highly available architecture provides stability. Because your data is replicated multiple times (as determined by your configured replication factor), you have a level of fault tolerance that safeguards you against a node being down.
When a node goes down, the remaining tablet peers in each RAFT group whose leader was on the downed node hold a new leader election. The physical copy of the data still remains on the downed node, so when the node comes back up, its data is typically behind that of the other nodes. While it is down, the node remains a member of its RAFT groups for 15 minutes.
Fifteen minutes is the default (configurable) window before a tablet peer is evicted from its RAFT group and its data deleted. If the downed node resumes operation before the 15-minute RAFT timeout, it receives all data changes that occurred during the downtime from the remaining nodes in the quorum. This is the typical process during a rolling restart for software updates and configuration flag changes.
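The two possible outcomes for a returning node can be sketched as a simple decision on elapsed downtime. This is an illustrative Python sketch, not YugabyteDB source code; the function and constant names are hypothetical, and only the 15-minute default mirrors the behavior described above.

```python
# Hypothetical sketch of the rejoin decision described above.
DEFAULT_EVICTION_WINDOW_SECS = 15 * 60  # 15-minute default, configurable in YugabyteDB

def rejoin_action(downtime_secs: float,
                  eviction_window_secs: float = DEFAULT_EVICTION_WINDOW_SECS) -> str:
    """Return what happens to a tablet peer whose node was down for `downtime_secs`."""
    if downtime_secs < eviction_window_secs:
        # Still a RAFT group member: the peer catches up by receiving the
        # changes it missed from the current leader.
        return "catch-up"
    # Evicted from the quorum: its local copy is deleted, and the tablet
    # was re-replicated to the remaining nodes while it was away.
    return "evicted"
```

For example, a node that returns after a 2-minute rolling restart simply catches up, while one that returns after an hour of unplanned downtime has already been evicted.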
If the node is down for longer than the default 15 minutes (or whatever value you configured), its copies of the data are re-replicated to the remaining nodes, and the downed node is evicted from the quorum. This is more likely to happen when the downtime is unplanned. In that case, you will want to stand up a new node and add it to the cluster to replace the one you lost. Do this promptly if you expect the downed node to be out for a while, since in the meantime your cluster has reduced headroom for further failures.
For example, if you have a 3-node cluster with a replication factor of 3, you can only sustain the loss of 1 node without impacting the full functionality of the cluster. To learn more about increasing your node failure threshold, read our blog on Fine-Grained Control for High Availability. You can also read more about how the RAFT consensus-based replication protocol works in YugabyteDB.
Will My Application Lose Connection to the Cluster During a Rolling Restart When I Am Updating or Altering Configuration Flags?
If your application is connected to the node that is going down, then yes, that connection will be affected. More specifically, if you see any errors, the in-flight request failed, and you should retry the transaction. One reason for performing a rolling restart across the cluster is to apply configuration flag settings that cannot be changed online. In short, we perform rolling restarts often, and they are common practice for many of our clients.
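Client-side retry logic is the usual way to ride out a rolling restart. Below is a minimal, driver-agnostic sketch: the wrapper, the choice of `ConnectionError`, and the backoff values are illustrative assumptions, not a YugabyteDB API, and in a real application `txn` would open a fresh connection (ideally via a load balancer or smart driver pointing at the surviving nodes) and run one transaction.

```python
import time

def run_with_retry(txn, retries=3, backoff_secs=0.5,
                   retriable=(ConnectionError,)):
    """Run `txn` (a zero-argument callable wrapping one transaction),
    retrying with linear backoff when a retriable error occurs, e.g.
    because the node the client was connected to is being restarted."""
    for attempt in range(retries):
        try:
            return txn()
        except retriable:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff_secs * (attempt + 1))
```

The key design point is that the retry reconnects and re-runs the whole transaction rather than resuming it, since the failed request's outcome on the restarting node is unknown.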
Explore our library of distributed SQL tips and tricks and general “how to” information on the YugabyteDB blog and on our DEV Community Blogs.
Check out the upcoming YugabyteDB events, including all training sessions, conferences, in-person and virtual events, and YugabyteDB Friday Tech Talks (designed for engineers by engineers).
In addition, there is some extremely popular “how to” content on the YugabyteDB YouTube channel.