YugabyteDB 1.2 Passes Jepsen Testing
You can join the discussion about the results on HackerNews here.
Last year we published our DIY Jepsen testing results – including the tests and failure modes implemented as well as the bugs found. We recently engaged Kyle Kingsbury, the creator of the Jepsen test suite, for an official analysis and are happy to report that YugabyteDB 1.2 formally passes Jepsen tests using the YCQL API. To quote from the report:
Kyle found three safety issues, all of which are fixed as of YugabyteDB 1.2:
One issue caused a violation of snapshot isolation under huge clock fluctuations. The underlying problem was that our locking subsystem, used to ensure transactions do not execute concurrently, did not propagate the status of trying to acquire the locks. This resulted in the higher layer assuming the lock was successfully acquired within a deadline, even when the lock acquisition timed out. The fix was simply to propagate the status of the lock operation.
A second issue revealed that in rare conditions involving multiple network partitions and membership changes, successfully acknowledged writes could be discarded. The underlying issue was that the Raft implementation in YugabyteDB would accept votes from any peer (even non-voting) in the active config, causing the write to get acknowledged to the client with insufficient replication. This happens when a failure (such as a network partition) happens when the Raft membership is changing (for example, during a load balancing operation to ensure even distribution of queries across nodes). This has been fixed by accepting votes only from voting members in the active config.
A third issue (which was already identified and fixed) caused reads to violate snapshot isolation ultimately leading to data corruption. This issue happened because a conflicting transaction’s committed status could be missed by the conflict resolver at the moment when that conflicting transaction’s provisional records were being converted to permanent records. This would result in an undetected conflict and a write based on a stale read. This bug was fixed in version 1.1.10. If you’re wondering about why an already uncovered issue made its way into this list, read on to the end!
The rest of this blog post adds some behind-the-scenes color in terms of the good, the bad and the ugly (or in other words, bloopers) parts of this effort.
YugabyteDB passed Jepsen testing, which is probably the best thing in itself. YugabyteDB uses a unique design to achieve linearizability with high performance while being very resilient to errors under clock skews. We are stoked to see that Kyle really liked this unique design.
Below are some of the positive things Kyle had to say (note that the quotes slightly paraphrased below for better readability).
YugabyteDB uses a combination of the monotonic and the HLC to make sure it is much more resilient to clock skew. Note that the —
max_clock_skew_usec config flag controls the maximum clock skew allowed between any two nodes in the cluster and has a default value of 50ms. Kyle calls this out in his report (emphasis is mine in these quotes):
One of the core principles when designing YugabyteDB was offering high-performance without sacrificing correctness. Kyle calls this out as well.
For me personally, the thing that stands out the most is Kyle’s comment about the Yugabyte engineering team. Work is a lot more fun and satisfying when an incredibly talented yet humble team works in an open, fast-paced environment.
Nothing in this world is perfect. There is no such thing as a bug-free application. Of course, correctness bugs are always bad, but these bugs along with the fixes are discussed above. So what else was bad, and specifically what are the lessons that we are taking back from this effort?
We have seen our users get started with YugabyteDB on a local cluster in minutes. However, when working with Kyle, we realized that the multi-node distributed cluster was not very intuitive to setup because it requires our users to understand the underlying architecture first. To make matters worse, there are a number of advanced flags that are available to tune the system. Not all of which are documented clearly, and trying to discover these flags from the software is next to impossible.
This is clearly an area of improvement for YugabyteDB, and we are taking these lessons to heart. Stay tuned for a post that will talk about this aspect in detail.
YugabyteDB is designed to be a highly available system. Between the database servers, after a network partition heals, Kyle pointed out that YugabyteDB could sometimes take over 60 seconds to recover. This has now improved to take just 25 seconds to recover.
However, there is room for improvement between the client and the server. While existing clients cache the information on how the data is partitioned and replicated across cluster nodes, new application clients are required to read this information from one node (the YB-Master leader). While this has not been an issue in practice, if that one node is partitioned, this causes a perceived unavailability from the application. Kyle calls this out in his report as well.
Kyle also suggests the fix to solve this issue, and this is an issue we are working on.
We take documentation very seriously. But apparently, that is not quite enough. Kyle brought a whole new level of thoroughness into this process. He went through the website and our docs finding a bunch of issues. Of course, they all seem pretty obvious once pointed out. To name a few in his own words:
During the Jepsen testing effort, there were some of those situations where Murphy’s law kicked in, and we were not sure if it was appropriate to laugh or cry.
During the testing of transactional guarantees, Kyle wanted to test cases when he could query keys across tables in one transaction, or perform reads and writes as a part of the same transaction. While these were features we could implement, the question was one of if we should. Kyle writes about this in the report:
We created the YCQL API to be used at Snapshot isolation level, with the primary goal of dealing with massive scale (say 100s of nodes) at very low latencies. In the spirit of keeping the API foolproof for the end user, we decided against adding these query types since it could cause more confusion. Kyle notes this as well.
The full serializable isolation is now supported in the YSQL API. We plan to do Jepsen testing of the YSQL API this summer as part of its general availability release.
We knew the 1.1.9 release had an issue which had already been fixed in 1.1.10. So we wanted to Kyle to start the Jepsen testing on the 1.1.10 release but hadn’t yet updated the download and install documentation on the site. And Kyle (rightfully so) ended up starting with the 1.1.9 release!
In order to enable debugging certain failure scenarios, we used the
libbacktrace library to print stack traces of code-paths. When adding this feature, we thought that this library would rarely be exercised. Only when there is a failure scenario (which is rare) and it needs to be debugged by turning on very verbose logging (which is even rarer). But Kyle hit this precise issue, which he writes about:
This was easy enough to fix by switching to a different library.
Continuous improvements to YCQL features is an on-going effort. Additionally, we plan to run Jepsen tests on the PostgreSQL-compatible YSQL API, in preparation for its general availability in the 2.0 release (Summer 2019). YSQL is also built on the same underlying distributed document store as YCQL. Kyle has all of this in his report:
- Download the official Jepsen report as well as sign-up for the “Reviewing Jepsen Test Results for Correctness in YugabyteDB 1.2” webinar on April 30 hosted by Kyle Kingsbury and Karthik Ranganathan.
- Compare YugabyteDB in depth to databases like CockroachDB, Google Cloud Spanner and Amazon Aurora.
- Get started with YugabyteDB on macOS, Linux, Docker, and Kubernetes.