Distributed ACID Transactions with High Performance
ACID transactions are a fundamental building block when developing business-critical, user-facing applications. They simplify the complex task of ensuring data integrity while supporting highly concurrent operations. While they are taken for granted in monolithic SQL/relational databases, distributed NoSQL/non-relational databases either forsake them completely or support only a highly restrictive single-row flavor (see sections below). This loss of ACID properties is usually justified with a gain in performance (measured in terms of low latency and/or high throughput).
YugabyteDB is a high-performance distributed SQL database that supports fully distributed ACID transactions across multiple rows, multiple shards, and multiple nodes at any scale. Just as an inspirational leader once said, Yes We Can! This post highlights the state of ACID transactions in distributed databases today and also explains how YugabyteDB makes distributed ACID transactions work without compromising on high performance.
A transaction is a sequence of operations performed as a single logical unit of work. A transaction has four key properties — Atomicity, Consistency, Isolation and Durability — commonly abbreviated as ACID.
- Atomicity refers to all the work in a transaction being treated as one atomic unit — either all of it is performed or none of it is.
- Consistency ensures that the database is always in a consistent internal state. For example, in the case of tables with secondary indexes, the primary table and all the index tables should be consistent after an update.
- Isolation determines how/when changes made by one transaction become visible to the other. Serializable and Snapshot Isolation are the top 2 isolation levels from a strictness standpoint.
- Durability ensures that the results of the transaction are permanently stored in the system. The modifications must persist even in case of power loss or system failures.
In the context of distributed databases, ACID transactions can be internally classified into the following three flavors.
Transactions where all the operations impact only a single row (aka key) are called as single-row ACID transactions. Since the data for a single row typically doesn’t cross the boundary of a single node in most distributed databases, single-row ACID transactions are easier to achieve in the distributed DB world. However, most NoSQL databases cannot support even this flavor of transactions because of their eventually-consistent storage engines that has no inherent guarantee on the correctness of the data read. Notable exception is MongoDB, which supports multi-document distributed transactions.
A marginal improvement over single-row ACID is single-shard ACID where all the rows involved in the operations of a transaction are located in a single shard of a distributed database. Since a single shard is always located inside a single node, this flavor too doesn’t involve coordinating transaction operations across multiple nodes and hence is easier to implement in a distributed DB. E.g. MongoDB announced the intent to support single-shard transactions starting in its 4.0 release.
In auto-sharded distributed databases like YugabyteDB, Google Cloud Spanner, and Azure Cosmos DB, shards are spread across multiple nodes by design on cluster creation. A transaction that impacts a set of rows distributed across shards on multiple nodes is called a distributed ACID transaction. Implementing distributed ACID transactions in scale-out databases requires the use of a transaction manager that can coordinate the various operations and then commit/rollback the transaction as needed. Popular NoSQL databases are designed to avoid this additional complexity in the fear that they will have to compromise on performance (in the form of increase in write latency and decrease in linear scalability). As we see below, this fear is not grounded in reality and it is indeed possible to increase application developer agility through a single distributed database that has both distributed ACID transactions and high performance. FoundationDB, open sourced by Apple, follows a similar design philosophy where transactions are handled using a transactional authority.
YugabyteDB’s sharding, replication, and transactions design is inspired by the original Google Spanner paper published in 2012. The goal is to serve distributed transactions with high performance and without compromises to correctness. Since distributed transactions are not the only kind of transactions apps need, YugabyteDB efficiently detects and manages different scenarios involving single-row/single-shard transactions (using per-shard consensus without any 2 Phase Commit) and distributed transactions (with a 2 Phase Commit). For the sake of simplicity, we are going to look at the general case of how distributed transactions operate on multiple keys spread across many nodes.
The following steps outline the lifecycle of a transaction in YugabyteDB.
YugabyteDB has a built-in transaction manager to manage the lifecycle of transactions. The transaction manager runs on every node (as a part of the tablet server process) in the cluster and automatically scales out as nodes are added.
Since the transaction manager is stateless, incoming transactions can be routed to any node. Any transaction manager in the cluster can manage incoming transaction. In order to optimize the performance of the transaction, the Yugabyte Query Layer (YQL) tries to schedule the transaction on the tablet server that owns most of the data accessed by the transaction.
Transactions need to be tracked in a reliable and fault-tolerant manner. This is essential since a transaction might attempt to update multiple keys, but could fail or get aborted at any intermediate step.
In order to track the transaction, a new entry is created for the transaction in a transaction status table. The following information is stored as a part of this transaction entry.
- Transaction ID which is a UUID that uniquely identifies the transaction.
- Status which can be one of pending, committed, or aborted. All transactions start out in the pending status, and progress to the committed or aborted status, in which they remain permanently until cleaned up.
- Commit hybrid timestamp which is the commit timestamp for the transaction. This is the final timestamp used for multi-version concurrency control (MVCC) to serve the various updates made by the transaction if it gets committed.
- List of IDs of participating tablets which are the final set of tablets that the transaction has written to. Read more about this in step #6 which describes how provisional writes are cleaned up.
Even at this point in the transaction’s lifecycle, it is not possible to predict if any of the updates will conflict with the updates of another transaction. Therefore, YugabyteDB writes provisional records to all tablets responsible for the keys the transaction is trying to modify. These records are provisional because they are invisible to readers until the transaction commits.
In addition to storing the data being updated by the transaction, the provisional records also act as a persistent revocable lock. These locks, represented by provisional records, can be revoked by another conflicting transaction. The conflict resolution subsystem makes sure that for any two conflicting transactions, at least one of them is aborted.
Furthermore, transaction metadata records are also stored in order to efficiently find the following information for a given transaction:
- the transaction coordinator
- the priority, to decide which of the conflicting transactions to abort
- all provisional records written as a part of the transaction in this tablet
This is done in order to increase the performance of the commit step of the transaction discussed in Step 4b below.
When multiple transactions run concurrently, they may try to update the same set of keys. When this happens, there is a likelihood that these updates might violate the isolation guarantees of the running transaction unless conflicts are detected and handled. The set of conflicting updates depend on the operation (read vs. write) and the isolation level guarantee (Serializable vs. Snapshot Isolation).
The table below lays out a simple view of the conflicts that must be detected and resolved. The resolution is generally to internally restart one of the conflicting transactions at a later point of time when the conflict can be resolved.
YugabyteDB currently detects and handles write-write conflicts at Snapshot Isolation level automatically. The other conflicting operations will be handled and resolved automatically as part of the upcoming Serializable isolation level support.
Once the transaction manager has successfully written all the provisional records, it proceeds to commit the transaction by sending an RPC request to the transaction status tablet. The commit operation will only succeed if the transaction has not yet been aborted due to conflicts. The atomicity and durability of the commit operation itself is guaranteed by the transaction status tablet’s Raft group. Once the commit operation is complete, all provisional records immediately become visible to clients.
The commit timestamp is chosen as the current hybrid time at the transaction status tablet at the moment of appending the transaction committed entry to its Raft log. It is then used as the final MVCC timestamp for regular records that replace the transaction’s provisional records when provisional records are being applied and cleaned up.
Once the transaction has been committed, the transaction manager acknowledges back to the client that the transaction was successful.
The transaction status tablet sends cleanup requests to each of the tablets that participated in the transaction. This can be done efficiently since the list of IDs of participating tablets is stored as a part of the transaction entry.
The clean up request contains a special apply record with the transaction ID and commit timestamp. Upon receiving this request, the tablet removes the provisional records belonging to the transaction, and writes permanent records with the correct commit timestamp.
Once all participating tablets have successfully processed these apply records, the status tablet can delete the transaction status record. The deletion of the status record happens by writing a special applied everywhere entry to the Raft log of the status tablet. Raft log entries belonging to this transaction will be cleaned up from the status tablet’s Raft log as part of regular garbage-collection of old Raft logs soon after this point.
Now the data is ready to be served as documented in Transactional Read path.
There are a number of techniques used in YugabyteDB in order to achieve high performance while retaining distributed ACID transactional guarantees. A few of these are outlined below.
A given transaction that is in progress may want to look up its own metadata (such as the tablets it has updated). Other transactions may also want to lookup information about conflicting transactions and abort them depending on the relative priority. YugabyteDB aggressively caches information for transactions that are in progress to make the lookups very fast and efficient.
YugabyteDB supports fine-grained locking in order to perform conflict resolution. This is essential to handle distributed transactions with performance in a document-oriented database (which YugabyteDB is at its core). Without fine-grained locks, transactions that update non-overlapping attributes in a document may end up contending on one another.
The image below shows the types of fine-grained locks YugabyteDB has implemented.
For example, if a transaction modifies a single column within a row in a table, the following fine-grained locks might be taken:
- A weak lock is taken on the row (document root) to ensure that it cannot be simultaneously deleted by another transaction, but allows transactions that update columns to continue in parallel.
- A strong lock is taken on the column being updated to serialize the updates to that column in order to guarantee consistency.
Every read request in YugabyteDB is assigned a particular hybrid time (for MVCC) — the read hybrid timestamp. This allows write operations to the same set of keys to happen in parallel with reads, ensuring high performance.
It is crucial, however, that the view of the database as of this read hybrid timestamp is not updated by concurrently happening writes. This is done to ensure that successive reads of the data at a particular timestamp do not see different results if the read operation is retried. In order to achieve this, YugabyteDB uses the concept of hybrid time leader leases. YugabyteDB handles the intricate dependencies between the following operations to ensure correctness:
- a single row update for a key
- a provisional record written for a row before a transaction is finalized
- reading the value of the row from the DB
YugabyteDB has been specifically designed to be able to detect two flows without compromising correctness.
- The single-row ACID transactions are optimized to have low latencies when there is no conflicting operation.
- In case there are conflicting operations, or the query is attempting a distributed/multi-shard transactional operation where provisional records need to be written, YugabyteDB automatically switches to preserving correctness even if it means higher latencies.
Thus, YugabyteDB allows batching of write operations with single-row ACID semantics to achieve very high performance on streaming ingest, while at the same time it allows transactional updates to a set of rows with consistency.
Transactions are an essential feature when developing user-facing web and mobile applications. Typically, a large percentage of workloads in such an application requires high performance with single-row ACID guarantees, while a small percentage of the workloads require distributed ACID transactions with extreme data integrity. YugabyteDB is designed to strike a good balance in addressing both scenarios. The end goal is radical simplification of application development through a unified data platform that brings together SQL, internet-scale growth, and global distribution. Two areas we continue to work on are Jepsen testing for data correctness validation under failure conditions and additional performance benchmarks for workloads with distributed transactions. Stay tuned for results.
If our approach to database design sounds exciting, try out YugabyteDB — it is Apache 2.0 open source! Install it today, and check out how distributed ACID transactions work on your laptop in a few minutes. Also, don’t forget to support our GitHub project with your stars, watches, and forks. 🤓
- Review results of Do-It-Yourself Jepsen testing (formal certification is on roadmap).
- Compare YugabyteDB in depth to databases like CockroachDB, Google Cloud Spanner, and MongoDB.
- Get started with YugabyteDB on macOS, Linux, Docker, and Kubernetes.
- Contact us to learn more about licensing, pricing, or to schedule a technical overview.