Distributed SQL Deep Dive: Inside YugabyteDB’s Two-Layer Architecture
YugabyteDB is a 100% open source, distributed SQL database system. This single phrase expresses two distinct notions: a SQL database system, and a distributed database system. Historically, these notions were mutually exclusive. But current technology allows a single system to implement both notions. YugabyteDB does this with its two-layer architecture: an extensible query processing layer and a distributed document store.
In this blog post, we explain how YugabyteDB’s two-layer architecture works and compare it against other popular databases. We also examine the benefits it brings to cloud native applications.
Enterprise applications execute complex queries and transactions that go beyond the simple storing and loading of values or documents by a unique key. This is why relational databases and SQL were invented: to provide a consistent view of the enterprise information system, even when updated by concurrent access and complex business transactions.
This has been easier with a centralized single point of truth: monolithic databases. However, there is a need to scale beyond one server, to ensure availability of data beyond any failure or maintenance on servers and provide high performance at web-scale. This is why NoSQL databases removed many features to partition the database and access it across multiple nodes that don’t share any state.
Today’s database technology allows a combination of these features: distribute data into multiple servers and allow fully consistent cross-node transactions. The algorithms were there for a long time (i.e. Lamport Clocks, the Raft consensus protocols, and hash and range sharding). But what has changed is the infrastructure (i.e., cloud, containers, and automation) and hardware (i.e., SSD instead of mechanical drives and processors with more cores).
DocDB functions as the storage layer—or lower half—of YugabyteDB’s two-layer architecture. It distributes everything, such as storage and transactions. The internal API operates on a batch of key-value read/write operations. This is similar to many NoSQL databases. However, YugabtyeDB is fully ACID compliant, so the consistency processes as additional data, with transaction control operations and provisional records (which we call intents).
Even single-row operations need this transaction layer to provide strong consistency with global secondary indexes. A single-row transaction requires multiple read/write operations, with all ACID properties. This layer is mandatory for the SQL persistence but also to be fully ACID for key-value access. The DocDB layer is YugabyteDB’s strongly consistent and highly available distributed document database. An optimized RocksDB fork running each shard handles DocDB’s low-level storage. But the major Yugabyte innovation is the distributed durability, transaction, and replication provided by DocDB.
YugabyteDB’s query processing layer, referred to as YSQL, provides a lot more than the API, protocol, SQL syntax, and behavior. Since DocDB processes all distributed transaction, persistence, and availability concerns, the main characteristic of this query processing layer is its stateless nature. This allows for more complex processing, without the constraints of synchronizing states and caches across multiple nodes.
In the following sections, I’ll compare YugabyteDB’s two-layer architecture with other architectures. I’ll also explain which layer completes an operation within the database.
The stateless property of YugabyteDB’s SQL processing layer is what allows all current—and future—features that are only possible with this architecture. Other distributed SQL databases can add complex SQL features, but they must be integrated one at a time, with inherent limitations. This is because they have to run on a stateful distributed component. And each new feature on a single-layer distributed database comes with more complexity—and less scalability—because there is a shared structure to synchronize.
I’ll first compare YugabyteDB with Oracle RAC, the scale-out solution for Oracle Database. RAC is a shared-everything architecture. Each shared component brings new limitations, such as rolling upgrades not being possible for major versions.
Of course, RAC is an awesome piece of technology, but it can be complex—and expensive. And it solves only part of the problem (i.e., high availability of the instance but not of data). You need to combine it with other maximum availability architecture (MAA) components.
Amazon Aurora’s storage layer is distributed, to scale it out, but the transaction layer is still interlinked with the SQL layer. This means that write operations cannot scale out, so they’re fast but bound to one server only. There’s also the shared buffer cache directly accessed by the query processing layer. This means that scaling-out the reads still need some synchronization with remote caches.
Both of these databases have limited scale-out capabilities, constrained by short distance and one active writer for shared data. In YugabyteDB, with the storage and transaction processing pushed down to the lower layer, we have a stateless layer for complex processing of user commands and queries.
At the query processing layer, “compatible” is a weak word: YSQL is actually re-using PostgreSQL. This popular, open source relational database plugs into the top of the distributed storage layer. Reusing PostgreSQL opens up access to all the features and ecosystem of applications, tools, and frameworks. But this would not be possible in a stateful layer where each module would have to synchronize across nodes.
For example, PostgreSQL reads and writes go through its (local) shared buffer cache. This cannot scale out as-is. But YugabyteDB’s query processing layer doesn’t need it: tuples go to DocDB rather than the shared_buffers.
This architecture choice is not only about current features. It allows YugabyteDB to evolve with newer versions of PostgreSQL, including new APIs, protocols, and languages. In short, this two-layer architecture gives back everyday.
Because PostgreSQL is the most popular open source database that is close to the SQL standard, many databases aim at PostgreSQL compatibility. But with PostgreSQL as a query processing layer, all features come complete with identical behavior. The effort is mostly in optimizing them, making the query planner aware of the distribution, and pushing down some operations to limit moving data around. This is also a must for an open source database because it allows PostgreSQL contributors to add features without the need to understand the deepest details of the distributed protocols.
DocDB gets the read operations as queries on a key (for “point” queries, known as unique key lookups in the SQL world) or a range of keys (to seek and read). The change vectors for a key send the write operations. Those operations target “tablets” which are splits of tables and indexes (which are the same structure in DocDB).
The distribution to the tablets looks like a sharded database, and you may see some similarities with CitusDB. If you do, it is important to understand that YugabyteDB shards at a different layer. With Citus, sharding finishes early in the SQL processing. This means that as soon as the SQL statement splits into subqueries by a coordinator, the subqueries run on small monolithic databases. There are no cross-shard foreign keys, no global indexes, and no high performance cross shard transactions. This is great for a SQL data warehouse, or for multi-tenancy with a small number of isolated tenants.
But in order to run distributed SQL, even the simplest one-row insert is a cross-shard transaction. This is why YugabyteDB shards at the DocDB level. The transformation from complex SQL to key-value operation finishes up front, allowing all complex SQL features. Those key-value operations distribute with their transactional consistency intact. Without this DocDB layer, it would be impossible to get all SQL features and have shared-nothing scalability at the same time.
Something else happens in this DocDB level, which is probably the most important one for high availability: replication. Synchronous replication improves fault tolerance for the cluster, with no data loss. This, in turn, improves availability when there is a node failure. In monolithic databases, database administrators hesitate between logical and physical replication.
For example, Amazon RDS streams the disk block changes to the standby site. Oracle Data Guard streams the redo vectors with the files and block changes. PostgreSQL sends the WAL, usually as full blocks. This has many advantages: you receive the same database in the replica, with the same performance when you failover to it. However, it lacks the agility that we need in a cloud-native environment.
Logical replication provides all agility: across different platforms, different regions, and even the ability to filter only part of the database. You can also dream about two-way replication. This is possible as you replicate logical operations and apply them as soon as you have the same logical schema. However, when applied at the query processing layer, the statements you replicate are complex, involving multiple tables, indexes, and transactions. This brings huge complexity to the replication software but also to the database administration. This is because multi-row and multi-table operations will have conflicts and will not be applied at the same time as in the source.
Solutions like Oracle GoldenGate handle those conflicts, but it is always a complex project to manage. This is where YugabyteDB finds the right balance. At the DocDB level, there are logical operations that replicate to nodes running different versions of the software. But there are no conflicts because the sharding was already done upstream, with each tablet having multiple tablet peers on different nodes. All the complexity of SQL queries and transactions has already been decomposed into simpler operations by the query processing layer. This makes logical replication simpler, autonomous, and safe—as reliable as physical replication, and as agile as logical replication.
YugabyteDB’s two-layer architecture is a real innovation for distributed SQL databases. Because they were single node only, legacy databases had their SQL executor reading and writing to the blocks in the same format that would be stored on the hard drive. From there, high availability was added by replicating those blocks.
For the future of distributed SQL databases, YugabyteDB has an intermediate layer where all changes are appended in an ordered sequence of simple per-tablet operations. This layer can be protected with Write Ahead Logging, and replicated with the Raft protocol to nearby or distant nodes—in sync or async—with the same or different database version.
To learn more about YugabyteDB’s two-layer architecture, check out this tech talk with the creators of the database.