An Introduction to Distributed SQL: Glossary of Terms
In 2017 we introduced YugabyteDB, an open source, high-performance, cloud native database for mission-critical applications. As a team, we have worked firsthand on a number of databases such as Apache HBase, Apache Cassandra (from even before it was open sourced), Oracle, and RocksDB. We were the team that built and ran Facebook’s NoSQL platform that powered a number of user-facing, real-time applications. Over the years, we have had the opportunity and good fortune to become intimately familiar with what it takes to power mission-critical applications in a cloud-native architecture.
In this blog post, we will take a step back and look at some of the fundamentals and history of databases.
Many enterprises use a relational database management system (RDBMS), which is a database system that allows administrators to create, update, and delete records in a relational database. A relational database leverages a relational model to identify and access data in relation to another data point or other tables in the database. Within an RDBMS, data is often organized into one or more tables made up of columns and rows.
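To make the table, row, and relation vocabulary concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables, their columns, and the sample values are hypothetical and chosen only for illustration. SQLite stands in for any RDBMS here.

```python
import sqlite3


def run_crud_demo():
    """Create, update, and delete rows in two related tables, then join them."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database

    # Two tables made up of columns and rows; orders relate to customers
    # through the customer_id column.
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers(id),"
        " total REAL NOT NULL)"
    )

    # Create records.
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 25.0), (11, 2, 40.0), (12, 1, 5.0)],
    )

    # Update one record and delete another.
    conn.execute("UPDATE orders SET total = 30.0 WHERE id = 10")
    conn.execute("DELETE FROM orders WHERE id = 12")

    # Access data in relation to another table with a join.
    rows = conn.execute(
        "SELECT c.name, o.total FROM customers c"
        " JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_crud_demo())
```

The join at the end is the part that the relational model adds: each order is read in relation to the customer row it points to.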
The RDBMS has been prevalent for almost four decades now, and SQL has been the lingua franca of the relational database world during that time. Therefore, relational databases are also known as SQL databases. Traditional SQL databases include Oracle, PostgreSQL, and MySQL. These relational databases came of age during the 1980s and 1990s, when developers were focused on client-server applications to manage businesses. Traditional RDBMSs have been at the heart of enterprise use cases such as online transaction processing (OLTP), sales and employee data, customer relationships, supply chains, and more.
NoSQL databases like MongoDB and Apache Cassandra came into prominence in the mid-to-late 2000s. They were originally positioned as alternatives to SQL databases but fell short of their goal. SQL databases were monolithic at that time and the distributed nature of NoSQL was attractive to applications looking for scalability and resilience. Since the various NoSQL languages focused on single-row (aka key-value) data models and gave up on the relational/multi-row constructs of the SQL language, these databases could no longer be categorized as “SQL” databases. Hence, the name “NoSQL” was adopted. NoSQL originally stood for “No support for SQL” but later evolved to “Not only SQL” once users realized that NoSQL databases have to coexist alongside SQL databases rather than replace them entirely. The primary reason SQL databases remained necessary was the need for relational data modeling with support for single-row consistency as well as multi-row ACID transactions. Eventually-consistent NoSQL databases follow the Available and Partition-tolerant (AP) side of the CAP Theorem and hence are architecturally unfit for serving such consistency-first needs.
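A small sketch can show what a multi-row ACID transaction buys you. The example below uses Python's built-in `sqlite3` module on a single node, so it only illustrates atomicity across rows and says nothing about distributed behavior. The `accounts` table and the `make_accounts`, `transfer`, and `balances` helpers are hypothetical names made up for this illustration.

```python
import sqlite3


def make_accounts():
    """Build an in-memory table whose balances may never go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Update two rows as one unit; return True if the transaction committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back too, so no money is created.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for every row."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


if __name__ == "__main__":
    conn = make_accounts()
    print(transfer(conn, 1, 2, 30), balances(conn))
    print(transfer(conn, 1, 2, 500), balances(conn))
```

With a single-row (key-value) API, the credit and the debit would be two independent writes, and the application would have to detect and repair the case where only one of them succeeded.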
However, big changes have been happening in the NoSQL world in the last several years. Multiple old and new NoSQL databases alike have embraced one or more flavors of ACID transactions. NoSQL is becoming transactional with the goal of making application development, deployment, and operations simpler, faster, and more resilient than ever before. (SQL is also becoming distributed for exactly the same reason.) The good news with YugabyteDB is that you don’t have to choose between SQL and NoSQL because YugabyteDB brings the best of SQL and NoSQL together into a single database platform.
Read more about the differences between SQL and NoSQL i
NewSQL databases were introduced starting in the early 2010s to solve the write scalability challenge associated with monolithic SQL databases. 451 Research’s Matthew Aslett coined the term in 2011 to categorize these new “scalable” SQL databases.
NewSQL databases come in two distinct flavors:
- The first flavor simply provides an automated data sharding layer on top of multiple independent instances of monolithic SQL databases. For example, Vitess addresses this problem in the context of MySQL while Citus (aka Azure DB for PostgreSQL – Hyperscale) does the same for PostgreSQL. Since each independent instance is still the same old monolithic database, fundamental problems such as native failover/repair and distributed ACID transactions that span shards either remain impossible or perform poorly for real-world use cases. Worst of all, developers lose the agility that comes with interacting with a single logical SQL database.
- The second flavor includes the likes of NuoDB, VoltDB, and Clustrix, which built new distributed storage engines with the goal of keeping the single logical SQL database concept intact.
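The first flavor can be pictured as a thin routing layer in front of independent databases. The sketch below is a simplified, hypothetical hash-based router written in Python; it is not how Vitess or Citus actually place data, and the `ShardRouter` class and shard names are assumptions made for illustration. It shows why single-key operations stay simple while any operation touching several keys may span shards.

```python
import hashlib


class ShardRouter:
    """Hypothetical sharding layer: maps each key to one independent database."""

    def __init__(self, shard_names):
        if not shard_names:
            raise ValueError("at least one shard is required")
        self.shard_names = list(shard_names)

    def shard_for(self, key):
        """Pick a shard deterministically from a hash of the key."""
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shard_names)
        return self.shard_names[index]

    def shards_for(self, keys):
        """Return the set of shards that a group of keys touches."""
        return {self.shard_for(key) for key in keys}

    def is_single_shard(self, keys):
        """True when one monolithic instance can handle all keys locally."""
        return len(self.shards_for(keys)) <= 1


if __name__ == "__main__":
    router = ShardRouter(["mysql-0", "mysql-1", "mysql-2"])
    print(router.shard_for("user:42"))
    print(router.is_single_shard(["user:42", "user:43", "user:44"]))
```

Whenever `is_single_shard` returns `False`, the underlying monolithic instances cannot provide a transaction on their own, which is where the cross-shard ACID limitation described above comes from.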
NewSQL databases allowed multiple nodes to be used in the context of a SQL database without making any significant enhancements to the replication architecture. The cloud was still in its infancy at the time they came into prominence, so this strategy worked for a few years. However, as multi-zone, multi-region, and multi-cloud deployments became the standard for modern applications, these databases fell behind in developer adoption.
Enterprises are overwhelmingly moving to cloud-native applications powered by a microservices architecture. These applications run on elastic cloud infrastructure such as serverless frameworks and containers. And they are geographically distributed across multiple zones, regions, and clouds.
Traditional SQL databases have been monolithic in architecture and hence run on a single write node backed by specialized, high-cost hardware. Adding more automation in terms of replication, fast failover, backup/recovery, or running inside Kubernetes does not change the fact that a single write node is constrained by processing power and therefore cannot serve high-throughput transactional apps with low latency. Traditional SQL databases were not made for the needs of modern applications: massively scalable architecture, built-in mechanisms for failover and repair, and native multi-zone/multi-region/multi-cloud replication.
Architecturally, geo-distributed SQL databases bring the same horizontal scalability and extreme resilience capabilities to the data tier that are now taken for granted in the microservices and cloud infrastructure tiers. They do so without compromising on SQL as a flexible data modeling and query language while also supporting fully-distributed ACID-compliant transactions.
Read more in our posts:
- Rise of Globally Distributed SQL Databases – Redefining Transactional Stores for Cloud Native Era
- What is Distributed SQL?
In today’s containerized, multi-cloud world, the question to ask is: “Are there databases available today that can better exploit cloud native infrastructure?” The answer is yes.
Let’s start with the Cloud Native Computing Foundation (CNCF) definition for Cloud Native technologies.
One important principle that is not explicitly stated in the above definition is that of global distribution. Adrian Cockcroft, VP Cloud Architecture Strategy at AWS, rightly points out in Mapping Your Stack that the ability to become instantly global is core to cloud native technologies. This is because the modern cloud is inherently global across multiple regions and not simply a way to rent hardware in a single region.
If we apply the above definition to the current CNCF Landscape, a few databases stand out. A new breed of cloud native databases is rising to meet user demand, with scalability, resilience, observability, manageability, and portability built into the core architecture more so than in the legacy databases of the past. YugabyteDB aims to lead this pack with its unique approach of enabling multi-API, multi-model application development on top of a single transactional, high-performance, globally distributed storage engine. You can learn more in our post “Docker, Kubernetes, and the Rise of Cloud Native Databases”.