Came across the image below here and this made me smile. Not because of the implied complexity of choosing a database, but the reality with which this flow chart captures the state of the database world today in 2017. Of course, running whatever database you end up choosing in production is a whole another order of complexity.
I have been working on distributed systems for the last 10+ years. Worked on Cassandra before it was open-sourced or even called Cassandra, and worked on HBase at Facebook for a bunch of use-cases holding 100’s of petabytes of data. All of this started at the time period before NoSQL was a thing, and before people were reading about NoSQL vs SQL till their eyeballs fell out. So wanted to share some of my thoughts.
Initially, RDBMS’s were the dominant databases. Needed to scale? Get a beefier machine. This was the age when DB2, Oracle, SQLServer, MySQL, Postgres and others became popular. Soon it became clearer that purely scaling up RDBMSs by getting beefier machines is not going to solve the problem of storing their data — data which started pouring in like a firehose. And even if it somehow did, it would be prohibitively expensive.
The natural next step of the solution was to build on top of sharded SQL databases. The single node SQL database was split into multiple logical databases, data was sharded at the application layer and replication/HA managed by the operations team. This meant giving up on the purely “relational” aspects of the RDBMS — such as joins, foreign keys, etc and denormalizing the schema. This solution is a workable one, though obviously very hard one to implement across the board (from the application development teams to the operational teams).
If you think about it, the above solution is very repeatable — especially if the users of the database have mentally prepared themselves to give up the relational aspects of the RDBMS. The solution can be expressed very concretely:
- implicitly shard data to enable scale out
- automatically replicate data to survive failures
- seamlessly handle node failures to provide HA (using replicas)
- allow for a simple way to express sharding, querying, replication
In order to provide the above solution, the “NoSQL” databases were born. But the first generation databases focussed a lot on the language used to express sharding and querying, and not enough on the production grade stability, consistency and data replication/HA guarantees. The Wikipedia definition of NoSQL from here is:
A NoSQL (originally referring to “non SQL” or “non relational”) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
So the focus quickly became how SQL as a language was not ideal, and how you needed a different programmatic API with concepts such as byte keys, columns, column families, etc. The consistency, replication and HA guarantees somehow got lost in all the confusion, resulting in databases that are not quite workable for production grade systems. And to rub salt on that wound, somehow Redis gets classified as a NoSQL database — I always think of it as a Memcache++. Thanks to all this, to the end user doing their research on what system to use, the database space is incredibly complicated.
Instead of NoSQL being “non SQL” databases, the focus should instead have been on “non relational SQL” databases (yeah, the name doesn’t sound too cool). Folks realized that changing the query paradigm should not have been the core focus. So, a bunch of databases were built or enhanced with a SQL-like query language, and the term NoSQL simple got changed to “Not only SQL”. Problem solved? Not quite – the consistency, replication and HA guarantees are still not addressed.
So we come to today, when sharded (or scale up in some cases) SQL databases are the most reliable for running mission critical OLTP applications. NoSQL databases (and Redis) are used along side them for the ever-growing datasets or faster data access. This makes running data infrastructure in production very hard as there are multiple systems and data has to be synchronized across them. And if you’re running across multiple regions/DCs or in a cloud environment, “very hard” is a big understatement.
One of the readers asked about how a clustered SQL solution fits into all this. This is a great question.
Here is a link describing what a clustered SQL solution is: https://www.brentozar.com/archive/2012/02/introduction-sql-server-clusters/
The section on what clustering does not help with in the above link is relevant.
1. Clustering won’t improve your performance. So this is not a scale out solution, its a scale up solution with HA
2. Clustering doesn’t save you space or effort for backups or maintenance. So it does not make things operationally easy
3. Clustering also won’t help you scale out your reads. Again not a scale out solution.
4. Clusters won’t give you 100% uptime. So its a higher level of HA, but still has manual bits.
Therefore, clustering falls in the “scale up” category. Clustering can be combined with manual sharding to achieve scale out and HA, but it is an extremely complicated solution to build.
Would love to hear from folks running or thinking about running sharded SQL and/or NoSQL in production — what are your thoughts? What issues do you face?