Our post 6 Signs You Might be Misunderstanding ACID Transactions in Distributed Databases describes the key challenges involved in building high performance distributed transactions. Multiple open source ACID-compliant distributed databases have started building such transactions by taking inspiration from research papers published by Google. In this post, we dive deeper into Percolator and Spanner, the two Google systems behind those papers, as well as the open source databases they have inspired.
Percolator is Google’s internal-only system used to make incremental updates to the Google search index. Google published its architecture (shown below) as a research paper in 2010. Percolator achieved a 50% reduction in the delay between page crawling (the first time Google finds a page) and page availability in the search index, in comparison to the MapReduce-based batch processing system it replaced.
Percolator is designed on top of BigTable, Google’s original wide-column NoSQL store first introduced to the world in 2006. It adds cross-shard ACID transactions using a two-phase commit protocol on top of BigTable’s single row atomicity. This enhancement was necessary because the process of updating an index is now divided into multiple concurrent transactions, each of which has to preserve invariants such as the highest PageRank URL has to be always present in the index and link inversion to anchor text should work across duplicates.
Isolation Levels & Time Tracking
Percolator provides Snapshot Isolation, implemented using MVCC and a monotonically increasing timestamp allocated by a Timestamp Oracle. Every transaction requires contacting this Oracle twice, thus making the scalability and availability of this component a significant concern.
As highlighted in the Percolator paper itself (see below), it’s design is not suitable for an OLTP database where user-facing transactions have to be processed with low latency.
The design of Percolator was influenced by the requirement to run at massive scales and the lack of a requirement for extremely low latency. Relaxed latency requirements let us take, for example, a lazy approach to cleaning up locks left behind by transactions running on failed machines. This lazy, simple-to-implement approach potentially delays transaction commit by tens of seconds. This delay would not be acceptable in a DBMS running OLTP tasks, but it is tolerable in an incremental processing system building an index of the web. Percolator has no central location for transaction management; in particular, it lacks a global deadlock detector. This increases the latency of conflicting transactions but allows the system to scale to thousands of machines.
Open Source Databases Inspired By Percolator
TiDB / TiKV
TiDB, a MySQL-compatible distributed database, uses Percolator as the inspiration for its distributed transaction design. This choice makes sense given TiDB’s focus on throughput-oriented Hybrid Transactional/Analytical Processing (HTAP) use cases that are not latency sensitive and usually run in a single datacenter. TiKV is the underlying distributed key-value layer.
Percolator’s primary limitation is the lack of multi-region distributed transactions with low latency. Such transactions are absolutely necessary to power geo-distributed OLTP workloads (think Gmail, Calendar, AdWords, Google Play and more). Google’s answer to this problem is another internal project called Spanner, first introduced to the world in 2012. Spanner started out in 2007 as a transactional NoSQL store with a key-value API and support for cross-shard transactions, external consistency, and transparent failover across datacenters. Since then it has evolved into a globally distributed SQL-based RDBMS that today underpins almost every mission-critical service at Google. A subset of the Spanner system was made publicly available in 2017 on the Google Cloud Platform as a proprietary managed service called Google Cloud Spanner.
Storage & Replication Architecture
Unlike Percolator, Spanner’s architecture is not based on BigTable. It resembles Megastore more closely and uses Colossus as its file system. A Spanner deployment is called a “universe”. A universe auto shards and auto rebalances data across many sets of Paxos state machines located in multiple zones at datacenters spread all over the world. These shards are also replicated for global availability and geographic locality; clients automatically failover between replicas.
A zone has one zonemaster and possibly 1000s of spanservers. The former assigns data to spanservers; the latter serve data to clients.