Enhancing RocksDB for Speed & Scale
As described in our previous post “How We Built a High Performance Document Store on RocksDB?”, YugabyteDB’s distributed document store (DocDB) uses RocksDB as its per-node storage engine. We made multiple performance and data density related enhancements to RocksDB in the course of embedding it into DocDB’s document storage layer (figure below). These enhancements are distributed as part of the YugabyteDB open source project. The goal of this post is to deep dive into these enhancements for the benefit of engineering teams interested in leveraging RocksDB beyond its original design intent of a fast monolithic key-value store.
Document Storage Layer in DocDB
We enhanced RocksDB’s block cache to become scan resistant. This prevents operations such as long-running scans (e.g., due to an occasional large query or a background Spark job) from polluting the entire cache with poor quality data and wiping out useful/hot data. The new block cache uses a MySQL/HBase like mid-point insertion strategy where the LRU is divided into two portions and multiple touches to a block are required before it is promoted to the multi-touch/hot portion of the cache.
RocksDB’s SSTable files contain data and metadata such as indexes & bloom filters. The data portion of an SSTable file was already chunked into blocks (32KB default) and demand-paged.
However, the bloom filter & index portions were monolithic and needed to be brought into memory in an all-or-nothing manner. For large datasets, this put unnecessary pressure on memory requirements while also causing fragmentation.
We enhanced RocksDB’s index and bloom filters to be multi-level/block-oriented structures so that these metadata blocks can be demand-paged into the block cache much like data blocks. This enables YugabyteDB to support very large data sets in a RAM efficient and memory allocator friendly manner.
DocDB auto shards tables into multiple tablets. It dedicates one RocksDB instance per tablet as opposed to sharing a single RocksDB instance across multiple tablets on a node.
- Cluster rebalancing on node failure or node addition becomes extremely efficient because the SSTable files of tablets being rebalanced can be copied as is (in their compressed form) from tablet leaders. Unlike a scheme where multiple tablets share a single RocksDB instance, no logical scan or splitting of SSTable files is required. And no need to wait for compactions to reclaim storage from the nodes a shard previously lived on!
- Deleting a table is as simple as dropping the related RocksDB instances.
- Allows for per-table storage policy
- On-disk compression – on/off? which compression algorithm? etc.
- In-memory delta-encoding scheme – e.g., prefix-compression or diff encoding
- Allows for per-table bloom-filter policy. For tables with compound primary keys (say
(<userid>, <message-id>)), which portion(s) of the key are added to the bloom-filter depends on the access pattern to optimize for. Adding the entire compound key to the bloom is useless, and pure overhead, if application queries are commonly going to provide only one portion of the compound key (say just the
- Allows DocDB to track min/max values for each clustered column in the primary key of a table (another enhancement to RocksDB) and store that in the corresponding SSTable file as metadata. This enables YugabyteDB to optimize range predicate queries like by minimizing the number of SSTable files that need to be looked up:
SELECT metric_val FROM metrics WHERE device=? AND ts < ? AND ts > ?
RocksDB already allows for running multiple instances within a single process. However, in practice, using RocksDB in this form has required some careful engineering & enhancements. We have listed some of the important ones here.
DocDB uses a shared block cache across all instances of RocksDB on the server. This avoids per-tablet cache silos and increases effective utilization of memory.
RocksDB allows a per-memstore flush size to be configured. This is not sufficient in practice because the number of memstores may change over time as users create new tables, or tablets of a table move between servers due to load balancing. Picking a very small value for the per-memstore flush size results in premature flushing and increases write amplification. On the other hand picking a very large per-memstore flush size, for a node with lots of tablets, increases memory requirement on the system and also the recovery time (in case of server restart).
To avoid these issues, we enhanced RocksDB to enforce a global memstore limit. When the memory used by all memstores reaches this limit, the memstore with the oldest record (determined using hybrid timestamps) is flushed.
RocksDB supports a single compaction queue with multiple compaction threads. But this leads to scenarios where some large compactions (by way of the amount of data to read/write) get scheduled on all these threads and end up starving the smaller compactions. This leads to too many store files, write stalls, and high read latencies.
We enhanced RocksDB to enable multiple queues based on the total input data files size in order to prioritize compactions based on the sizes. The queues are serviced by a configurable number of threads, where a certain subset of these threads are reserved for small compactions so that the number of SSTable files doesn’t grow too quickly for any tablet.
DocDB supports a just-a-bunch-of-disks (JBOD) setup of multiple SSDs and doesn’t require a hardware or software RAID. The RocksDB instances for various tablets are balanced across the available SSDs uniformly, on a per-table basis to ensure that each SSD has a similar number of tablets from each table and is taking uniform type of load. Other types of load balancing in DocDB are also done on a per-table basis, be it:
- Balancing of tablet replicas across nodes
- Balancing of leader/followers of Raft groups across nodes
- Balancing of Raft logs across SSDs on a node
As noted in our previous post, DocDB manages transactional processing, replication, concurrency control, data time-to-live (TTL), and recovery/failover mechanisms at the overall cluster level as opposed to the per-node storage engine level. Therefore, some of the equivalent functionality provided by RocksDB became unnecessary and were removed.
DocDB uses the Raft consensus protocol for replication. Changes to the distributed system, such as row updates, are already being recorded and journaled as part of the Raft logs. The additional WAL mechanism in RocksDB is unnecessary and would only add overhead.
DocDB avoids this double journal tax by disabling the RocksDB WAL, and instead relies on Raft log as the source of truth. It tracks the Raft “sequence id” up to which data has been flushed from RocksDB memtables to SSTable files. This ensures that we can correctly garbage collect Raft logs, as well as, replay the minimal number of records from Raft WAL logs on a server crash or restart.
DocDB manages MVCC on top of RocksDB at the overall cluster level. The mutations to records in the system are versioned using hybrid timestamps. As a result, the notion of MVCC as implemented in a vanilla RocksDB using sequence ids adds overhead. For this reason, DocDB does not use RocksDB’s sequence ids, and instead uses hybrid timestamps that are part of the encoded key to implement MVCC.
Hybrid Logical Clock (HLC), the hybrid timestamp assignment algorithm, is a way to assign timestamps in a distributed system such that every pair of “causally connected” events results in an increase of the timestamp value. Please refer to these reports (#1 or #2) for more details.
DocDB is the distributed document store that powers YugabyteDB. RocksDB’s immense popularity as a fast embeddable storage engine combined with its Log-Structured Merge trees (LSM) design and C++ implementation were the critical factors in selecting it as DocDB’s per-node storage engine. Every row managed by YugabyteDB is stored as a document in DocDB that internally maps to multiple key-value pairs in RocksDB.
As we started building the document storage layer in DocDB, we realized that we need to enhance RocksDB significantly. This is because each RocksDB instance could no longer operate in isolation. It needed to share resources with other RocksDB instances present on the same node. Additionally, we needed to ensure that each DocDB tablet (that maps to its own dedicated RocksDB instance) can grow to become arbitrarily large without impacting the performance of other tablets (and the associated RocksDB instances). We hope the enhancements highlighted in this post can help other engineering teams looking to solve similar performance and scalability challenges on top of RocksDB.