Q&A: Partitioning vs Sharding, Scaling Behavior, and Visualization Tools for YugabyteDB
Welcome back to the YugabyteDB Tips and Tricks series, where I have the pleasure of recapping distributed SQL questions from around the Internet. This month we will focus on the differences between partitioning and sharding in distributed SQL, how YugabyteDB’s unique implementation of RocksDB negates any scaling limitations, and the recommended visualization tools for YugabyteDB.
Let’s split this into two questions:
- What is table partitioning and sharding?
- How does YugabyteDB utilize partitioning and sharding?
First, partitioning (or table partitioning) is the process of splitting large tables into smaller, more manageable ones, providing multiple physically independent tables that are logically seen as one. Partitioning has long been used in traditional (single-node) databases to query large amounts of data efficiently. In essence, you are grouping multiple subsets of data on a single node based on the column you choose to partition on and a strategy such as range or list.
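To make the range and list strategies concrete, here is a minimal Python sketch of how a row could be routed to a partition. The table names, bounds, and region values are hypothetical, and this only illustrates the idea; it is not how the YSQL query layer is implemented.

```python
from bisect import bisect_right

# Hypothetical range partitions on an order year. Each bound is the exclusive
# upper limit of the partition at the same index; the first partition has no
# lower limit in this sketch.
RANGE_BOUNDS = [2021, 2022, 2023]
RANGE_PARTITIONS = ["orders_2020", "orders_2021", "orders_2022"]

# Hypothetical list partitions on a region column.
LIST_PARTITIONS = {"EU": "orders_eu", "US": "orders_us", "APAC": "orders_apac"}


def route_by_range(year: int) -> str:
    """Return the partition whose range contains the given year."""
    idx = bisect_right(RANGE_BOUNDS, year)
    if idx >= len(RANGE_PARTITIONS):
        raise ValueError(f"no partition defined for year {year}")
    return RANGE_PARTITIONS[idx]


def route_by_list(region: str) -> str:
    """Return the partition whose value list contains the given region."""
    if region not in LIST_PARTITIONS:
        raise ValueError(f"no partition defined for region {region!r}")
    return LIST_PARTITIONS[region]


if __name__ == "__main__":
    print(route_by_range(2021))
    print(route_by_list("EU"))
```

In both cases the partition is chosen from the value of the partition column alone, which is why the partitioning scheme has to be declared up front in the table definition.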
Sharding (or database sharding) is the process of breaking up large tables, indexes, or partitions into smaller chunks called shards (or tablets in YugabyteDB) that are then distributed across multiple servers based on a hash or range of the primary key. Sharding is similar to partitioning in that you are breaking up a table into smaller pieces. It is different in the sense that you are distributing these pieces across multiple nodes.
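As a rough illustration of hash-based sharding, the sketch below maps a primary key to a tablet and then to a node. The 16-bit hash space, the CRC32 stand-in hash, and the round-robin placement are assumptions chosen for the example; they are not YugabyteDB’s actual hash function or placement policy.

```python
import zlib

HASH_SPACE = 0x10000  # assumed 16-bit hash space (0..65535) for this sketch


def hash_key(primary_key: str) -> int:
    """Map a primary key to a point in the hash space (CRC32 as a stand-in)."""
    return zlib.crc32(primary_key.encode("utf-8")) % HASH_SPACE


def tablet_for(primary_key: str, num_tablets: int) -> int:
    """Return the index of the tablet whose hash range contains the key."""
    return hash_key(primary_key) * num_tablets // HASH_SPACE


def node_for(tablet: int, nodes: list) -> str:
    """Place tablets on nodes round-robin (a deliberately naive policy)."""
    return nodes[tablet % len(nodes)]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]
    for key in ["user:1", "user:2", "user:3", "user:4"]:
        tablet = tablet_for(key, num_tablets=6)
        print(key, "-> tablet", tablet, "on", node_for(tablet, nodes))
```

Because the tablet is derived from a hash of the key, neighboring keys tend to land on different tablets, which spreads load across nodes. Range sharding would instead compare the key against sorted split points, much like range partitioning does.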
In YugabyteDB you can use both concepts, which is why they are easy to confuse. Partitioning is a user-defined, SQL-level concept, so it requires an explicit definition through SQL. Sharding, on the other hand, along with the load balancing of shards, is a storage-level concept that YugabyteDB performs automatically, taking your replication factor into account. As your data grows, the database continues to create more shards and distribute them across your cluster. As you add nodes, the database rebalances the shards onto the new nodes. This lets you scale out horizontally with no effort to meet high throughput requirements. Because partitioning is a SQL-level property that depends on the table definition, it is not automatic.
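The automatic redistribution of shards when a node joins can be pictured with a small balancing loop. This is a simplified sketch under stated assumptions: the `rebalance` function and node names are hypothetical, only tablet counts are balanced, and replicas, leaders, and placement constraints are ignored.

```python
def rebalance(placement: dict) -> dict:
    """Move tablets from the fullest node to the emptiest until counts differ by at most one."""
    # Copy the input so the caller's placement is left untouched.
    nodes = {name: list(tablets) for name, tablets in placement.items()}
    while True:
        most = max(nodes, key=lambda n: len(nodes[n]))
        least = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[most]) - len(nodes[least]) <= 1:
            return nodes
        nodes[least].append(nodes[most].pop())


if __name__ == "__main__":
    # Two nodes hold three tablets each; a third, empty node has just joined.
    before = {"node-a": [0, 1, 2], "node-b": [3, 4, 5], "node-c": []}
    print(rebalance(before))
```

The application does nothing here: tablets move between nodes while the table definition and the SQL that queries it stay the same.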
The main use of partitioning in YugabyteDB, or declarative partitioning as it is called in PostgreSQL, is to group your data together. For example, YSQL time series workloads can mimic the time-to-live (TTL) found in Cassandra or YugabyteDB’s YCQL API by partitioning on the time column and dropping the oldest partition. Another use case is row-level geo-partitioning, when your data cannot leave a specific region (e.g., to comply with GDPR). Partitioning for either of these scenarios is done at the query layer using YSQL, which again sets it apart from sharding, which is done at the storage layer (DocDB). Keep in mind that indexes created on partitioned tables are automatically partitioned.
For deeper insights, check out Franck Pachot’s blog on Distributed SQL Essentials: Sharding and Partitioning in YugabyteDB and Denis Magda’s demo-driven explanations on Partitioning in YugabyteDB and Sharding in YugabyteDB.
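The TTL-style pattern for time series data can be sketched as a small helper that generates the statements involved. The Python below is a hypothetical illustration: the `events` table name, the monthly granularity, and the helper functions are assumptions, while the generated statements follow standard PostgreSQL declarative partitioning syntax for a parent table declared with `PARTITION BY RANGE` on a date or timestamp column.

```python
from datetime import date


def add_months(month: date, n: int) -> date:
    """Return the first day of the month that is n months after the given month."""
    idx = month.year * 12 + (month.month - 1) + n
    return date(idx // 12, idx % 12 + 1, 1)


def partition_name(parent: str, month: date) -> str:
    """Build a partition name such as events_2023_06."""
    return f"{parent}_{month:%Y_%m}"


def create_partition_sql(parent: str, month: date) -> str:
    """Statement that creates the partition covering one calendar month."""
    upper = add_months(month, 1)
    return (
        f"CREATE TABLE {partition_name(parent, month)} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{month.isoformat()}') TO ('{upper.isoformat()}');"
    )


def expired_partition_sql(parent: str, months: list, today: date, keep_months: int) -> list:
    """Statements that drop partitions older than keep_months full months."""
    cutoff = add_months(date(today.year, today.month, 1), -keep_months)
    return [f"DROP TABLE {partition_name(parent, m)};" for m in sorted(months) if m < cutoff]


if __name__ == "__main__":
    existing = [date(2023, m, 1) for m in range(1, 7)]
    print(create_partition_sql("events", date(2023, 7, 1)))
    for stmt in expired_partition_sql("events", existing, date(2023, 6, 15), keep_months=3):
        print(stmt)
```

Dropping a whole partition removes the old data in one operation rather than deleting rows one by one, which is what makes this a reasonable stand-in for a TTL.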
Learn more about YugabyteDB architecture.
Some third-party users and blogs claim that RocksDB does not scale well because of limits on its data set size. However, this does not affect YugabyteDB.
DocDB, YugabyteDB’s storage layer, is based on RocksDB. However, we have made significant improvements to ensure that node density and scalability are two of our top benefits. In YugabyteDB, each tablet (shard) has its own RocksDB instance. Each tablet is limited to 10GB by default; if it grows beyond that, we automatically split it. At the same time, a tserver, which makes up the data layer and hosts every tablet on the node, can run hundreds of tablets given sufficient CPU and memory resources. As a result, there is no need to limit the amount of data per node.
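The size-based splitting described above can be modeled in a few lines. This is a simplified sketch: the 10GB threshold comes from the text, but the `Tablet` class, the 16-bit hash range, and the assumption that a split divides both the range and the data exactly in half are illustrative only and do not reflect how DocDB chooses split points.

```python
from dataclasses import dataclass

SPLIT_THRESHOLD_GB = 10.0  # default tablet size limit quoted in the text


@dataclass
class Tablet:
    start: int      # inclusive start of the hash range
    end: int        # exclusive end of the hash range
    size_gb: float  # amount of data currently stored in the tablet


def split_oversized(tablets: list, threshold: float = SPLIT_THRESHOLD_GB) -> list:
    """Split tablets larger than the threshold until every tablet fits."""
    result = []
    pending = list(tablets)
    while pending:
        tablet = pending.pop()
        if tablet.size_gb > threshold and tablet.end - tablet.start > 1:
            mid = (tablet.start + tablet.end) // 2
            half = tablet.size_gb / 2  # assumes data is spread evenly over the range
            pending.append(Tablet(tablet.start, mid, half))
            pending.append(Tablet(mid, tablet.end, half))
        else:
            result.append(tablet)
    return sorted(result, key=lambda t: t.start)


if __name__ == "__main__":
    for t in split_oversized([Tablet(0, 0x10000, 25.0)]):
        print(t)
```

In this model a 25GB tablet ends up as four 6.25GB tablets, each backed by its own storage instance, so the node keeps holding the full 25GB while no single instance grows past the limit.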
To learn more about sharding with YugabyteDB and the ideal number of tablets and tablet size, read Distributed SQL Sharding: How Many Tablets and What Size. For those interested in getting deeper into the weeds of how auto-sharding works, check out Franck Pachot’s new blog, YugabyteDB Auto-Sharding.
Is there a tool for visualizing the data stored in my YugabyteDB cluster, or is anything built into YugabyteDB itself?
A key benefit of YugabyteDB’s compatibility with Postgres and Cassandra is our ability to work with third-party tools by reusing the existing adapters for these databases. This includes popular visualization tools such as pgAdmin, DBeaver, and Arctype. This ability also extends to integrations such as Kafka, Spark, and Flyway, among others, as well as drivers and ORMs. Supporting the popular tools and integrations that developers already use in the ecosystem was a major focus for the YugabyteDB team. This was a key reason for the decision to reuse the Postgres query layer (YSQL) and recreate Cassandra using C++ (YCQL).
A library of distributed SQL tips and tricks and general “how to” information can be found by searching the YugabyteDB blog, which is updated weekly, as well as our DEV Community Blogs. Some recent highlights include:
- How to Optimize Pagination for Distributed Data While Maintaining Ordering
- How to Build an Interactive Data Analytics Dashboard in Microsoft Power BI Using YugabyteDB
- How to Avoid Hotspots on Range-Based Indexes in Distributed Databases
- How to Set Up a Mechanism to Capture pg_stat_statements in a Persistent Table
- Stream Data to Amazon S3 Using YugabyteDB CDC
All Distributed SQL Summit (DSS) 2022 sessions are available. If you missed the event, or missed a specific session, watch all the content on demand. In addition, check out some of our most popular “how to” content on the Yugabyte YouTube channel, including our YugabyteDB Friday Tech Talks (designed for engineers by engineers).
Check out the upcoming YugabyteDB events, including all training sessions, conferences, and in-person and virtual events.
This blog series would not be possible without the support of fellow Yugabeings such as Denis Magda, Dorian Hoxha, Franck Pachot, and Frits Hoogland, to name a few. We also thank our incredible user community for not being afraid to ask questions.
Do you have questions?
Ready to start exploring YugabyteDB features?
You have some great options: run the database locally on your laptop (Quick Start), deploy it to your favorite cloud provider (Multi-node Cluster Deployment), or sign up for a free YugabyteDB Managed cluster. It’s easy! Get started today!