Start Now

Distributed SQL Tips and Tricks – July 2022

Marko Rajcevic

Welcome back to our distributed SQL tips and tricks blog, where I have the pleasure of recapping distributed SQL questions from around the Internet.

This blog series would not be possible without the support of fellow Yugabeings, Denis Magda, Dorian Hoxha, Franck Pachot, and Frits Hoogland, to name a few. Thanks also to our incredible user community, who are not afraid to ask tough questions.

Do you have a question?

If so, ask on the YugabyteDB Slack channel, Forum, GitHub, or Stack Overflow. For previous distributed SQL tips and tricks posts, check out our archives.

Now, let’s dive in!

When Should I Use Range Sharding Instead of Hash Sharding?

As mentioned in the question, YugabyteDB supports two methods of sharding data: by hash and by range. The sharding method is selected when creating a table or index by setting your PRIMARY KEY.

By default, your primary key will be sharded via HASH unless you call out ASC or DSC based on whether you want to range partition in ascending or descending order. If you are coming from PostgreSQL, note this as PG indexes are B-Tree, logically ordered by range.

Single-row selection using the primary key index

Since YugabyteDB is an OLTP database, it runs most effectively on queries that allow single-row selection using the primary key index. These are the types of workloads hash sharding is most effective for. Here, it helps avoid hotspots since the hashing allows the access to be evenly spread over the tablets.

If you run a range scan on a hash sharded table or index, you will see that it is less effective. In this case, for queries selecting a range of rows, it will be more efficient to shard by range. But, in cases where you may have a query such as:

SELECT * from test_table WHERE id = 'foo' AND time <= NOW() FOR UPDATE SKIP LOCKED LIMIT 1;

You can create a composite primary key where the first column is sharded by HASH, and a second column is added as the clustering key in ASC or DSC.

In the case above, the primary key can look something like PRIMARY KEY(id, time). Since ASC is the default for the clustering column, the data for this table will be hash sharded, and the rows with the same value for id will be clustered in ascending order by time. This will make the query above much more efficient.

Each method has its pros and cons. If you are having trouble differentiating between sharding and partitioning concepts, I encourage you to check out this great blog by Franck Pachot.

If you are eager to start exploring YugabyteDB, you can test these methods in the examples here.

Is it Necessary to Use the Primary Key in All Queries, and How Will it Find the Right Tablet if I Do Not Use the Primary Key?

Any table created in YugabyteDB is sharded by a key regardless if it is defined by the user. The same applies to one assigned by the system, so a user does not create one.

Therefore, for the user defined primary key, YSQL can find the row if you specify the key efficiently.

If you don’t specify it, as mentioned above, YSQL has no other option than to visit every row to validate whether it is the correct one based on the specified filters. This reading of all rows is called a ‘seq scan,’ which you will find with the explain command. This reveals the execution plan.

You see the same behavior if you have a user defined key but do not use it in the query.

YSQL will scan all of the data across all of your nodes, incurring heavy latencies. As a result, we always encourage users to create a primary key that will be used in their queries. In the cases where you have access to other columns on your table, you can create an index on these column(s) to speed up the performance of that query. You can also use the include keyword for a query that will add the data, but it won’t use it for the index ordering.

When Should I Set up Kafka Versus Using xCluster Async Replication?

xCluster is a built-in replication feature supported at the storage layer. It enables asynchronous replication between two independent YugabyteDB clusters without incurring additional latency on writes.

This feature replicates transaction logs (WALs) in a timeline order of updates and provides the at-least-one semantic (additional features and limitations can be found here).

If you need to replicate specific tables across two regions, you should consider this method as it is a built-in YugabyteDB feature.

Apache Kafka is an extra component in your architecture best suited to stream or replicate changes between YugabyteDB and third-party software. However, YugabyteDB supports change-data-capture (CDC) natively with the use of the Debezium Connector (which can be used for most use cases).

New Documentation, Blogs, and Tutorials

You can find all the latest YugabyteDB blogs here, as well as our DEV Community Blogs here.

New Videos

Upcoming Events and Training

Next Steps

Thank you for reading this month’s distributed SQL tips and tricks blog.

Are you ready to start exploring YugabyteDB features?

You have some great options:

It’s easy! Get started today!

Related Posts

Marko Rajcevic

Related Posts

Get started in any cloud, container or data center