Distributed SQL Tips and Tricks – July 2022
Welcome back to our distributed SQL tips and tricks blog series, where I have the pleasure of recapping distributed SQL questions from around the Internet.
This blog series would not be possible without the support of fellow Yugabeings, Denis Magda, Dorian Hoxha, Franck Pachot, and Frits Hoogland, to name a few. Thanks also to our incredible user community, who are not afraid to ask tough questions.
Do you have a question?
If so, ask on the YugabyteDB Slack channel, Forum, GitHub, or Stack Overflow. For previous distributed SQL tips and tricks posts, check out our archives.
Now, let’s dive in!
When Should I Use Range Sharding Instead of Hash Sharding?
As mentioned in the question, YugabyteDB supports two methods of sharding data: by hash and by range. The sharding method is selected when creating a table or index by setting your PRIMARY KEY.
By default, your primary key will be sharded via HASH unless you call out ASC or DSC based on whether you want to range partition in ascending or descending order. If you are coming from PostgreSQL, note this as PG indexes are B-Tree, logically ordered by range.
Single-row selection using the primary key index
Since YugabyteDB is an OLTP database, it runs most effectively on queries that allow single-row selection using the primary key index. These are the types of workloads hash sharding is most effective for. Here, it helps avoid hotspots since the hashing allows the access to be evenly spread over the tablets.
If you run a range scan on a hash sharded table or index, you will see that it is less effective. In this case, for queries selecting a range of rows, it will be more efficient to shard by range. But, in cases where you may have a query such as:
SELECT * from test_table WHERE id = 'foo' AND time <= NOW() FOR UPDATE SKIP LOCKED LIMIT 1;
You can create a composite primary key where the first column is sharded by HASH, and a second column is added as the clustering key in ASC or DSC.
In the case above, the primary key can look something like PRIMARY KEY(id, time). Since ASC is the default for the clustering column, the data for this table will be hash sharded, and the rows with the same value for id will be clustered in ascending order by time. This will make the query above much more efficient.
Each method has its pros and cons. If you are having trouble differentiating between sharding and partitioning concepts, I encourage you to check out this great blog by Franck Pachot.
If you are eager to start exploring YugabyteDB, you can test these methods in the examples here.
Is it Necessary to Use the Primary Key in All Queries, and How Will it Find the Right Tablet if I Do Not Use the Primary Key?
Any table created in YugabyteDB is sharded by a key regardless if it is defined by the user. The same applies to one assigned by the system, so a user does not create one.
Therefore, for the user defined primary key, YSQL can find the row if you specify the key efficiently.
If you don’t specify it, as mentioned above, YSQL has no other option than to visit every row to validate whether it is the correct one based on the specified filters. This reading of all rows is called a ‘seq scan,’ which you will find with the explain command. This reveals the execution plan.
You see the same behavior if you have a user defined key but do not use it in the query.
YSQL will scan all of the data across all of your nodes, incurring heavy latencies. As a result, we always encourage users to create a primary key that will be used in their queries. In the cases where you have access to other columns on your table, you can create an index on these column(s) to speed up the performance of that query. You can also use the include keyword for a query that will add the data, but it won’t use it for the index ordering.
When Should I Set up Kafka Versus Using xCluster Async Replication?
xCluster is a built-in replication feature supported at the storage layer. It enables asynchronous replication between two independent YugabyteDB clusters without incurring additional latency on writes.
This feature replicates transaction logs (WALs) in a timeline order of updates and provides the at-least-one semantic (additional features and limitations can be found here).
If you need to replicate specific tables across two regions, you should consider this method as it is a built-in YugabyteDB feature.
Apache Kafka is an extra component in your architecture best suited to stream or replicate changes between YugabyteDB and third-party software. However, YugabyteDB supports change-data-capture (CDC) natively with the use of the Debezium Connector (which can be used for most use cases).
New Documentation, Blogs, and Tutorials
You can find all the latest YugabyteDB blogs here, as well as our DEV Community Blogs here.
- Java Development: How Applications Litter Beyond the Heap
- Snowflake CDC: Publishing Data Using Amazon S3 and YugabyteDB
- Three NoSQL Challenges That Can Be Solved with Distributed SQL
- Distributed SQL Summit 2022: Become a Data-Driven Organization at Scale
- YugabyteDB 2.15: Accelerate Cloud Native Adoption with YugabyteDB Voyager
- YugabyteDB 2.15: Discover Worry-Free Performance
- YugabyteDB 2.15: Enable Developers to Build Quicker and Ship Faster
- YugabyteDB 2.15: Support Any Workload with Dynamic Workload Optimization
- Announcing YugabyteDB 2.15 and YugabyteDB Voyager: The Future of Distributed SQL
- How Narvar Embraces Flexibility for Compliance, Multi-Cloud, and More
- Episode 1: Creating First YugabyteDB Managed Cluster
- Episode 2: Exploring Distributed SQL with YugabyteDB Managed
- Episode 3: Creating Sample Application for YugabyteDB Managed
- YugabyteDB Friday Tech Talks: Episode 17: xCluster Replication
- YugabyteDB Demo: High Availability and Fault Tolerance
- YugabyteDB Demo: Async Replication Between Several Distant Clusters
- YugabyteDB Demo: Backup and Restore
- YugabyteDB Friday Tech Talks: Episode 18: Using the Orafce Extension in YugabyteDB
- Evaluating PostgreSQL Compatibility, Episode 1: Introduction
- Evaluating PostgreSQL Compatibility, Episode 5: Runtime Compatibility
- YugabyteDB Friday Tech Talks: Episode 22: Change Data Capture in YugabyteDB
- YugabyteDB Friday Tech Talks: Episode 23: Transaction Isolation Levels
- Why YugabyteDB? Ep. #1, A New Startup is Born
- YugabyteDB Friday Tech Talks: Episode 24: Partitioning Tables in YugabyteDB
- Why YugabyteDB? Ep. #2, Quick Launch in Cloud
- Introducing YugabyteDB 2.15 and YugabyteDB Voyager
- YugabyteDB Friday Tech Talks: Episode 25: Optimistic & Pessimistic Locking in YugabyteDB
- YugabyteDB Friday Tech Talks: Episode 26: Managing Your YugabyteDB Cluster with yugabyted
- Distributed SQL Summit 2021 – all videos!
Upcoming Events and Training
- YugabyteDB Friday Tech Talks: Are Stored Procedures a Good Thing?
- Common Database Misconceptions: Distributed SQL vs. NewSQL
- YugabyteDB DBA Fundamentals
- Hack MidWest
- Build a Real-time Polling App with Hasura GraphQL and YugabyteDB Managed
- AWS Summit Chicago
- Distributed SQL Summit 2022 Virtual Event
- Cloud Expo Asia, Singapore
- KubeCon + CloudNativeCon North America 2022
- CIO Visions Summit, Chicago
Thank you for reading this month’s distributed SQL tips and tricks blog.
Are you ready to start exploring YugabyteDB features?
You have some great options:
- run the database locally on your laptop (Quick Start)
- deploy it to your favorite cloud provider (Multi-node Cluster Deployment)
- sign up for a free YugabyteDB Managed cluster
It’s easy! Get started today!