How To Design Distributed Indexes for Optimal Query Performance

March 7, 2023

The primary purpose of indexes in a SQL database is to speed up the lookup of data. There is a huge amount of information available, detailing how indexes work on single-node databases like MySQL, Oracle, SQLServer. But, translating that knowledge to a distributed database requires some understanding of how distributed databases work.

In a distributed database, data is split into multiple tablets (NOTE: “tablet” is a YugabyteDB term for “shard”) which reside on different nodes. But it is not just tables; indexes are also split into tablets and are distributed across multiple nodes. In this blog, let’s examine how to optimally design indexes to get the best query performance from the distributed nature of indexes.

Structure of an Index

A Create Index statement has three components—partition, clustering, and include—and is structured as:

CREATE INDEX 
 idx_name ON table_name ( (PARTITION cols), CLUSTERING cols) INCLUDE (cols)

PARTITION – decides how rows in the index are distributed.
CLUSTERING – decides how rows with the same partition column values are ordered.
INCLUDE – additional columns to include in the index to avoid a round-trip to the main table.

Sample Dataset

To better understand the importance of the different parts of the index, let’s consider a census table that maintains the id, name, age, and other data points of the citizens in various zipcodes.

CREATE TABLE census(
    id int,
    name varchar(255),
    age int,
    zipcode int,
    employed boolean,
    PRIMARY KEY(id)
);


----------------------------------------

   id   |   name   | age | zipcode  | employed
--------+----------+-----+----------+----------
 110359 | Chelsea  |  81 |   94356  | t
 192735 | Steven   |   8 |   94048  | t
 219128 | Mary     |  43 |   94095  | t
 493695 | Michael  |   5 |   94085  | f
 310517 | Kimberly |  34 |   94072  | f
 593962 | Corey    |  67 |   94082  | t
 627995 | Anthony  |  58 |   94191  | f
 651891 | Theresa  |  35 |   94230  | t
 669921 | Morgan   |  73 |   94337  | f
 790562 | Maria    |  31 |   94117  | f

Basic Index

Let’s fetch the id of all citizens with name=‘Michael’ in zipcode=94085 and look at the query plan.

select id from census where zipcode=94085 and name='Michael';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Seq Scan on public.census (actual time=11.179..3409.528 rows=2090 loops=1)
   Output: id
   Filter: ((census.zipcode = 94085) AND ((census.name)::text = 'Michael'::text))
   Rows Removed by Filter: 997910
Execution Time: 3409.746 ms

The Seq Scan on public.census in the output above tells us that a sequential scan is being done. You will notice that the executor did not know where to quickly find the data, so it did a full table scan.

The other thing we notice is [Rows Removed by Filter: 997910]; this indicates that the executor went through all the rows and filtered out most of them. These are indications that we need an index. As we are doing a direct match on 2 columns, we need both of them in our index.

Equality, Hot Shard

Let’s create a basic index first.

CREATE INDEX idx_zip_name on census(zipcode HASH, name ASC);

We have designed a good index that speeds up our query and avoids a full table scan. But let’s dig a little deeper. The partition part zipcode HASH specifies that the index will be distributed on the hash of the zipcode.

So, all data in a specific zipcode will be in the same tablet. Within a specific zipcode, all the names will be in ASC order since name is in the clustering portion of the index. The executor can do a point lookup on the zipcode and quickly locate the matching name since it is in ASC order.

But, let’s say a specific zipcode is highly populated or a lot of selects are being run against it. In this case, the tablet containing that zipcode would become overloaded (typically called a hot tablet). In this scenario, we can do better by distributing the index on both the zipcode and the name:

CREATE INDEX idx_zip_name on census((zipcode, name) HASH);

This ensures that a particular zipcode is distributed across different tablets, but at the same time, will keep data for a specific zipcode and specific name, within a single tablet. This will help avoid hot tablet issues.

TIP: Partition your index on all the columns with equality operation in your query. This will help restrict your query execution to a minimal number of tablets!

Range, Order By

Now let’s fetch the id of kids under the age of 13 with name=‘Michael’ in zipcode 94085.

select id from census where zipcode=94085 and name='Michael' and age < 13;

We definitely need to have age in the index. But we cannot add this column to the partition part as it is not an equality-based point lookup but an order-based range scan. This is where the clustering part comes into play as it enforces ordering. Let’s redefine the index.

CREATE INDEX idx_zip_name on census((zipcode, name) HASH, age ASC);

Because of the sorted nature of the clustering part, it would be the right place to add columns on which you are ordering your results. This means, the above index will also help ORDER BY queries like:

select id from census where zipcode=94085 and name='Michael' order by age;

It is always advisable to keep the ordering in the index the same as the ordering in the query so that the executor need not reorder the resultset. In some cases, the reverse-scan optimization in YugabyteDB kicks in to mitigate the side effects.

Inequality

The partition keys are helpful for point lookups where we know the exact value of the key we are looking for. The clustering keys help us in range scans. But, what about the values that we don’t want? If we do an in-equality on the partition key, it would be a full table scan. If the key is in the clustered part, the search tree would be smaller.

So an index like:

CREATE INDEX idx_zip_name on census((zipcode, name) HASH, age ASC);

would also help in in-equality comparisons like:

select id from census where zipcode=94085 and name='Michael' and age != 13;

TIP: When you do range comparison [<,>], order by or an inequality check [!=] on a column, add that column to the CLUSTERING part of the index!

Index-Only Scan

By default, the purpose of an index is to help the query executor quickly identify the rows matching the query. The query executor then fetches any related data from the main table as needed. For example, in our above queries, we have been fetching the id as the result. The columns zipcode, name, and age are in the index. But id is in the main table.

So, after looking up the index, the executor needs to go to the main table to fetch the id of the citizens. In a large cluster, the index could be on a different node than the actual table. This will increase query response times.

The extra hop (or trip) to the main table can be avoided by adding the needed columns in the INCLUDE part.

CREATE INDEX idx_zip_name on census((zipcode, name) HASH, age ASC) INCLUDE(id);

As id is present right in the index, the query executor would be able to return all needed columns directly from the index with an Index-Only Scan.

TIP: Add the columns you are selecting which are not in the clustering or the partitioning part of the index to the INCLUDE part!

NOTE: Including additional columns using the INCLUDE clause (aka covering indexes) trades space for performance because these INCLUDEd columns are duplicated in the index and main table.

Cardinality of Columns

When we design an index, we have to ensure the data is distributed across multiple nodes. For this, we should avoid partitioning on low cardinality columns like boolean(T/F), and days of week (Monday, Tuesday, etc). Consider this query when we want to get a list of all employed citizens within a zipcode.

select * from census where employed and zipcode = 94085;

And consider this index.

CREATE INDEX idx_employed on census(employed, zipcode);

This looks right at first sight, but is it? The column employed is in the partition part. The whole index would be distributed on hash(True) and hash(False), which would result in all the data being in a maximum of 2 tablets. The right index for that query would be

CREATE INDEX idx_employed on census(zipcode, employed);

Now, the index data would be distributed based on zipcode, which ranges from 1-99999. So, the data would automatically be distributed across multiple tablets.

Also, having the high cardinality items earlier in the index definition will help shorten the lookup tree. For example in the above 2 index forms, a simple index layout would be

False, 90001         90001, False
False, 90002         90001, True
. . .                90002, False
False, 94085         90002, True
. . .                . . .
True, 90001          94084, False
True, 90002          94084, True
. . .                94085, False  
True, 94085          94085, True

In the former case, the lookup would be first done on True/False and then would search through all True items to locate 94085. In the latter case, the lookup will go directly to 94085 and then search for True just within a few items.

Partial Index

When your queries are limited to specific patterns, instead of indexing all the rows, we could index just a subset of data. This would be of great benefit during writes and also improve read performance. For example, let’s consider the query:

select * from census where employed and zipcode = 94085;

Here, we are always looking for employed=True for various zipcodes. For all practical purposes(depending on your application), we don’t need to have employed=False in our index. So, an effective partial index would be:

CREATE INDEX idx_employed on census(zipcode, employed) WHERE employed;

Now, the index will only contain rows with employed=True. The rows with employed=False will not be added to the index. The index would be much smaller and the lookups faster.

90001, True
90002, True
. . .
94084, True
94085, True

Conclusion

Indexes are critical to query performance. Designing indexes for distributed databases can take a little more time and effort compared to traditional databases. But, once you understand the significance of different parts of the index, speeding up your queries is simple!

How To Design Distributed Indexes for Optimal Query Performance

Structure of an Index

Sample Dataset

Basic Index

Equality, Hot Shard

Range, Order By

Inequality

Index-Only Scan

Cardinality of Columns

Partial Index

Conclusion

Further Reading

Explore Distributed SQL and YugabyteDB in Depth

How To Design Distributed Indexes for Optimal Query Performance

Related Posts

Explore Distributed SQL and YugabyteDB in Depth