Indexing for LIKE Queries in PostgreSQL

PostgreSQL Tips & Tricks

June 5, 2023

With the LIKE operator, you can compare a string to a pattern that includes some wildcards and then index for these access patterns. Let’s take a look to see how, for situations where the wildcard is either in front or at the end of the pattern.

I’ll use a simple dataset, the Linux dictionary of English words installed in the dict package:

yugabyte=# \! sudo dnf install -y words

yugabyte=# create table words (
            id bigint generated always as identity primary key
            , word text not null
            );

CREATE TABLE

yugabyte=# \copy words(word) from '/usr/share/dict/words'
COPY 479828

Please note that I will run the demos on YugabyteDB, which is PostgreSQL-compatible. The examples provided here work the same on the PostgreSQL database. Only some details about the storage implementation are different in the execution plan.

LIKE ‘prefix%’ with range index

If you know the prefix, indexing for LIKE is easy because PostgreSQL can transform it to a range scan; an ascending or descending index provides fast and scalable access.

yugabyte=# create index words_word on words ( word asc );
CREATE INDEX

If using YugabyteDB:
In PostgreSQL, the asc option is not mandatory because it’s the default. YugabyteDB re-uses PostgreSQL for the query layer, with additional sharding options to distribute and replicate for scalability and high availability. The default sharding method with YugabyteDB applies a hash function on the first column of the index which allows YugabyteDB to distribute and provide scalable point queries. However, for range queries, the sharding must be done by range of value, which must be defined with an explicit ASC or DESC because hash sharding modifies the order and a Seq Scan would be the only solution.

yugabyte=# explain (analyze, costs off, summary off)
           select word from words where word like 'data%';

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Index Only Scan using words_word on words (actual time=6.632..6.643 rows=28 loops=1)
   Index Cond: ((word >= 'data'::text) AND (word < 'datb'::text))
   Remote Filter: (word ~~ 'data%'::text)
   Heap Fetches: 0
(7 rows)

The filter (word ~~ 'data%'::text) is the predicate in the SQL query (~~ is the internal PostgreSQL operator that is equivalent to LIKE). The query planner has added an index condition for the range derived from the pattern: the text that starts with ‘data’ is, in the index collation order, between ‘data’ and ‘datb’.

Note: PostgreSQL11 introduced the ^@ operator that is similar to the starts_with() function and can be used for similar queries:

postgres=# explain (analyze, costs off, summary off)
           /*+ Set(enable_seqscan off) */
           select word from words where word ^@ 'data';

                            QUERY PLAN
-----------------------------------------------------------------
 Index Only Scan using words_word on words (actual time=848.656..848.668 rows=28 loops=1)
   Remote Filter: (word ^@ 'data'::text)
   Heap Fetches: 0
(6 rows)

This operator was introduced in PG11 for use by SP-GiST.It is not chosen by the query planner, except when disabling Seq Scan. In PG15 the query planner can use it with range indexes (ex: https://dbfiddle.uk/OMWqGYDF)

ILIKE ‘prefix%’ with expression index

With ILIKE, or its equivalent ~~*, the comparison is case insensitive. It cannot be served by a range scan, because upper and lower characters are stored at different places in the index:

yugabyte=# explain (analyze, costs off, summary off)
           select word from words where word ilike 'data%';

                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on words (actual time=611.787..611.794 rows=29 loops=1)
   Remote Filter: (word ~~* 'data%'::text)
(5 rows)

An exception occurs when the prefix begins with non-alphabetical characters which are not case sensitive. So for example, searching for the prefix ‘-ph’ allows using the index only for the first character, which lies between ‘-‘ and its ASCII successor ‘.’.

yugabyte=# explain (analyze, costs off, summary off)
           select word from words where word ilike '-ph%';

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Index Only Scan using words_word on words (actual time=1.564..1.573 rows=26 loops=1)
   Index Cond: ((word >= '-'::text) AND (word < '.'::text))
   Remote Filter: (word ~~* '-ph%'::text)
   Heap Fetches: 0
(7 rows)p

For this, don’t use ILIKE but a simple comparison with all upper case (or all lower case) and, of course, create an expression index for it

yugabyte=# create index words_upper 
           on words ( (upper(word)) asc );
CREATE INDEX

yugabyte=# explain (analyze, costs off, summary off)
           select word from words 
           where upper(word) like upper('-ph%');

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Scan using words_upper on words (actual time=6.181..6.194 rows=26 loops=1)
   Index Cond: ((upper(word) >= '-PH'::text) AND (upper(word) < '-PI'::text))
   Remote Filter: (upper(word) ~~ '-PH%'::text)
(6 rows)

For those who follow this PostgreSQL Tips and Tricks series, you know that I like to create a view to add the expressions as virtual columns, to be sure that I use the same expression in the query and in the index:

yugabyte=# create view words_view as select id, word
           , upper(word) as iword from words;
CREATE VIEW

yugabyte=# explain (analyze, costs off, summary off)
           select word from words_view where iword like '-ph%';

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Scan using words_upper on words (actual time=0.485..0.485 rows=0 loops=1)
   Index Cond: ((upper(word) >= '-ph'::text) AND (upper(word) < '-pi'::text))
   Remote Filter: (upper(word) ~~ '-ph%'::text)
(6 rows)

LIKE ‘%suffix’ with reverse index

If you know the suffix but not the prefix, it’s not possible to do a range scan because the interesting rows are scattered in the index. This is where you can leverage an interesting function: reverse() which returns the characters in reverse order. This function preserves the equivalence, so that a condition such as string1=string2 can be replaced by reverse(string1)=reverse(string2). We can then transform a suffix pattern to a prefix pattern by applying the function on both sides of the equality and create an index for it.

yugabyte=# create index words_reverse 
           on words ( reverse(word) asc );
CREATE INDEX

yugabyte=# explain (analyze, costs off, summary off)
           select word from words 
           where reverse(word) like reverse('%ism');

                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Index Scan using words_reverse on words (actual time=8.429..34.178 rows=4547 loops=1)
   Index Cond: ((reverse(word) >= 'msi'::text) AND (reverse(word) < 'msj'::text))
   Remote Filter: (reverse(word) ~~ 'msi%'::text)
(6 rows)

Note that if you have ‘orafce’ extension installed, a similar function plvstr.rvrs() is faster than reverse() and will provide the same result. However, be sure to be consistent with the expression in the index and in the queries.

C collation or text_pattern_ops

I have run those examples in a database with the C collation sorting. Comparisons are based on ASCII, the binary encoding (UTF8 here):

yugabyte=# \l
                                    List of databases
       Name        |  Owner   | Encoding | Collate |    Ctype    |
-------------------+----------+----------+---------+-------------+
 postgres          | postgres | UTF8     | C       | en_US.UTF-8 |
 system_platform   | postgres | UTF8     | C       | en_US.UTF-8 |
 template0         | postgres | UTF8     | C       | en_US.UTF-8 |
                   |          |          |         |             |
 template1         | postgres | UTF8     | C       | en_US.UTF-8 |
                   |          |          |         |             |
 yugabyte          | postgres | UTF8     | C       | en_US.UTF-8 |
(6 rows)

With C collation, the indexes store the text in order of their encoding and compare it character by character. This is why a prefix pattern can be translated to a range. Today where most databases have international users and are not specific to one country, the C collation widely used. However, if that’s not the case, you can get the same index range scans if you create the index with a special operator, text_pattern_ops to allow character by character comparison rather than with the local rule.

yugabyte=# create index words_word
           on words ( word text_pattern_ops asc);
CREATE INDEX

LIKE ‘%pattern%’ with GIN index on trigrams

There is a general solution to all of the previous queries —solve by using a single GIN index with the PG_TRGM extension which indexes all trigrams in a text:

yugabyte=# create extension if not exists pg_trgm;
CREATE EXTENSION

yugabyte=# create index words_word_trgm on words
           using gin ( word gin_trgm_ops);
CREATE INDEX

This index is more expensive to build, maintain, and query. Instead of having one index entry for text column value, it indexes each set of three consecutive characters (trigrams) to be able to look up any pattern. This overhead pays back by allowing all kinds of full text searches, including all previous examples. It is also used with LIKE and ILIKE conditions without the need to use functions such as upper() or reverse().

Here are the execution plans for the previous queries with this index (Note: I’ve dropped the other indexes):

yugabyte=# explain (analyze, costs off, summary off)
           select word from words where word like 'data%';

                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Index Scan using words_word_trgm on words (actual time=11.328..187.294 rows=28 loops=1)
   Index Cond: (word ~~ 'data%'::text)
   Rows Removed by Index Recheck: 25426
(6 rows)

yugabyte=# explain (analyze, costs off, summary off)
           select word from words where word ilike '-ph%';

                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Index Scan using words_word_trgm on words (actual time=10.752..323.162 rows=26 loops=1)
   Index Cond: (word ~~* '-ph%'::text)
   Rows Removed by Index Recheck: 44598
(6 rows)

yugabyte=# explain (analyze on, costs off, summary off)
           select word from words
           where word like '%ism';

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using words_word_trgm on words (actual time=11.306..46.308 rows=4547 loops=1)
   Index Cond: (word ~~ '%ism'::text)
   Rows Removed by Index Recheck: 901
(6 rows)

yugabyte=# explain (analyze on, costs off, summary off)
           select word from words
           where word like '%byte%';

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Index Scan using words_word_trgm on words (actual time=4.418..4.449 rows=54 loops=1)
   Index Cond: (word ~~ '%byte%'::text)
   Rows Removed by Index Recheck: 18
(6 rows)

This GIN index can also be used efficiently in PostgreSQL when combining prefixes and suffixes because it uses a bitmap index scan, and bitmaps can be combined:

postgres=> explain (analyze, costs off, summary off)
           select word from words where word like 're%olution';

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Bitmap Heap Scan on words (actual time=0.872..0.879 rows=6 loops=1)
   Recheck Cond: (word ~~ 're%olution'::text)
   Rows Removed by Index Recheck: 3
   Heap Blocks: exact=8
   ->  Bitmap Index Scan on words_word_trgm (actual time=0.854..0.854 rows=9 loops=1)
         Index Cond: (word ~~ 're%olution'::text)
(8 rows)

There are only three false positives to discard after the index scan.

Bitmap scan works on table pages. Because YugabyteDB distributes rows and not pages, it doesn’t do any bitmap scans. It can show more false positive (There’s still work in progress to optimize it—see Github issue #7850):

yugabyte=> explain (analyze, costs off, summary off)
           select word from words where word like 're%olution';

                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Index Scan using words_word_trgm on words (actual time=12.604..119.505 rows=56 loops=1)
   Index Cond: (word ~~ '%olution'::text)
   Rows Removed by Index Recheck: 15647
(6 rows)

A GIN index on trigrams is useful for generic text search, but if all you need is a prefix, a direct index will be more efficient.

Performance and Scalability

To evaluate performance and scalability, I prioritize running small reproducible examples and analyzing the execution plan rather than relying solely on response time. Response time can fluctuate due to factors like query patterns, cached data, concurrent workload, and database system configuration.

For instance, when working with PostgreSQL, or a PostgreSQL-compatible database, the response time may be influenced by vacuuming, the utilization of a GIN index with bitmap scan, and the number of Rows Removed by Index Recheck. Performance may vary if you utilize a different storage system, like YugabyteDB, that doesn’t necessitate vacuuming or employ bitmap scan.

Understanding the time complexity and scalability of the execution plan is crucial. The mentioned indexes are designed to prevent full table scans and contribute to improved performance so that the response time isn’t directly tied to the total number of rows in the table. However, if the filter returns most of the table’s row, a full table scan (Seq Scan) is probably the most efficient.

In a YugabyteDB Friday Tech Talk (YFTT), I discussed these techniques and highlighted the significance of range sharding for distributed SQL databases. Watch the short presentation here:

Additional PostgreSQL Tips and Tricks

Check out our library of PostgresQL tips and tricks.

June 5, 2023

PostgreSQL Tips and Tricks

Indexing for LIKE Queries in PostgreSQL

Related Posts

Explore Distributed SQL and YugabyteDB in Depth