An Introduction to PostgreSQL Table Functions in YugabyteDB
This series on PostgreSQL table functions follows on from my “Using Stored Procedures in Distributed SQL Databases” post. In this series of posts we’ll cover:
- What table functions are and why they’re useful
- Demonstrate the use of some built-in SQL table functions
- Introduce how you can implement a user-defined table function, if no built-in one meets your needs, in the same way that you implement regular stored procedures and functions
Please note that everything that I say in this three-part series applies to both vanilla PostgreSQL and YugabyteDB. All the code examples described have identical outcomes in both environments. There are no special distributed SQL considerations you will need to make.
In the context of programming languages in general, we’re used to the notion that a function can be invoked at any syntax spot where an expression is legal and that it computes a single value that depends on the actual arguments with which it is invoked. The value might be compound, like a record or an array of records—but it is nevertheless a single value.
SQL databases allow a new kind of function. A table function is one that returns a computed row set that depends on the actual arguments with which it is invoked. It can be used, therefore, in the from list of a subquery, in arbitrary combination with tables and views; and a subquery whose from list includes a table function can be used at any syntax spot in a SQL statement where a subquery can be used.
YugabyteDB inherits its set of built-in SQL functions from PostgreSQL. Among these, a few are built-in table functions. It also inherits the capability for implementing user-defined table functions. You do this by using the same constructs with which you implement regular stored procedures and functions.
In this section I’ll describe three built-in SQL table functions.
The most obvious example of a built-in SQL table function is generate_series(). It returns a table with a single column of datatype integer. It provides a useful way to generate test data, as long as the required pattern can be specified using SQL expressions, like this:
create table t(k int primary key, v varchar(20)); insert into t(k, v) select a.v*100, case a.v % 3 when 1 then 't_'||ltrim(to_char(a.v, '09')) when 2 then 't_'||ltrim(to_char(a.v*2, '09')) else 't_42' end from (select generate_series(1, 9) as v) as a;
It generates these rows:
k | v -----+------ 100 | t_01 200 | t_04 300 | t_42 400 | t_04 500 | t_10 600 | t_42 700 | t_07 800 | t_16 900 | t_42
This example is rather artificial. But I used this approach to effect to generate the test data for my “The Benefit of Partial Indexes in Distributed SQL Databases” post.
Another example is unnest(). This generates a table from an array of row values. Here’s an example. It uses the same table t.
insert into t(k, v) select k, v from unnest( array[(1, 'one'), (2, 'two'), (3, 'three')]::t);
Notice the typecast “::t”. Here, “t” is taken to be the type name. Maybe you didn’t realize that tables and types are in different namespaces, so that a table and a type can each be called t. And maybe you didn’t realize that when you create a table, a type with the same name, and with the table’s row shape, is also implicitly created.
I’ll show in the “Compact syntax for bulk insert” use case, in my third post in this series, how, when you use “insert into… select… from unnest(…)” as a static SQL statement in a PL/pgSQL procedure, you can insert a set of arbitrarily programmatically computed rows using a single SQL statement. In this use, the argument of unnest() will be a local variable or formal parameter whose datatype is an array of the to-be-inserted records (or scalar values).
The unnest() built-in is also useful in a subquery’s where clause when the in predicate’s argument is a subquery.
select k, v from t where k in (select unnest from unnest(array[1, 2, 3])) order by k;
I’m including this for completeness. It’s of limited usefulness because it must have an explicit literal tuple for every to-be-inserted row. This means that you can’t use it when you don’t know the number of computed rows until run-time. Having said this, I have seen people describe how to generate the text of an insert statement that uses a run-time-defined number of tuples and then execute it dynamically. This seems to me to be pointless because unnest() lets you implement the intended behaviour using a SQL statement that’s fixed at compile time. Here’s an example, again using the same table t:
insert into t(k, v) select column1 as k, column2 as v from (values(4, 'four'), (5, 'five'), (6, 'six')) as a;
You’re probably more used to seeing the elided form of this syntax:
insert into t(k, v) values(4, 'four'), (5, 'five'), (6, 'six');
A regular language plpgsql user-defined function is implemented using the plain return statement. Its argument can only be a single value. A user-defined table function, in contrast, is implemented using either return next, to return the next explicitly computed row, or return query, to return the entire result set of a select statement. You can define the statement using either static SQL or dynamic SQL. You can also implement a language sql table function by defining the body entirely with a single static SQL select statement. (You can use one of the function’s formal parameters at a syntax spot in the select statement where a value is needed.)
My second post, “Implementing PostgreSQL user-defined table functions in YugabyteDB”, covers this topic in detail.
It turns out that it’s legal to use a regular function in the from list of a SQL statement—even though (by definition) it returns just a single value. Here’s an example that uses a built-in SQL function:
select sqrt from sqrt(4);
Of course, you can do the same with a regular user-defined function. The ability to do this seems to be a consequence of a widely respected programming paradigm:
Don’t write defensive code (and take on its technical debt) to prevent a user doing something pointless, but harmless.
Rather, let the user decide not to do the pointless thing! In this use case (evaluate a SQL function in a client-side program), it’s shorter and clearer to use the bare syntax “select sqrt(4)”:
When you’re doing the high-level implementation design for an application that uses a SQL database (distributed or monolithic), the chances of choosing the best design are maximized when you fully understand the capabilities of each of your major modules (programmable browser, load balancer, application server, connection pool, database, and so on). So, among other things, you need to understand the full range of possibilities for implementing business logic inside the database. This knowledge spans the SQL language, SQL built-in functions, and everything about implementing stored procedures. In particular, therefore, you need to know about table functions. When a use case calls for a table function, alternative implementations are typically cumbersome to program and inefficient to execute.
I encourage you, therefore, to read the other posts in this series. My second post explains how to implement user-defined table functions, in all variants. It uses bare-bones examples that are therefore somewhat unrealistic. My third post discusses four realistic use cases that call for user-defined table functions and explains how each case is met using complete, self-contained code examples.