Using Stored Procedures in Distributed SQL Databases
Note: YugabyteDB uses its own distributed document store called DocDB. But it provides SQL and stored procedure functionality by reusing the “upper half” of the standard PostgreSQL source code. This is explained in the two-part blog post “Distributed PostgreSQL on a Google Spanner Architecture”: (1) Storage Layer; and (2) Query Layer.
The case for stored procedures
Software engineering’s most famous principle says that large software systems must be built from modules—where a module is an encapsulation of a unit of coherent functionality behind an API that hides all the implementation details from the API’s clients.
Any software system that uses a SQL database is inevitably “large”, and the top-down decomposition will immediately identify that database as a module. The middle tier application server is identified as another (and this is further decomposed into a submodule for connection pooling and a submodule for communicating with the next tier). Load balancing will be implemented in another module. Some systems use more than one database, sometimes in a polyglot fashion—so yet more database modules. And so it goes on.
Given, then, that the SQL database is a module, what are the implementation details that must be hidden? Of course, it’s the tables, the SQL statements that query and change the contents of these, and the ancillary structures like indexes and triggers. And how do you implement a purpose-oriented API? Of course by using a procedural language. The term stored procedure has come into use to denote either a procedure or a function whose definition is held as metadata inside the database and that is compiled and executed in the same database server process that compiles and executes SQL. And the purpose of a sensibly specified stored procedure is to implement exactly one parameterized business transaction, or to deliver exactly one parameterized report.
So software engineering’s most famous principle, specialized for the use of a SQL database, becomes this: hide the tables, and the SQL statements that manipulate them, behind an API of stored procedures; let outside-of-the-database clients invoke only these.
The case for stored procedures is as simple as this. And it’s unassailable.
However, not all development shops buy the reasoning that I’ve outlined. Some choose to use the database as no more than a bag of tables (and indexes)—arguing, for example, that they can’t hire developers to program in anything but very new and fashionable languages. PL/pgSQL, by contrast, is modeled on Oracle Database’s PL/SQL and on the stored procedure languages of other long-established databases. The languages in this family—imperative procedural languages—share a huge set of features (like datatypes, null semantics, and built-in SQL functions) with the “ancient” SQL Standard. They are also inspired by “ancient” languages like Ada. The “no stored procedures” brigade sometimes even avoids the use of declarative constraints, like foreign keys, preferring to implement their effect in outside-of-the-database application code.
I shall, hereinafter, use the following terms to characterize the two ends of the spectrum of how stored procedures are used:
- The hard shell approach: this is the strict interpretation of the specialization of the principle of modular software construction to a SQL database component within an overall application architecture that uses such a database. Outside-of-the-database clients can invoke only the procedures and functions that jointly define the database component’s API.
- The bag of tables approach: this is the rejection of the principle of modular software construction. Every SQL statement that changes or queries any database table is issued by outside-of-the-database clients, so that the database holds nothing but tables and indexes.
Some shops use a hybrid approach, implementing some business functions using stored procedures and others entirely in outside-of-the-database code.
I’ve discussed very many overall system designs with very many different architects and programmers of OLTP systems. Every shop that follows the hard shell approach reports success in meeting their goals: correct storage and retrieval of business-critical data; robust security; excellent performance; and highly productive problem diagnosis and system maintainability. But shops that eschew the hard shell approach often report problems in at least one of these areas.
Further benefits of the hard shell approach
It’s worth calling out some additional benefits that flow automatically from adopting the hard shell approach.
The principle of least privilege is well established. Allowing outside-of-the-database clients to execute exactly and only the database’s intended functionality honors this principle. But when such clients can access any table with any SQL, the principle is, tautologically, violated. Insider security threats are disturbingly common (see, for example, here). The risk is mitigated by limiting, to the bare minimum, the size of the team that has unrestricted access to the database’s contents. The ordinary benefit brought by modular software construction—different teams, with different specialist domain knowledge, own different modules—therefore also brings a critically important security benefit. Moreover, the within-database objects that implement the database’s API can be owned by subsystem-specific owners so that privilege grants can enforce the same impenetrable encapsulation inside the database as encapsulates the whole database for the outside world. Such fine-grained schemes are not available outside of the database.
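The grant scheme described above can be sketched as the SQL an administrator might issue. The role and schema names here (client_role, app_data, app_api) are illustrative assumptions, not part of any real deployment; the pattern is what matters: revoke direct table access, then grant execute on the API subprograms only.

```python
# A hedged sketch of a least-privilege grant scheme. Names are hypothetical.
# The "grant execute on all procedures in schema" form requires PostgreSQL 11+
# (the version on which YugabyteDB's query layer is based).
grants = [
    "revoke all on all tables in schema app_data from client_role",
    "grant usage on schema app_api to client_role",
    "grant execute on all procedures in schema app_api to client_role",
]
for stmt in grants:
    print(stmt)
```

With these in place, client_role can call the API subprograms but cannot touch the tables behind them.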
An Internet search for “SQL injection” gets an estimated more than six million hits! And the pieces abound with code examples that show the mechanisms of various notorious exploits. Here is an amusing account, inspired by this famous strip cartoon. Briefly, such exploits are possible only when application code composes the text of SQL statements dynamically by concatenating user-supplied text with boilerplate SQL fragments defined statically in a procedure’s source code. Though, with a thorough understanding of the mechanisms of SQL injection, and with due care, attention, and code reviews, SQL statement text can be composed dynamically and safely, mistakes are often made—hence the millions of search hits on the topic.
Stored procedure languages in general, and PL/pgSQL in particular, support embedded SQL statements (usually referred to as static SQL) as ordinary procedural statements within the language’s overall syntax and semantic definitions. This means that the objects that a static SQL statement refers to are written as identifiers that are checked when the stored procedure is compiled. If a check fails (for example, because the code’s owner doesn’t have the privilege to access a referenced object), then procedure compilation fails—and you never get to execute the program (which is when the harm would be done). It means, too, that the values that determine, for example, the effect of a where clause are presented as declared variables, or subprogram formal parameters, that are likewise checked when the procedure is compiled. The net effect is that a stored procedure in YugabyteDB that uses only static SQL (and this is the overwhelmingly common case) is simply injection-proof by construction.
Popular client-side programming languages (like Java, Python, and so on) don’t support static SQL, so when such languages issue SQL, they deal with statements that are represented as ordinary text. This means that the risk of SQL injection, through careless programming, is ever-present.
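The injection mechanism can be sketched without a database at all. The function and variable names below are illustrative; the point is the contrast between concatenating user text into the SQL string and keeping the SQL text fixed while the value travels separately as a bind argument.

```python
# Database-free sketch of the SQL injection mechanism. Names are hypothetical.

def unsafe_query(user_input: str) -> str:
    # Dangerous: the user-supplied text becomes part of the SQL text itself.
    return "select v from t where v = '" + user_input + "'"

def safe_query(user_input: str):
    # Safe: the SQL text is fixed; the value is carried separately as a bind
    # argument, so it can never change the statement's structure.
    return ("select v from t where v = %s", (user_input,))

malicious = "x'; drop table t; --"
print(unsafe_query(malicious))
# The injected text has become executable SQL:
#   select v from t where v = 'x'; drop table t; --'
sql, params = safe_query(malicious)
print(sql, params)
```

In the safe form, the driver binds the value on the wire; no amount of quoting trickery in the value can break out of the bind site.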
Exactly because each procedure or function in the set that defines the database’s API is specified to implement one parameterized business function, all of the actual arguments for its invocation are necessarily presented in a single call—in other words by using a single client-server round trip. The implementation of such a business transaction often needs several SQL statements. So, without encapsulating this business logic in a stored procedure, several client-server round trips would be needed. And doing several round-trips obviously takes more time than doing just one.
The on-wire time isn’t the only performance cost that the hard shell approach avoids. When the procedural code that issues the SQL statements executes in the same database server process that executes them, the fairly elaborate code paths (on both the client side and the server side) that encode/decode each statement and its intermediate results to/from the wire protocol format are eliminated.
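The round-trip arithmetic can be made concrete with a toy cost model. The latency figure here is an assumption chosen only for illustration, not a measurement; the point is that the statement-by-statement approach pays one round trip per statement while the stored procedure call pays one in total.

```python
# Illustrative-only round-trip cost model; the latency value is an assumption.
rtt_ms = 1.0              # assumed client-server round-trip latency
n_statements = 5          # statements in one business transaction

bag_of_tables_ms = n_statements * rtt_ms   # one round trip per statement
hard_shell_ms = 1 * rtt_ms                 # one round trip for the whole call

print(bag_of_tables_ms, hard_shell_ms)  # 5.0 1.0
```

The gap widens linearly with the number of statements per transaction, which is exactly what the experiment below measures.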
A simple experiment
I conducted a simple experiment using a table created thus:
create table t(
  k uuid default gen_random_uuid() constraint t_pk primary key,
  v varchar(100))
UUID stands for “Universally unique identifier” (see this Wikipedia article). And, true to its name, the gen_random_uuid() function populates the table’s surrogate primary key column automatically by returning a different value on each call, no matter on which computer it’s called—just what you need in a multi-node YugabyteDB cluster. The function is exposed by a so-called PostgreSQL extension. It is made available by executing this single statement:
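Python’s standard uuid module offers a client-side analogue of this behavior: uuid4(), like gen_random_uuid(), returns a fresh random 128-bit value on every call, on any machine, which is why such values work as surrogate keys across the nodes of a distributed cluster.

```python
import uuid

# Client-side analogue of gen_random_uuid(): every call to uuid4() yields a
# fresh random 128-bit value, so collisions are negligible in practice even
# when values are generated independently on different machines.
ids = {uuid.uuid4() for _ in range(10_000)}
print(len(ids))  # 10000: no duplicates
```

This is an illustration of the uniqueness property only; the experiment itself lets the server-side gen_random_uuid() populate the key.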
create extension pgcrypto
as explained in the YugabyteDB documentation here. (The blog post “How to: PostgreSQL Fuzzy String Matching In YugabyteDB” describes another interesting use of a PostgreSQL extension.)
I wrote a Python program to execute, and time, multi-statement transactions using a varying number of SQL statements per transaction. I implemented each transaction both using the hard shell approach and using the bag of tables approach.
The bag of tables approach
I modeled the approach that treats the database as a bag of tables by first preparing a single insert statement thus:
yb_session.execute("""
    prepare stmt(varchar) as
    insert into t(v) values($1::varchar)
    """)
I’ll assume that you understand PostgreSQL’s prepare-execute paradigm for efficient SQL execution. It’s described in the YugabyteDB documentation here and here. I’ll assume, too, that you understand the wisdom of using explicit type casting to avoid the possibly dangerous effects brought by allowing the datatypes of manifest constants to be implicitly converted. Here’s how the SQL standard expresses it:
cast('dog' as varchar)
Here’s the PostgreSQL shorthand for the same type cast:
'dog'::varchar
Then I modeled, in this example, a five-statement transaction thus:
yb_session.execute("start transaction")
for j in range(0, 5):
    v = <the next value>
    yb_session.execute("execute stmt(%s::varchar)", (v,))
yb_session.execute("commit")
Don’t be distracted by the Python syntax!
Rather, read it as pseudocode that specifies the approach that you’d use in any client-side programming language that lets you create a database session and issue SQL statements. Focus instead on the text inside the double quotes. This is standard PostgreSQL. Python uses the %s notation to denote a target for text substitution. And the strange-looking (v,) notation is Python’s syntax for a one-element tuple; in general, you supply a tuple with exactly as many values as there are %s substitution sites. So in a different test, you’d write this:
yb_session.execute("""
    prepare stmt(uuid, varchar) as
    insert into t(k, v) values($1::uuid, $2::varchar)
    """)
and then this:
k = <the next k value>
v = <the next v value>
yb_session.execute("execute stmt(%s::uuid, %s::varchar)", (k, v,))
You could, of course, directly concatenate the text values of the Python variables k and v into the SQL statement that you intend to execute, but by doing this you’d take on the responsibility (and the risk) of making this SQL-injection proof. The mechanism that I show above checks for safety behind the scenes. Contrast this (I found it a bit confusing when I first learned it) with the clarity of the PL/pgSQL locution that has the same effect in the source code of procedure insert_rows() below.
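The site-matching rule can be sketched without a database: one value in the tuple per %s substitution site. (Real drivers bind the values safely on the wire rather than doing plain text substitution; this shows only the counting.)

```python
# Database-free sketch: the values tuple must supply one value per %s site.
sql = "execute stmt(%s::uuid, %s::varchar)"
values = ("7c9e6679-7425-40de-944b-e07fc1f90ae7", "dog")  # illustrative values

print(sql.count("%s") == len(values))  # True: counts must match
```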
I explored the effect that the number of SQL statements between “start transaction” and “commit” has on the execution times by running transactions with 1, 5, 10, 20, 50, and 100 statements. To get times that were long enough to measure reproducibly, I timed inserting 10,000 rows. When each transaction inserted, say, 5 rows, the batch had 2,000 transactions; and when each transaction inserted, say, 100 rows, the batch had 100 transactions.
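The batch arithmetic above is just total rows divided by rows per transaction:

```python
# Transactions per batch for each tested transaction size (10,000 rows total).
total_rows = 10_000
for rows_per_txn in (1, 5, 10, 20, 50, 100):
    print(rows_per_txn, total_rows // rows_per_txn)
```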
The hard shell approach
I modeled the approach that encapsulates the database functionality behind a stored procedure API by using a procedure that takes an array of varchar values created thus:
create procedure insert_rows(vs in varchar[])
  language plpgsql
as $body$
declare
  this_v varchar;
begin
  foreach this_v in array vs loop
    insert into t(v) values(this_v);
  end loop;
end;
$body$
You can see clearly here that the insert statement is an ordinary PL/pgSQL statement, terminated as are all such statements, by a semicolon—in other words, it’s static SQL and not a text value. Notice that the table t, and the bind argument this_v, are ordinary PL/pgSQL identifiers. Significantly, PL/pgSQL attempts to resolve these first in SQL scope and then, if the attempt fails, in procedure scope. In this code, t is resolved in SQL scope to the intended table and this_v isn’t—so it’s resolved in procedure scope to the intended declared program variable. Notice, too, that the procedure, executing (following the usual convention) in autocommit mode, implicitly issues an initial “start transaction” and a final “commit”.
Notice, finally, that there is no visible code that implements the prepare-execute paradigm. Another benefit of using stored procedures is that this is done for you automatically, when a static SQL statement is first encountered.
Then I modeled, in this example, a five-statement transaction thus, using the Python locution to create a multi-element array dynamically:
vs = []
for j in range(0, 5):
    v = <the next value>
    vs = vs + [v]
yb_session.execute("call insert_rows(%s::varchar[])", (vs,))
The table below expresses the results using the same philosophy that I used in my previous post, “The Benefit of Partial Indexes in Distributed SQL Databases”. Because the hard shell approach is faster than the bag of tables approach, I divided the time taken for the slower method by the time taken by the faster method to give the speed ratio—doing this for each of the listed number of SQL statements that composed a transaction. For example, a ratio of 1.2 means that the hard shell approach was 1.2x faster than the bag of tables approach.
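The ratio computation itself is simple. The timings in this sketch are made-up illustrations, not the post’s measured numbers:

```python
# Speed ratio as used in the table: slower elapsed time over faster elapsed
# time. The example timings below are hypothetical.
def speed_ratio(slower_s: float, faster_s: float) -> float:
    return slower_s / faster_s

print(round(speed_ratio(6.0, 5.0), 2))  # 1.2
```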
The performance benefit increases with the number of SQL statements that implement the transaction, as expected.
Distributed SQL considerations
The multi-statement transactions modeled by my experiment will, in the general case, be distributed transactions down at YugabyteDB’s storage layer. The principles are explained in the YugabyteDB documentation in the section “DocDB transactions” and in particular in its subsection “Transactional IO path”. A distributed transaction needs a few round trips between the nodes in the YugabyteDB cluster that jointly manage the replication of data—which replication brings the benefits that characterize a distributed SQL database. Of course, the well-documented benefits of resilience and scalability inevitably come at a cost. In a monolithic SQL database, the client-server round trips are the only round trips. But in a distributed SQL database, there can, for a given multi-statement transaction, be more round trips between the nodes in the cluster than between the client and the cluster node to which it connects. This explains why the speed ratios that my table shows are somewhat smaller than you’d see if you ran my experiment using a vanilla PostgreSQL database.
The case for stored procedures (correct storage and retrieval of business-critical data, robust security, and highly productive problem diagnosis and system maintainability), that I presented in this post’s first section, is unassailable when the application uses a distributed SQL database, just as it is when it uses a monolithic SQL database. However, the performance case is not quite as strong for a distributed SQL database as it is for a monolithic SQL database. Notice, though, that for more challenging data manipulation requirements, hand-written SQL seems always to be more efficient than robotically constructed SQL—and robotically constructed SQL often comes hand in hand with middle tier frameworks that encourage the bag of tables approach.
Of course, to see the benefits of stored procedures, you have to use a distributed SQL database that supports them! YugabyteDB supports PostgreSQL’s PL/pgSQL. But I have yet to hear of another distributed SQL database that supports any kind of stored procedures. Certainly, neither Google Spanner, nor CockroachDB, nor TiDB support them. Amazon Aurora has both PostgreSQL and MySQL variants—and so does support stored procedures—and it does offer several of the benefits that a truly distributed SQL database brings. However, its lack of true horizontal scalability and geo-distribution can be major issues for applications that want the ability to scale or geo-distribute writes on-demand.