Distributed SQL Tips and Tricks – May 8, 2020

May 8, 2020

Welcome to this week’s tips and tricks blog where we recap some distributed SQL questions from around the Internet. We’ll also review upcoming events, new documentation, and blogs that have been published since the last post. Got questions? Make sure to ask them on our YugabyteDB Slack channel, Forum, GitHub, or Stackoverflow. Ok, let’s dive right in:

How can I connect YugabyteDB with Apache Spark?

There are two ways to connect YugabyteDB to Apache Spark, depending on which distributed SQL API you are using. For YCQL we provide a Spark connector library and for YSQL you can use Apache Spark’s JDBC API.

Why is an external load balancer recommended for the YSQL API but not for the YCQL API?

Yugabyte’s two distributed SQL APIs have different needs for load balancers because of the underlying differences in their client driver implementations. YSQL relies on PostgreSQL client drivers that are neither cluster aware nor shard aware. In other words, they are not yet “smart”. On the other hand, YCQL relies on Apache Cassandra Query Language’s smart client drivers that are both cluster aware (and hence do not require a load balancer) and shard aware. Let’s dive deeper to understand what cluster awareness and shard awareness actually mean.

In case of YSQL, most of PostgreSQL’s client drivers take only one YugabyteDB node’s IP as input and will always route requests to that node. While there is no functional issue here since all YugabyteDB nodes are equal and can accept client requests for all the data in the cluster (they will internally route the requests, collect the results, and respond to the client with the results), a load balancer simply abstracts away the cluster of nodes into a single IP so that the client driver can send requests to all the nodes in the cluster. This is different from how YCQL client drivers work. The input to a Cassandra QL client driver is typically a list of IPs of the nodes in the cluster. Even if you give only one node’s IP, the rest of the nodes of the cluster are auto discovered by the client driver after it connects to the cluster for the first time. Thereafter cluster configuration changes such as add/remove nodes are also propagated back to all active clients. Because of this built-in cluster awareness, there is never a primary node in the world of YCQL.

Coming to shard awareness, even though all nodes receive the YSQL requests using a load balancer, there can be a lot of requests that reach nodes that do not have the necessary data since the PostgreSQL driver is not shard aware. Again there is no functional issue but a performance issue, which any distributed RDBMS faces, given additional hops between the nodes to query the data. In case of YCQL, along with an updated list of nodes, the client driver also caches the shard metadata which has the mapping of which shard/tablet is located on which node. As a result, the client driver is able to route queries to the relevant nodes directly without the need for any additional hops.

How can I order returned rows by the `IN` list of values?

Sometimes you want to return the rows ordered as specified in a supplied list. This can be easily done like in the example below:

CREATE TABLE books(name TEXT, publication_year TIMESTAMP, PRIMARY KEY(name ASC));

INSERT INTO books values('Pride and Prejudice', '01-28-1813');
INSERT INTO books values('The Great Gatsby', '04-10-1925');
INSERT INTO books values('The Catcher in the Rye', '07-16-1951');
INSERT INTO books values('Little Women', '09-30-1868');
INSERT INTO books values('The Adventures of Huckleberry Finn', '12-10-1884');

Now, say that you want to search for 3 books and you do this using the IN clause.

SELECT * FROM books WHERE name IN ('The Adventures of Huckleberry Finn', 'Pride and Prejudice', 'The Great Gatsby');

                name                |  publication_year   
------------------------------------+---------------------
 Pride and Prejudice                | 1813-01-28 00:00:00
 The Adventures of Huckleberry Finn | 1884-12-10 00:00:00
 The Great Gatsby                   | 1925-04-10 00:00:00
(3 rows)

As you can see, the rows are returned in order of name and not in the order in which they appear in the IN clause.

If you want to retrieve rows and order them based on their order in the IN clause, you can make use of the VALUES clause as shown below:

select b.*                                                                                                                                                         from books b                                                                                                                                                                  join (                                                                                                                                                                          values                                                                                                                                                                          ('The Adventures of Huckleberry Finn',1),                                                                                                                                     ('Pride and Prejudice',2),                                                                                                                                                    ('The Great Gatsby',3)                                                                                                                                                    ) as x (name, sortorder) on b.name = x.name                                                                                                                                   order by x.sortorder;

                name                |  publication_year   
------------------------------------+---------------------
 The Adventures of Huckleberry Finn | 1884-12-10 00:00:00
 Pride and Prejudice                | 1813-01-28 00:00:00
 The Great Gatsby                   | 1925-04-10 00:00:00
(3 rows)

Alternatively, you can also use WITH ORDINALITY:

SELECT b.*                                                                                                                                                         FROM   books b                                                                                                                                                                JOIN   unnest('{"The Adventures of Huckleberry Finn",
"Pride and Prejudice",
"The Great Gatsby"}'::text[])
WITH ORDINALITY t(name, sortorder) USING (name)                    ORDER  BY t.sortorder;

                name                |  publication_year   
------------------------------------+---------------------
 The Adventures of Huckleberry Finn | 1884-12-10 00:00:00
 Pride and Prejudice                | 1813-01-28 00:00:00
 The Great Gatsby                   | 1925-04-10 00:00:00
(3 rows)

How can I create a backup of a single table in a YSQL database?

You can backup databases in YSQL using the ysql_dump tool. And a single table using the --table argument.

Assuming you have a database called yugabyte and a table called helloworld, the full command to backup the table is:

./postgres/bin/ysql_dump --schema public --table books --serializable-deferrable --file “helloworld.dump.txt” yugabyte

What is the query to list all table names in my YugabyteDB distributed SQL database?

Getting a list of all tables is easily done by querying the information_schema.tables view like below:

yugabyte=# SELECT table_name
  FROM information_schema.tables
 WHERE table_schema='public'
   AND table_type='BASE TABLE';
 table_name
------------
 comments
 users
 books
(3 rows)

See the full docs on the information_schema.tables view for additional details.

Synopsis and important topics on array data types
The array[] value constructor
Creating an array value using a literal
Built-in SQL functions and operators for arrays — covers 30 functions and operators for working with arrays, and examples for each of them