The Most PostgreSQL-Compatible Distributed SQL Database is YugabyteDB
This morning we announced the general availability of YugabyteDB 2.11, a major release which extends PostgreSQL compatibility of the open source distributed SQL database! These updates allow application developers to use powerful and familiar PostgreSQL features without compromising resilience, scale, or performance.
One of the biggest roadblocks to database adoption is a lack of familiarity. Developers love PostgreSQL and want to work with the familiar API in order to be productive. This is why many distributed SQL databases now tout compatibility with PostgreSQL. But “kind of compatible” is not enough. These databases cherry-pick features to support leaving developers to figure out how to work around limitations. We heard loud and clear from the YugabyteDB community that true distributed SQL with complete PostgreSQL compatibility is the gold standard. That is what we are delivering.
This release extends YugabyteDB’s lead as the most PostgreSQL-compatible distributed SQL database in the world. YugabyteDB 2.11 brings widely used PostgreSQL features to the database, including Foreign Data Wrappers (FDW), GIN indexes, collation support, follower reads, improved space amplification and read-committed isolation levels for concurrent transactions.
In this post, we get into these features in more detail.
Similar to PostgreSQL, YugabyteDB supports the following three container column types. These three data types are useful in different cases:
- jsonb – Makes it useful for schema-less data use cases
- tsvector – Supports full text search through a sorted list of distinct lexemes
- array – Gives the flexibility to store a large composite variable-length value in a single field
While these column data types are useful, their utility is severely curtailed without an efficient way to index them. Let’s understand why these regular indexes are not sufficient through the following containment search examples.
1. Filtering text columns based on what words they contain: Let’s take an example table from YugabyteDB docs:
CREATE TABLE book (page int PRIMARY KEY, word text);
To retrieve pages 4 to 7, you can easily create a regular index and use this query:
SELECT * FROM book WHERE page >= 4 and page <= 7;
But looking for the specific word ‘foo’ would need a full table scan
SELECT * FROM book WHERE words && ARRAY['foo'];
Creating a regular index won’t help since you still need to search words for foo.
2. Filtering array columns based on what array elements they contain: Let’s take an example of a “students” table. The table consists of name, exam scores as a two dimensional array, and phone number. We could find out which student has a particular contact number efficiently by creating a secondary index on the phone number column. But to find out the number of students who got a 90 on one of the exams would need a full table scan.
3. Similarly, we see JSON use cases where users need to filter based on the existence of top-level keys or primitive values of the JSON column, as well as filtering JSON columns based on primitive values of deeply-nested keys.
None of the questions posed by these use cases of containment searches over container columns types (Array, JSON) are efficiently answerable by conventional indexes without full table scans. In order to support these data types in a way that is useful to our customers, we need to be able to index these data types in a more fine-grained way than the conventional regular indexes, and GIN indexes are the answer.
In YugabyteDB 2.11, we added support for generalized inverted indexes, also known as GIN indexes. GIN indexes were originally developed to support full-text search in PostgreSQL, but they are generalized in the sense that they can be used with other data types (in particular, arrays, tsvectors and jsonb documents).
You can create the index using USING ybgin to specify the index access method:
CREATE INDEX <optional name> ON <table> USING YBGIN (<column>);
GIN indexes store mappings from values within a container column to the row that holds that value, which helps speed up these types of searches or queries. GIN indexes give more than 3x performance improvement for containment searches on container column types.
The collation feature in YugabyteDB 2.11 allows specifying the sort order and character classification behavior of data per-column, or even per-operation according to language and country-specific rules. Conceptually, every expression of a collatable data type has a collation. The built-in collatable data types are text, varchar, and char.
Languages vary regarding how letters should be compared (and in which order they are to be applied), as well as what constitutes a fundamental element for sorting. Some examples are shown below:
|Language||Swedish||z < ö|
|Language||German||ö < z|
|Usage||German Dictionary||of < öf|
|Usage||German Phonebook||öf < of|
|Customizations||Upper-First||A < a|
|Customizations||Lower-First||a < A|
Swedish treats ä as an individual letter, sorting it after z in the alphabet; German, however, sorts it either like ae or like other accented forms of a, thus following a.
It is crucial for databases that the same principles about collation behavior apply to realms beyond sorting. In particular, searching or matching should behave consistently with sorting.
Lack of collation for matching can cause significant problems for customers as the user may not realize that records are missing in a SELECT query. For example, if a German businessman making a database selection to sum up revenue in each of the cities starting from O to P for planning purposes does not realize that all cities starting with Ö were excluded because the query selection was using a Swedish collation, he will be one very unhappy customer.
Both PostgreSQL and YSQL rely on the operating system for collation support. More specifically, there are two collation support libraries: libc (the standard C library) and libicu (International Components for Unicode library). YSQL supports all the OS-supplied ICU collations and only a few libc collations such as the “C” collation and the “en_US ” collation. The pg_collation system catalog table contains all the collations that can be used in YSQL, including the predefined collations that are imported from libc and libicu at initdb time, and any user-defined collations created after that.
When no explicit collation is specified, all character string data types have a default collation. In addition to predefined collations, you can create new collations. You can define collation on columns to override the default “C” collation in order to have a different sort order on the column values.
When a table column has an explicit collation, an index built on the column will be sorted according to the column collation. YSQL also allows the index to have its own explicit collation that is different from that of the table column. This can be useful to speed up queries that involve pattern matching operators such as LIKE because a regular index will be sorted according to collation “en-US-x-icu” and such an index cannot be used by pattern matching operators.
Foreign data wrappers (FDWs) allow PostgreSQL databases to treat tables in remote databases as locally available tables. This enables developers to effortlessly write queries that access data in the external data source as if they were coming from a table in the PostgreSQL database. Support for FDWs allows developers to write a JOIN query that accesses data in an external cloud native database. Read this blog post by database expert and YugabyteDB community member Radek Gruchalski about how to use FDWs with YugabyteDB.
PostgreSQL uses a technique called multiversion concurrency control to isolate concurrent transactions and ensure data consistency. The SQL-92 standard defines four levels of transaction isolation: serializable, repeatable read, read committed, and read uncommitted.
We ensure user-visible transaction semantics in YSQL are the same as that of PostgreSQL to the extent theoretically possible barring any differences that stem from the distributed nature of the database.
YugabyteDB already supports the two strictest isolation levels, serializable and snapshot (analogous to repeatable read). YugabyteDB 2.11 adds support for read committed transaction isolation. Read Committed is the default isolation level in PostgreSQL. When a transaction uses this isolation level, a SELECT query (without a FOR UPDATE/SHARE clause) sees only data committed before the query began; it never sees either uncommitted data or changes committed during query execution by concurrent transactions. In effect, a SELECT query sees a snapshot of the database as of the instant the query begins to run. However, SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed.
With YugabyteDB 2.11, the database now provides faster reads in situations where you can afford to read data that is slightly stale.
It’s very common in real-world database schemas to have several small tables which represent distinct types of entities such as ‘status of orders’, ‘priority of financial assets’, ‘location codes’, ‘historical closing prices’, ‘type of warehouses’ , and TPCC’s items table. These reference tables seldom receive updates and are mostly used as foreign key tables in applications.
These kinds of tables present some challenges for globally distributed databases like YugabyteDB. In order to make sure that the reads are consistent, the read operation needs to go to the table’s leaseholder. This causes poor performance as the leaseholder can be geographically distant from the node that is issuing the query.
This poor performance is unacceptable for customers who desire low-latency reads against reference tables. However, follower reads can improve performance if these customers are willing to accept bounded staleness for reads on these tables.
Follower read is a key optimization useful for users who have deployed YugabytedDB in a multi-region, geo-distributed fashion. This staleness threshold is configurable and available only in read-only transactions.
The follower reads are not enabled by default. To enable follower reads you need to set the session variable yb_read_from_followers to true and mark the transaction as read only.
You can configure the staleness bound using yb_follower_read_staleness_ms session variable.
Let’s explore how you can enable and use follower reads:
SET yb_read_from_followers = true; START TRANSACTION READ ONLY; SELECT * from t WHERE k='k1'; --> follower read k | v ----+---- k1 | v1 (1 row) COMMIT;
You can also specify and override default staleness bound explicitly:
SET yb_read_from_followers = true; SET yb_follower_read_staleness_ms = 5000; START TRANSACTION READ ONLY; SELECT * from t WHERE k='k1'; --> follower read with 5 seconds staleness bound k | v ----+---- k1 | v1 (1 row) COMMIT;
To mark the transaction as read only, users can do one of the following:
- SET TRANSACTION READ ONLY – applies only to the current transaction block.
- SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY – applies the read-only setting for all statements and transaction blocks that follow.
- SET default_transaction_read_only = TRUE – applies the read-only setting for all statements and transaction blocks that follow.
ANALYZE gathers statistics for the query planner to create the most efficient query execution paths. These accurate statistics help the query planner choose the most appropriate query plan, thereby improving the speed of query processing. The YugabyteDB implementation is based on the framework provided by PostgreSQL, which requires the storage layer to provide a random sample of rows of a predefined size. The size is calculated based on several factors, such as the included columns’ data types.
Merge join is based on the idea of sorting the left- and right-hand tables into order and then scanning them in parallel. This needs the ammarkpos (Mark current scan position) and amrestrpos (Restore marked scan position) PostgreSQL functions to be supported in the access method. In the 2.11 release, we implemented these functions and then enabled merge joins. This offers faster, more efficient joins when input data is sorted on the join columns.
Additionally, we improved cost estimation and row size determination for other join types like nested loop and hash joins.
YugabyteDB stores its data in DocDB, a Google Spanner-inspired distributed document store. DocDB’s per-node storage engine is a customized fork of RocksDB. This enables a number of optimizations related to space amplification.
Minimizing space amplification is important for efficient hardware use because storage space can be a bottleneck in production environments. In a typical production environment, SSDs process far fewer reads and writes during peak times than what the hardware is capable of.
The per node query rate is low because the amount of data that has to be stored (and be accessible) is so large. It has to be sharded across many nodes to fit, and the more nodes, the fewer queries per node. If the SSD could store twice as much data, then we would expect storage node efficiency to double, since the SSDs could easily handle the expected doubling of IOPS, and we would need far fewer nodes for the workload.
This issue drives our focus on space amplification. In YugabyteDB 2.11 we made various improvements towards reducing space amplification.
Let’s go through these improvements in the following sections.
YugabyteDB offers the ability to compress network traffic between nodes of a cluster in order to reduce the amount of bandwidth used between peers. With various data compression algorithms, we significantly reduce costs associated with cross-AZ or cross-region network traffic with minimal CPU overhead. This feature supports different compression algorithms. In addition to the existing gzip, Snappy, and LZ4 compression algorithms we added Zlib. This feature supports easy enabling or disabling of compression with rolling upgrade support and provides the ability to select the compression algorithm of your choice.
RocksDB uses delta encoding of keys to reduce the storage footprint. The delta encoding scheme in RocksDB does simple prefix compression. This works well if the common part of the two adjacent keys is at the prefix, and the non-common part is at the end.
The YugabyteDB key format is something like:
For a row with say N non-primary key columns, N such keys are written that only differ in the column-id (and perhaps the uniq-write-id). Also note that in cases where the column values themselves are small, such as integers or small text fields, the key portion can dominate the space usage.
A simple prefix-compression based delta encoding scheme is only able to save the bytes to the left of the <column-id> and therefore not as effective. So in the 2.11 release we added a “custom” delta encoding scheme by not writing repeated prefixes of previous keys. The new custom delta encoding scheme gave nearly 90% savings on keys.
Instead of iterating through every K/V pair in the file to identify expired entries, we improve disk usage and compaction performance by directly removing files that have completely expired based on their TTL.
These new optimizations are just the latest in a long list related to space amplification that YugabyteDB already offers:
- Block-based splitting of bloom/index data: RocksDB’s index and bloom filters have been enhanced in YugabyteDB to be multi-level/block-oriented structures so that these metadata blocks can be demand-paged into the block cache much like data blocks. This enables YugabyteDB to support very large data sets in a RAM efficient and memory allocator friendly manner.
- Size-tiered compactions: YugabyteDB’s compactions are size tiered. This has the advantage of lower disk write (IO) amplification compared to level compactions. The space amplification concern of using size-tiered compactions does not hold true in YugabyteDB because each table is broken into several shards, and the number of concurrent compactions across shards is throttled. As a result, the typical space amplification in YugabyteDB tends to be not more than 10-20%.
- Smart load balancing across multiple disks: DocDB supports a just-a-bunch-of-disks (JBOD) setup of multiple SSDs and doesn’t require a hardware or software RAID. The RocksDB instances for various tablets are balanced across the available SSDs uniformly, on a per-table basis.
- Efficient C++ implementation: There is no “stop-the-world” GC that needs to happen, which helps keep latencies low and consistent.
- On-disk block compression: This capability lowers read/write IO while an in-memory uncompressed block cache results in very low CPU overhead and latency.
- Compaction throttling & queues: Globally throttled compactions and small/big compaction queues help mitigate against compaction storms overwhelming the server.
We’ve been on the journey to near-complete PostgreSQL compatibility for some time. Features such as stored procedures, triggers, partial indexes, hash joins, indexes, and user-defined types set us apart from the competition. With these new features in YugabyteDB 2.11, we’re making YugabyteDB the most familiar and developer-friendly distributed SQL database on the market.
At Yugabyte, we strive to be fully transparent with our customers and user community, and to that end we always share our product roadmap. Several database features are on the roadmap. Additionally, continued work to unlock even greater database performance is always a work in progress. Here are some notable features you can expect in upcoming releases. Note that the current roadmap is subject to change as we finalize our planning for the next releases.
- Upgrade to PostgreSQL 13: For latest features, new PostgreSQL extensions, performance, and community fixes
- Change Data Capture
- Support for Prepare transaction for two-phase commit or external transaction management systems
- Support for in-cluster PITR (point in time recovery) of YSQL databases, to a fixed point in time, across DDL and DML changes
- Support for materialized views
- Enable automatic table splitting by default
- Geo-partitioning support for the transaction status table
We’re thrilled to be able to deliver these enterprise-grade features in the newest version of our flagship product – YugabyteDB 2.11. We invite you to learn more and try it out:
- YugabyteDB 2.11 is available to download. You can install the release in just a few minutes.
- Join us in Slack for interactions with the broader YugabyteDB community and real-time discussions with our engineering teams.
NOTE: Following YugabyteDB release versioning standards, YugabyteDB 2.11 is a preview release. Many of these features are now generally available in our latest stable release, YugabyteDB 2.12.