PostgreSQL vs. Distributed SQL: Understanding Behavior Differences
One common question we receive from new users testing distributed SQL is whether they should expect different behaviors compared to traditional, single-node PostgreSQL. While distributed SQL and PostgreSQL share many similarities, it’s important to understand that this doesn’t guarantee identical behavior in every case.
Distributed SQL is a database category that combines the familiar relational database features (found in PostgreSQL) with the scalability and availability advantages of NoSQL systems. Unlike single-node systems like PostgreSQL, distributed SQL operates on a cluster of nodes. It has high availability built in, is easily scalable, and distributes data across multiple regions.
To deliver these significant benefits in a single, unified database, distributed SQL is built on a different storage architecture that leads to behavioral differences. Many of these differences are bound by the laws of physics and cannot be completely eliminated—although different vendors approach them in different ways.
On the other hand, single-node systems like traditional PostgreSQL face their own challenges around sharding, scalability, and achieving near-zero recovery time objective (RTO) and recovery point objective (RPO).
So let’s take a look at the differences between distributed SQL and traditional PostgreSQL, along with strategies to mitigate those differences. We will focus on three areas where behavior can differ:
- Network latencies
- Connection pooling
- Transaction conflicts
Note: YugabyteDB 2.17.3 will be used to illustrate points made about distributed SQL.
In distributed SQL, tables and indexes are automatically sharded and distributed across the cluster nodes, enabling easy scalability as application throughput grows. This scalable, highly available architecture means that queries often need to travel over the network between nodes. As a result, queries in a distributed SQL database may have higher latencies than in single-node databases, since data retrieval may involve multiple nodes.
In YugabyteDB, queries that filter on the primary key in the WHERE clause and touch a single shard perform much as they would in single-node PostgreSQL; latency is very close to Postgres. This is thanks to the fast path used by single-shard transactions, which lets them bypass the coordination work needed to guarantee cross-shard consistency. To optimize query performance, try to ensure that simple transactions do not unnecessarily become distributed transactions.
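To see why a primary-key query can take the fast path, it helps to picture how hash sharding routes a key to a single shard. The sketch below is a toy model, not YugabyteDB's actual implementation: the hash function, tablet count, and key names are illustrative, though YugabyteDB does map keys into a 16-bit hash space that is split across tablets.

```python
import hashlib

NUM_TABLETS = 4
HASH_SPACE = 65536  # keys are hashed into a 16-bit space (0-65535)

def key_hash(pk: str) -> int:
    # Toy stand-in for the real hash function
    return int.from_bytes(hashlib.md5(pk.encode()).digest()[:2], "big")

def tablet_for(pk: str) -> int:
    # Each tablet owns a contiguous slice of the hash space,
    # so a point lookup on the primary key resolves to exactly one tablet
    return key_hash(pk) * NUM_TABLETS // HASH_SPACE
```

Because `tablet_for` is deterministic, a `WHERE pk = ...` query can be sent straight to the one node holding that tablet, with no cross-shard coordination.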
The primary differences in latency between PostgreSQL and distributed SQL databases are usually observed in distributed transactions. Each query involving multiple shards* requires additional remote procedure calls (RPCs) to fetch the required data. When dealing with tables containing millions of rows, performing large table scans and JOIN operations may result in higher latencies compared to PostgreSQL. As mentioned before, YugabyteDB, along with other distributed OLTP database systems, is designed specifically to be highly available, scale horizontally with ease, and bring data closer to the user. To enhance speed for tables that primarily require large table scans, you can consider using RANGE sharding instead of the default HASH sharding method. Various techniques exist to design distributed indexes and optimize query performance.
- How to design distributed indexes for optimal query performance
- When to use range sharding vs. hash sharding
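The range-vs-hash tradeoff can be sketched in a few lines. With range sharding, tablets own contiguous, sorted key ranges, so a range scan only touches the run of tablets covering the scan bounds; under hash sharding the same scan would scatter across all tablets. The split points below are illustrative, not taken from any real cluster.

```python
import bisect

# Illustrative split points for a range-sharded table on a text key:
# tablet 0 owns [min, "g"), tablet 1 ["g", "n"), tablet 2 ["n", "t"), tablet 3 ["t", max)
SPLIT_POINTS = ["g", "n", "t"]

def tablet_of(key: str) -> int:
    # The tablet whose key range contains `key`
    return bisect.bisect_right(SPLIT_POINTS, key)

def tablets_for_scan(lo: str, hi: str) -> list[int]:
    # A range scan touches only the contiguous run of tablets covering [lo, hi]
    return list(range(tablet_of(lo), tablet_of(hi) + 1))
```

For example, a scan from "h" to "p" touches only tablets 1 and 2, while a hash-sharded table would have to fan the same scan out to every tablet.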
For tables that do not require sharding and scalability, colocation allows grouping small tables into a single shard*, improving latency by reducing network trips while ensuring data replication across nodes.
*A shard is called a tablet in YugabyteDB.
Other best practices to optimize performance include:
- Minimizing network latencies between the client and the database and keeping roundtrip requests to a minimum.
- Leveraging stored procedures to reduce roundtrips and improve transaction efficiency.
- Utilizing yb_stats to perform an in-depth analysis of the impact of latencies.
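The value of minimizing roundtrips and leveraging stored procedures can be shown with a back-of-envelope cost model. The numbers here are purely illustrative assumptions, not measurements from any real deployment.

```python
def total_latency_ms(roundtrips: int, network_rtt_ms: float, server_work_ms: float) -> float:
    # Every client-server roundtrip pays the full network round-trip time
    return roundtrips * network_rtt_ms + server_work_ms

# 20 single-row statements issued one by one vs. the same work
# submitted in a single call (e.g. via a stored procedure)
chatty = total_latency_ms(roundtrips=20, network_rtt_ms=1.0, server_work_ms=4.0)
batched = total_latency_ms(roundtrips=1, network_rtt_ms=1.0, server_work_ms=4.0)
```

Under these assumed numbers, the chatty approach costs 24 ms versus 5 ms batched; the gap grows linearly with the client-to-database RTT, which is why roundtrip count matters even more in multi-region topologies.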
When focusing on single point queries, YugabyteDB behavior aligns very closely with PostgreSQL. However, features like JOINs and subqueries can be more costly in any distributed database if they involve scanning a significant amount of data. To address this, YugabyteDB is actively enhancing expression pushdown functionality, packed rows, and batched nested loops (in preview).
When establishing a new connection in traditional PostgreSQL, metadata from the internal catalog tables needs to be cached. YugabyteDB, because it utilizes the PostgreSQL query layer, does the same caching. And because of the distributed nature of YugabyteDB’s storage layer, the catalog is also distributed.
When establishing new connections, the YugabyteDB Tablet Server (or TServer) you connect to sends the request to the master leader, which might be located on another server. As a result, there may be an increase in latency. Similarly, when executing a statement using a different table, additional catalog data might need to be fetched if not cached already. This also leads to an increase in latency, although it is much less significant than when establishing a new connection.
For a more comprehensive understanding of the process, you can explore the topic of Postgres’ first database connection latency. It’s worth noting that once this initial connection is completed, subsequent queries benefit from the cached information, leading to significantly lower latencies. NOTE: This is true for both PostgreSQL and YugabyteDB, and follows the same latency pattern/principle—the difference being that catalog data is cached in the backend, and regular data is cached in the server. Therefore, the impact of this process is primarily felt during the first connection establishment and when making schema changes, since schema changes necessitate a catalog refresh.
To optimize performance and minimize network overhead, we recommend using a connection pooler. In Java, it is recommended to use an in-app connection pooler like Hikari or Tomcat. Even though these poolers incur a latency penalty when first obtaining connections, once the pool establishes a connection, the application can reuse it. At that point, there is no latency penalty, since the existing connection is used.
If an in-app connection pooler is not feasible, external options such as pgbouncer, odyssey, or pgcat can be used. These connection poolers multiplex multiple application connections into a few physical connections to the database. By reusing connections, the need to frequently cache this data is reduced, depending on the settings.
Various improvements to enhance connection management in YugabyteDB are underway, including efforts around adding a built-in connection pool and implementing changes aimed at optimizing connection scaling, such as eliminating RPCs and locally loading the cache. You can already enjoy significant performance improvements by enabling ysql_enable_read_request_caching and ysql_catalog_preload_additional_tables.**
**These flags are not yet recommended for production and may not benefit all use cases.
Unlike many distributed SQL databases, both PostgreSQL and YugabyteDB fully support the Read Committed isolation level. PostgreSQL sets it as the default isolation level, but in YugabyteDB the default is the stricter Snapshot Isolation, which maps to Repeatable Read in PostgreSQL. Because of this stricter level, users may occasionally hit a 40001 serialization error, which may require implementing retries in the application logic.
YugabyteDB does support Read Committed (it is just not the default) to mimic the experience users expect with any PostgreSQL-compliant database. Our goal is to make Read Committed the default setting and implement a wait-based queue—rather than the retry-based queue that is available today—to further enhance performance.
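A common pattern for handling 40001 errors is to wrap the whole transaction body in a retry loop with backoff. The sketch below uses a stand-in exception class; with a real driver such as psycopg2 you would catch its serialization-failure error (SQLSTATE 40001) instead, and the backoff parameters here are illustrative.

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001."""
    sqlstate = "40001"

def run_with_retries(txn, max_attempts: int = 5):
    # Re-run the entire transaction body on a serialization failure;
    # any other exception propagates immediately.
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter so conflicting
            # transactions don't collide again in lockstep
            time.sleep(min(0.1 * 2 ** attempt, 1.0) * random.random())
```

Note that the retried unit must be the full transaction (begin through commit), not an individual statement, since the failure means the transaction's snapshot can no longer be used.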
For more information:
- Visit GitHub for more information on wait-based queues
- Read Strong Consistency with YugabyteDB to learn more about transaction isolation levels and their implications.
Yugabyte made the strategic decision to leverage PostgreSQL as a core building block for our database. Instead of trying to reinvent the wheel, we leveraged the existing PostgreSQL query layer, and the results have been exceptional. This approach enabled us to provide key SQL functionality, including stored procedures, triggers, and functions. Additionally, we can seamlessly support widely used drivers, ORMs, and the extensive PostgreSQL ecosystem. This, in turn, has paved the way to supporting different third-party tools and integrations.
But our commitment to PostgreSQL compatibility goes beyond simply reusing the query layer. We understand that “compatibility” encompasses a broad spectrum, since there are four distinct levels of PostgreSQL compatibility.
By adopting the actual PostgreSQL engine and reusing its source code, we have effectively eliminated barriers to entry for distributed SQL. YugabyteDB’s visionary approach combines the best of both worlds: the innovation of a cutting-edge database and the compatibility of an industry-leading ecosystem. This harmonious integration enables developers and businesses to embrace the future confidently, backed by a reliable and adaptable solution.
YugabyteDB and PostgreSQL both offer unique strengths and considerations, making them ideal choices for different scenarios. Today, YugabyteDB excels when horizontal scalability, built-in high availability, and multi-regional capabilities are required. Its distributed storage layer empowers you with these advanced features, while the query layer allows your application to talk to our database as if it were PostgreSQL. The result is a powerful foundation for today’s cloud native applications.
If your application doesn’t need these specific capabilities, PostgreSQL remains a great option. With its established track record and extensive ecosystem, it remains a reliable choice for traditional single node database requirements.
The YugabyteDB team is continuously delivering enhancements, such as colocation, data caching, and memory communication improvements. YugabyteDB is steadily approaching the performance levels and latencies synonymous with PostgreSQL. As we strive for PostgreSQL parity, YugabyteDB presents an increasingly compelling choice for an ever-wider set of applications that, now or in the future, may need the scalability, resilience, and geo-distribution/data residency capabilities provided by its distributed architecture.
If you are looking to speak to the YugabyteDB team to better understand if we are the right technology for your application, please reach out to us. We would also love for you to join us on Slack and Twitter to follow our journey to becoming the default database for the cloud native world.