Optimizing Correlated Subqueries with Semi Joins in PostgreSQL and YugabyteDB

Mark Peacock

Yugabyte expert Mark Peacock recently encountered a query in YugabyteDB (PostgreSQL) involving a correlated subquery written with ANSI join syntax. Correlated subquery optimization is a common topic, but this experience surfaced some valuable nuances in query structuring and PostgreSQL’s handling of EXISTS clauses. It also showed how this behavior can differ with pre-ANSI join syntax.

This blog explores a real-world optimization scenario in YugabyteDB (and PostgreSQL), where a correlated subquery is executed inefficiently due to nested loop overhead.

Below Mark details how PostgreSQL internally handles EXISTS clauses and highlights the importance of understanding Semi Join versus SubPlan. He shares insights from query planning analysis (using EXPLAIN ANALYZE) and covers the (surprisingly!) subtle impact of SQL syntax (ANSI vs. pre-ANSI).

The Original Problematic Query

SELECT product_id, product_name
FROM products
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_products op ON op.order_id = o.id AND op.product_id = products.product_id
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE (NOT o.international OR o.sales_rep = 987654)
      AND o.cancelled = FALSE
      AND c.region IN (987654, 12345, 67890, 112233, 445566)
)
ORDER BY LOWER(product_name)
LIMIT 100;

The original query was inefficient because it used a correlated subquery: the inner query depended on a value from the outer query (op.product_id = products.product_id). A correlated subquery executes once for every row of the outer query, so in this scenario the inner query’s nested loops ran repeatedly, adding significant overhead.

Identifying Performance Issues Through EXPLAIN (ANALYZE)
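To reproduce this kind of plan output, run the query under EXPLAIN (ANALYZE). A minimal sketch, using the original query from above; the COSTS OFF option is my own addition to trim the output so it roughly matches the plans shown in this post:

```sql
-- Show the actual execution plan with per-node row and loop counts.
-- COSTS OFF suppresses the estimated-cost figures in each plan node.
EXPLAIN (ANALYZE, COSTS OFF)
SELECT product_id, product_name
FROM products
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_products op ON op.order_id = o.id AND op.product_id = products.product_id
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE (NOT o.international OR o.sales_rep = 987654)
      AND o.cancelled = FALSE
      AND c.region IN (987654, 12345, 67890, 112233, 445566)
)
ORDER BY LOWER(product_name)
LIMIT 100;
```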

Limit
  ->  Index Only Scan using product_index on products
        Filter: (SubPlan 1)
        Rows Removed by Filter: 1119
        SubPlan 1
          ->  YB Batched Nested Loop Join
                Join Filter: (o.customer_id = c.customer_id)
                Rows: 0 Loops: 1219
                ->  YB Batched Nested Loop Join
                      Join Filter: (o.id = op.order_id)
                      Rows: 2 Loops: 1219
                      ->  Index Scan using product_order_index on order_products op
                            Index Cond: (product_id = products.product_id)
                            Rows: 3 Loops: 1219
                      ->  Index Scan using orders_pkey on orders o
                            Index Cond: (id = op.order_id)
                            Filter: ((NOT cancelled)
                            AND ((NOT international) OR (sales_rep = 987654)))
                            Rows: 2 Loops: 1087
                ->  Index Scan using customers_pkey on customers c
                      Index Cond: (customer_id = o.customer_id)
                      Filter: (region = ANY ('{987654,12345,67890,112233,445566}'))
                      Rows: 0 Loops: 804

The execution plan indicated something wasn’t right: the correlated subquery was executed once per candidate product row (note the Loops: 1219 on the SubPlan’s inner nodes), driving multiple nested loops that join order_products, orders, and customers.

Optimizing the Query Structure

To address these performance issues, after some trial and error, I restructured the query by repositioning the correlation predicate (op.product_id = p.product_id) from the JOIN condition to the WHERE clause.

What Trial and Error Did I Do?

I experimented with indexing, tried LATERAL joins, and more, but saw no significant improvement. However, when examining the EXPLAIN (ANALYZE) output, I noticed an unusual pattern: high loop counts paired with extremely low row counts in the YB Batched Nested Loop nodes.

I’d never seen this before(!) and thought it was worth focusing on, since batched nested loop nodes usually show higher row counts and lower loop counts.

->  Index Scan using product_order_index on order_products op
                        Index Cond: (product_id = products.product_id)
                        Rows: 3 Loops: 1219

Without detailed knowledge of PostgreSQL’s specific handling of EXISTS conditions at the time, I approached the problem logically.

Seeing that the correlated predicate (op.product_id = p.product_id) was evaluated deep inside the inner loops of the subquery, I tried repositioning it closer to the outer table, making order_products the driving (outer) table of the EXISTS subquery.

This created a more familiar execution plan.

->  YB Batched Nested Loop Join
      Join Filter: (order_products.order_id = orders.id)
      Rows: 1519 Loops: 3

At this point I could see it was a better plan, but I only later learned that PostgreSQL was now transforming the EXISTS subquery into a Semi Join internally.

Because I kept the correlated predicate within the subquery, the query technically remained correlated. However, this structural change enabled PostgreSQL’s optimizer to recognize the EXISTS condition as eligible for a Semi Join, performing the decorrelation internally.

SELECT p.product_id, p.product_name
FROM products p
WHERE EXISTS (
    SELECT 1
    FROM order_products op
    JOIN orders o ON o.id = op.order_id
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE op.product_id = p.product_id
      AND (o.international = FALSE OR o.sales_rep = 987654)
      AND o.cancelled = FALSE
      AND c.region IN (987654, 12345, 67890, 112233, 445566)
)
ORDER BY LOWER(p.product_name), p.product_id
LIMIT 100;

This improved structure allowed the database to filter rows earlier and make more efficient use of indexes.
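For the batched nested loops to stay cheap, the inner tables need indexes matching the join keys. A sketch of a definition consistent with the product_order_index name and Index Cond lines in the plans; the actual column list in the original schema is my assumption:

```sql
-- Assumed definition, inferred from the plan output: the Index Only Scan on
-- order_products filters by product_id and returns order_id, so a composite
-- index covering both columns would make that access path possible.
CREATE INDEX product_order_index ON order_products (product_id, order_id);
```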

Why This Adjustment Worked Better

PostgreSQL’s optimizer distinguishes between executing EXISTS subqueries as SubPlans (repeated execution per outer row) and Semi Joins (executed efficiently as joins). Moving the correlation condition into the WHERE clause allowed PostgreSQL to pick a Semi Join.
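PostgreSQL applies the same Semi Join transformation to the equivalent IN form, which is uncorrelated by construction. A sketch of that alternative rewrite (not the rewrite used in this post):

```sql
-- Equivalent uncorrelated form: PostgreSQL plans IN (subquery) as a
-- Semi Join as well, since a product matches at most once.
SELECT p.product_id, p.product_name
FROM products p
WHERE p.product_id IN (
    SELECT op.product_id
    FROM order_products op
    JOIN orders o ON o.id = op.order_id
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE (o.international = FALSE OR o.sales_rep = 987654)
      AND o.cancelled = FALSE
      AND c.region IN (987654, 12345, 67890, 112233, 445566)
)
ORDER BY LOWER(p.product_name), p.product_id
LIMIT 100;
```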

Taking a closer look at the faster plan (shown below), the Semi Join significantly reduced overhead because it allows the query to stop probing the inner side as soon as a matching row is found (unlike a Hash Join, Hash Aggregate, Sort, etc., which must process all rows).

Interestingly (and knowledge I did not have at the time) PostgreSQL seems to specifically check the first table listed in the subquery to determine if a Semi Join can be used when ANSI JOIN syntax is involved.

What Else Would Have Worked? Pre-ANSI vs ANSI Syntax

While experimenting, I noticed that pre-ANSI syntax (comma-based joins) gives PostgreSQL’s query planner more freedom to collapse joins and choose its own join order.

This means pre-ANSI syntax could have produced the same improvement (a Semi Join) in the EXISTS subquery case regardless of table order, whereas ANSI JOIN syntax reduces the planner’s flexibility to reorder. Interestingly, even if a developer writing pre-ANSI syntax had chosen the “wrong” outer table in the subquery, it would still have been optimized into a Semi Join (in exchange for a little more planning time).
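For illustration, here is the optimized subquery expressed in pre-ANSI (comma) syntax, with all join conditions moved into the WHERE clause. This is a sketch of the form described above, not a rewrite the post benchmarked:

```sql
-- Comma-based joins: the planner is free to pick the join order itself,
-- so the EXISTS subquery qualifies for a Semi Join regardless of the
-- order the tables are listed in.
SELECT p.product_id, p.product_name
FROM products p
WHERE EXISTS (
    SELECT 1
    FROM order_products op, orders o, customers c
    WHERE op.product_id = p.product_id
      AND o.id = op.order_id
      AND c.customer_id = o.customer_id
      AND (o.international = FALSE OR o.sales_rep = 987654)
      AND o.cancelled = FALSE
      AND c.region IN (987654, 12345, 67890, 112233, 445566)
)
ORDER BY LOWER(p.product_name), p.product_id
LIMIT 100;
```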

Performance Analysis of the Optimized Query Through EXPLAIN (ANALYZE)

The optimized query’s execution plan clearly shows the use of a Semi Join.

Limit
  ->  YB Batched Nested Loop Semi Join
        Join Filter: (products.id = order_products.product_id)
        Rows: 100 Loops: 1
        ->  Index Only Scan using product_index on products
              Rows: 2186 Loops: 1
        ->  YB Batched Nested Loop Join
              Join Filter: (orders.customer_id = customers.customer_id)
              Rows: 100 Loops: 3
              ->  YB Batched Nested Loop Join
                    Join Filter: (order_products.order_id = orders.id)
                    Rows: 1519 Loops: 3
                    ->  Index Only Scan using product_order_index on order_products
                          Index Cond: (product_id = ANY (ARRAY[products.id, $1.., $1023]))
                          Rows: 2219 Loops: 3
                    ->  Index Scan using orders_pkey on orders
                          Index Cond: (id = ANY (ARRAY[order_products.order_id... $2047]))
                          Filter: ((NOT cancelled) 
                          AND ((NOT international) OR (sales_rep = 2698000)))
                          Rows: 489 Loops: 9
              ->  Index Scan using customers_pkey on customers
                    Index Cond: (customer_id = ANY (ARRAY[orders.customer_id… $3071]))
                    Filter: (region = ANY ('{2698000,220675,15680519,10211817,8766959}'))
                    Rows: 35 Loops: 6

Conclusion

I initially approached optimization primarily with indexing in mind. However, this experience highlighted the importance of query restructuring to allow PostgreSQL to choose efficient execution strategies.

Although this rewrite didn’t eliminate the correlation, it encouraged PostgreSQL’s optimizer to internally decorrelate using a Semi Join strategy.

Understanding PostgreSQL’s EXISTS handling, particularly Semi Join recognition versus SubPlan execution, is something I now see as invaluable when optimizing queries involving correlated subqueries.

Mark Peacock
