The Curious Case of the Ever-Increasing Tombstones

Premkumar Thangamani

In storage systems based on LSM [1]/MVCC [2], when a record is deleted, it’s not immediately removed. A delete marker (also called a tombstone) is placed on the record, effectively making it invisible to the user. This is not necessarily a problem for point lookups. However, as the number of tombstones increases, it can adversely affect the performance of a scan. In this blog, we are going to understand this problem and come up with solutions to get around it.

Consider This Scenario

Let’s look at an events table, which stores event information with an event ID, the timestamp at which the event was generated, and some data associated with the event. To help with fetching the latest/earliest event, let’s add an index on the timestamp in ascending order.

CREATE TABLE events (
    id int,
    ts timestamp,
    data text, 
    PRIMARY KEY (id)
);
CREATE INDEX idx_ts ON events(ts ASC);

Data Generation

Let’s add a million events to this table.

-- Generate one event per minute, with the newest event falling on the current minute
INSERT INTO events(id, ts, data)
  SELECT n,
         now() - make_interval(mins => 1000000 - n, secs => EXTRACT(SECONDS FROM now())),
         'eventdata-' || n
  FROM generate_series(1, 1000000) n;

-- Sample data
 id |         ts          |     data
----+---------------------+--------------
  1 | 2021-02-25 08:16:00 | eventdata-1
  2 | 2021-02-25 08:17:00 | eventdata-2
  3 | 2021-02-25 08:18:00 | eventdata-3
  4 | 2021-02-25 08:19:00 | eventdata-4
  5 | 2021-02-25 08:20:00 | eventdata-5

Event Processing

Let’s process these events in the order they arrived and then delete them once they are processed. To find the earliest event, we would run a select query, process the fetched event, and then delete the event using the retrieved id.

SELECT * FROM events ORDER BY ts ASC LIMIT 1;

 id |         ts          |    data
----+---------------------+-------------
  1 | 2021-02-25 08:16:00 | eventdata-1
(1 row)

-- Perform the required processing for the event

-- Remove the event from our table
DELETE FROM events WHERE id=1;

Let’s take a look at the EXPLAIN ANALYZE output for the above select.

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit (actual time=1.818..1.820 rows=1 loops=1)
   Output: id, ts, data
   ->  Index Scan using idx_ts on public.events (actual time=1.808..1.808 rows=1 loops=1)
         Output: id, ts, data
 Planning Time: 0.449 ms
 Execution Time: 2.140 ms
 Peak Memory Usage: 0 kB
(7 rows)

The query used the index that we created earlier and processed just one row, as expected. The actual time taken to return that row is around 1.8ms, which is reasonable.
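
Putting those steps together, one processing iteration might look like the sketch below. The transaction wrapper and the <fetched id> placeholder are illustrative additions, not part of the statements above.

-- One processing iteration: fetch the earliest event, process it, then delete it
BEGIN;
SELECT id, ts, data FROM events ORDER BY ts ASC LIMIT 1;
-- ... application-side processing of the fetched event happens here ...
DELETE FROM events WHERE id = <fetched id>;
COMMIT;
-- Repeating this loop is what accumulates delete markers over time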

Houston, we have a problem

Events have been processing well, but over time, after many events have been processed, we notice that performance has degraded. (To simulate the passage of time, let’s delete the first 500K events, as if they had already been processed.)

DELETE FROM events WHERE id <= 500000; 
DELETE 500000
Time: 23807.481 ms (00:23.807)

Let’s try to fetch the earliest item.

SELECT * FROM events ORDER BY ts ASC LIMIT 1;
   id   |         ts          |       data
--------+---------------------+------------------
 500001 | 2022-02-07 13:36:00 | eventdata-500001
(1 row)

Time: 2016.854 ms (00:02.017)

Notice that the query now takes more than 2000ms. Previously, it took only 1.8ms. Let’s look at the EXPLAIN ANALYZE output again.

                             QUERY PLAN
----------------------------------------------------------------------------------
 Limit (actual time=2064.324..2064.326 rows=1 loops=1)
   Output: id, ts, data
   ->  Index Scan using idx_ts on public.events (actual time=2064.319..2064.319 rows=1 loops=1)
         Output: id, ts, data
 Planning Time: 0.217 ms
 Execution Time: 2064.389 ms
 Peak Memory Usage: 0 kB
(7 rows)

A lot of time was spent on the index scan to fetch just one row. Theoretically, this should be very fast. Let’s break down how the timings are reported. The actual time reported here has two parts: the first number is the time to return the first row of that plan node, and the second is the total time taken to retrieve all of its rows. Together, these two numbers tell us that the query executor spent a lot of time seeking to the first item with the least ts. Our index is ordered by ts ascending, so this should have been a quick seek, but it’s not. What is going on here?

To figure this out, we need a basic understanding of the underlying storage structure: the Log-Structured Merge (LSM) tree, which stores all data modifications in multiple files called SSTs (Sorted String Tables). Although the design details of LSMs are out of scope for this article, let’s do a very quick overview.

SST Overview

When we deleted the event rows, they were marked as deleted in the underlying SST. Below is a simplified illustration of the merged view of the underlying SSTs for clarity.

# our index is [ts ASC]
2021-02-25 08:16:00,1      <DEL>  <= Actual search start position
2021-02-25 08:17:00,2      <DEL>
2021-02-25 08:18:00,3      <DEL>
. . .
. . .
2022-02-07 13:35:00,500000 <DEL>
 
2022-02-07 13:36:00,500001        <= Preferred search start position
2022-02-07 13:37:00,500002

Now let’s take a look at the internal metrics [3].

 tablet-id       | table   | host           |   db_seek
-----------------+---------+----------------+-----------
 4a2a77e54052... | idx_ts  | 127.0.0.1:9000 |    500001

The number of seeks is 500,001, when technically it should have been just 1. When we look for the earliest item, which is currently 500001, the scan has to go through the first 500,000 items; since they were deleted, it skips over them one by one. The executor did not know the optimal place to start its search, because we are simply asking for the earliest timestamp. While in this particular test case the 500,000 lowest-timestamp events were deleted as a contiguous range, in most cases the deletes will not form a contiguous range (when ordered by timestamp), so the search still has to start at the earliest end of the range. The tombstones are adversely affecting our scan performance!

When a compaction runs, this problem disappears because the tombstones are removed. But compactions can be expensive, so by default they are background operations that are not run too aggressively. The other point to note is that only full compactions (which are really expensive) will remove the tombstones.
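
If we don’t want to wait for background compactions, we could trigger a full compaction of the table ourselves. A minimal sketch, assuming a local YugabyteDB cluster with the default master address and the yugabyte database (the exact yb-admin syntax may vary by release, so check the documentation for your version):

# Request a full compaction of the events table; tombstones that have fallen
# out of the history retention window are dropped as part of the compaction
yb-admin -master_addresses 127.0.0.1:7100 compact_table ysql.yugabyte events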

This problem is not specific to LSM implementations alone; it can also happen due to MVCC, where databases keep older versions of records for a certain duration to make concurrent changes possible. Also, if we do a batch deletion and then run a compaction with the intention of removing the tombstones, it may not have the desired effect, since MVCC will retain snapshots until the retention time has passed. For example, see the article “Postgres Job Queues & Failure By MVCC” [2], which also discusses this issue.

Workaround

Luckily, there is a simple trick to get around this problem. All we need to do is help the executor figure out an approximate point to start its search. For this, we can use the last seen ts (referred to below as <last_ts>). To get to event 500001, we can ask the executor to start its search just after the ts of event 500000. We need to modify our item selection query slightly:

-- last_ts = '2022-02-07 13:35:00' (ts of event 500,000)
SELECT * FROM events WHERE ts > <last_ts>  ORDER BY ts ASC LIMIT 1;

We can initialize <last_ts> to '-infinity' the very first time and, after each query, use the ts returned in the result as the new <last_ts>. This gives the executor the right starting position for the above query. We can also store <last_ts> in a separate table so that it can be read and updated by multiple workers.
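
Here is a minimal sketch of that multi-worker idea. The event_cursor table, its columns, and the worker name are illustrative, not part of the original schema, and the <...> placeholders would be filled in by the application.

-- Illustrative cursor table holding the last processed timestamp per worker
CREATE TABLE event_cursor (
    worker  text PRIMARY KEY,
    last_ts timestamp NOT NULL DEFAULT '-infinity'
);
INSERT INTO event_cursor(worker) VALUES ('worker-1');

-- Each iteration: read the cursor, fetch the next event after it,
-- then advance the cursor to the ts of the event we just processed
SELECT last_ts FROM event_cursor WHERE worker = 'worker-1';
SELECT * FROM events WHERE ts > <last_ts> ORDER BY ts ASC LIMIT 1;
UPDATE event_cursor SET last_ts = <ts of fetched event> WHERE worker = 'worker-1';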

Suppose the timestamp of the event we processed last was ‘2022-02-07 13:35:00’; we can pass that as a guiding input to retrieve the next event. Let’s look at the EXPLAIN ANALYZE output of this approach.

SELECT * FROM events WHERE ts > '2022-02-07 13:35:00'  ORDER BY ts ASC LIMIT 1;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit (actual time=1.647..1.650 rows=1 loops=1)
   Output: id, ts, data
   ->  Index Scan using idx_ts on public.events (actual time=1.642..1.642 rows=1 loops=1)
         Output: id, ts, data
         Index Cond: (events.ts > '2022-02-07 13:35:00'::timestamp without time zone)
 Planning Time: 0.202 ms
 Execution Time: 1.705 ms
 Peak Memory Usage: 8 kB
(8 rows)

Notice that the time taken to retrieve the earliest row is just 1.6ms, versus the 2064ms we saw earlier. Let’s look at the internal metrics again.

 tablet-id       | table   | host           |   db_seek
-----------------+---------+----------------+-----------
 4a2a77e54052... | idx_ts  | 127.0.0.1:9000 |         1

The number of seeks is now just 1! We’ve overcome the tombstone issue!

Conclusion

Scan performance degradation because of tombstones is a common problem in SST-based storage systems. If tombstones are not cleaned up regularly through compaction, they are bound to cause issues. Planned future work in YugabyteDB is to automatically make compactions more aggressive when many tombstones are encountered during scans. This will handle situations where it might not be feasible to modify the application, or where the delete tombstones are scattered rather than sitting at one end of the sorted range.

References:

  1. LSM – Log-structured merge-tree
  2. Postgres Job Queues & Failure By MVCC
  3. Reported metrics were fetched from the YugabyteDB cluster and processed using Franck Pachot’s ybwr.sql script.
