Optimizing YugabyteDB Memory Tuning for YSQL
Want to know how to optimize YugabyteDB memory tuning for YSQL? Read on!
Like most database software, YugabyteDB provides default memory configuration values. These defaults should be evaluated for specific use cases and workloads.
The three main buckets of memory utilization are:
- The tserver process
- The master process
- The postgres process
Not all nodes in the cluster have a master process.
This blog provides an overview of each of these buckets and offers best practice considerations for tuning memory with a YSQL workload.
The tserver process is configured by default to use up to 85% of available RAM.
Within this process, memory is divided among several components: the block cache, various read buffers, the memtables for the tablets, and the log (write-ahead log, or WAL) cache.
The master process is configured by default to use 10% of RAM, with similar components.
This leaves 5-15% available for YSQL and the operating system.
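To make the default split concrete, here is the arithmetic for a hypothetical 32 GB node (the node size is an assumption for illustration; the percentages are the defaults described above):

```shell
# Illustrative arithmetic only: default memory split on an assumed 32 GB node.
TOTAL_MB=$((32 * 1024))                              # 32768 MB total RAM
TSERVER_MB=$((TOTAL_MB * 85 / 100))                  # tserver default: 85%
MASTER_MB=$((TOTAL_MB * 10 / 100))                   # master default: 10%
REMAINDER_MB=$((TOTAL_MB - TSERVER_MB - MASTER_MB))  # ~5% left for postgres + OS
echo "tserver=${TSERVER_MB}MB master=${MASTER_MB}MB remainder=${REMAINDER_MB}MB"
```

On a node without a master process, the remainder grows to roughly 15%.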
If you are pre-sizing memory settings for a universe to test a YSQL workload, a reasonable starting point is to change the tserver default_memory_limit_to_ram_ratio to 0.5 (50% of total available RAM).
This rule of thumb gives the postgres processes more of the available RAM while leaving the tserver process a share in line with what is typically observed in production.
However, watch the tserver memory utilization in the root section of the Memory Breakdown (see below) and the logs to ensure 50% is sufficient.
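As a sketch, the rule of thumb above would be applied by passing the flag to the tserver process (invocation abbreviated; set the flag through whatever mechanism manages your universe's gflags):

```shell
# Sketch: cap tserver memory at 50% of RAM for a YSQL-heavy universe.
yb-tserver \
  --default_memory_limit_to_ram_ratio=0.5 \
  ...   # remaining flags unchanged
```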
If you have a running universe, check the Memory Breakdown for each tserver to find current and max utilized tserver memory and to tune correctly.
The Memory Breakdown can be found in the UI for each tserver under Utilities.
The memory components are:
- root (total allowed for the tserver process)
- BlockBasedTable (block cache)
- Read Buffer
- Tablets / memtables
- log cache (WAL)
Note: You may find that if the tserver process does not use the full memory, other processes (like postgres) can use the excess memory. To clarify, the ratio of memory configured is not pre-allocated to the tserver process. However, it is best to tune the configuration to avoid out-of-memory events due to competition for memory.
If you prefer to set the allowed tserver memory to a specific value, you can do this with memory_limit_hard_bytes, which will override the default_memory_limit_to_ram_ratio.
To avoid hitting either of these limits, backpressure is applied to new requests once memory utilization reaches the configured memory_limit_soft_percentage (by default 85% of allowed tserver memory), allowing running requests to complete.
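A sketch combining both flags (the 16 GiB value and the 80% soft threshold are assumed examples, not recommendations):

```shell
# Sketch: pin tserver memory to an absolute 16 GiB instead of a RAM ratio,
# and start applying backpressure at 80% of that limit.
yb-tserver \
  --memory_limit_hard_bytes=17179869184 \
  --memory_limit_soft_percentage=80 \
  ...
```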
The block cache is the largest user of tserver memory and is shown in the Memory Breakdown as BlockBasedTable. This is used to store hot/warm data in memory to optimize reads. The configuration (db_block_cache_size_percentage) is by default 50% and shared across all RocksDB instances running on the server.
If a specific value is desired, you can instead set db_block_cache_size_bytes. Consider increasing the block cache setting if you observe that the block cache miss rate is high via the Cache Hits & Misses graph.
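Either form of the block cache setting would look like this (the 60% and 8 GiB values are assumed examples):

```shell
# Sketch: raise the block cache share of tserver memory...
yb-tserver --db_block_cache_size_percentage=60 ...
# ...or pin it to an absolute size in bytes instead:
yb-tserver --db_block_cache_size_bytes=8589934592 ...
```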
Each tablet has a memstore, used to buffer writes before they are flushed to SST files on disk. Once a memstore reaches the configured memstore_size_mb (default 128 MB), a maintenance thread flushes it.
A memstore can grow beyond this configured value if it is being actively written to and/or the maintenance thread has not yet reached it. For a heavy write workload, consider tuning this higher, but be aware of the number of tablets, because this limit applies per tablet. Also weigh the effect of larger-but-less-frequent flushes versus smaller-but-more-frequent flushes on disk I/O capacity.
The total amount of memory used by the memstores is controlled via global_memstore_size_percentage (default 10% of tserver memory) with a hard limit of global_memstore_size_mb_max (default 2G). For a heavy write workload, consider tuning these higher.
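The memstore flags above could be raised together for a write-heavy workload, along these lines (all values are assumed examples):

```shell
# Sketch for a write-heavy workload:
# - memstore_size_mb: per-tablet flush threshold
# - global_memstore_size_percentage: total memstore share of tserver memory
# - global_memstore_size_mb_max: hard cap across all memstores
yb-tserver \
  --memstore_size_mb=256 \
  --global_memstore_size_percentage=15 \
  --global_memstore_size_mb_max=4096 \
  ...
```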
The log cache (log_cache_size_limit_mb) is set per tablet and is by default 128M. This cache holds the consensus entries that have not yet been replicated to followers.
The total upper limit is set via global_log_cache_size_limit_mb (default 1 GB). If the WAL Stats / Node graph shows you frequently reaching this limit, consider tuning it higher (and determine the cause of the replication latency between nodes).
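A sketch of raising both log cache limits (values are assumed examples):

```shell
# Sketch: raise the per-tablet and global log (WAL) cache limits
# to give follower replication more headroom.
yb-tserver \
  --log_cache_size_limit_mb=256 \
  --global_log_cache_size_limit_mb=2048 \
  ...
```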
A variety of read buffers share the memory allocated via read_buffer_memory_limit (default 5% of tserver memory). The normal read buffer is always used.
The compressed read buffer is used if you have RPC compression turned on. If you have TLS turned on, the encrypted read buffer is used.
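If the read buffers need more headroom, the limit could be raised along these lines (the 10% value is an assumed example; this sketch assumes the flag's convention that a negative value is interpreted as a percentage of tserver memory, while a positive value is an absolute byte size):

```shell
# Sketch: give the shared read buffers 10% of tserver memory instead of 5%.
yb-tserver --read_buffer_memory_limit=-10 ...
```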
Memory configuration is similar for the master process, and similar tuning notes apply. However, the master process does not have the large number of tablets that the tserver process does. This is reflected in the lower master default_memory_limit_to_ram_ratio of 10%.
Size your VM memory for tserver (and master) processes first, as these are critical for a performant distributed system. Increasing available memory for the postgres process may require upsizing the RAM on your VMs.
Start with a query analysis to ensure your queries are well-tuned and use resources wisely. It is also worth considering the following for memory usage by different workloads:
- The default value for max_connections is 300 (per node). Even when idle, connections take about 6MB of memory. A large catalog (more tables and therefore more tablets) also requires more memory overhead.
- For active connections, prepared statements and point queries consume relatively little memory.
- Aggregate queries take more memory than point queries because work_mem is allocated per sort operation (ORDER BY, DISTINCT, merge joins) or hash operation (hash joins, hash-based aggregation, result cache nodes, and hash-based processing of IN subqueries) within a query. The effective work_mem for hash operations alone can be increased via hash_mem_multiplier, which defaults to 1.
- If autocommit is turned on (it is on by default unless you issue transactional statements), the result set is built up completely in memory before being returned to the client.
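The connection-overhead numbers above can be turned into a quick budget estimate (illustrative arithmetic only, using the default max_connections and the ~6 MB-per-idle-connection figure from above):

```shell
# Illustrative arithmetic: idle-connection memory overhead per node
# at the default max_connections=300, assuming ~6 MB per idle connection.
CONNS=300
MB_PER_CONN=6
echo "idle connection overhead ~ $((CONNS * MB_PER_CONN)) MB per node"
```

This is a floor: active connections doing sorts, hashes, or large result sets use substantially more.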
Query operations that exceed configured work_mem will write to temporary files on disk.
The total size of all temporary files a query can use is configured via temp_file_limit, which is unlimited by default. If you set a limit, queries that exceed it will fail.
If you leave this as unlimited, consider turning on temp file logging via log_temp_files and monitoring disk utilization to ensure you will not exceed disk space.
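Both settings are PostgreSQL parameters, which in YugabyteDB can be passed through the tserver's ysql_pg_conf_csv flag, along these lines (the specific values are assumed examples; temp_file_limit and log_temp_files are both specified in kilobytes):

```shell
# Sketch: cap total temp file usage per query at 10 GB (10485760 kB)
# and log any temporary file larger than 100 MB (102400 kB).
yb-tserver \
  --ysql_pg_conf_csv="temp_file_limit=10485760,log_temp_files=102400" \
  ...
```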
Note: Hash aggregate (GROUP BY) operations can exceed the configured work_mem if the planner underestimates the number of groups. This is fixed in PostgreSQL 13+.
You can ensure maximum performance for your workload and avoid out-of-memory events by observing your memory utilization through YugabyteDB's various observability windows and by sizing your memory correctly for the tserver, master, and YSQL processes.
Download YugabyteDB for free today and drop a comment below if you have any further questions on how to optimize YugabyteDB memory tuning for YSQL.