An introduction to yb_stats
To fully understand your database, it’s important to have runtime information available in a single location. For monolithic databases, this information is available on the database server; however, the classic divide remains between the filesystem information (typically the logfile) and the in-database tables and views.
With distributed databases, like YugabyteDB, the database typically runs on multiple machines. If containers are used, the different processes run within their own containers/pods. Therefore, needed information is scattered across different machines/containers/pods.
A lot of the information about a single YugabyteDB daemon, like a tablet server or a master server, is provided through its web UI. For more cluster-wide management we provide YugabyteDB Anywhere, which is typically used for deployment, creating backups, etc. However, to get the “raw facts” from a YugabyteDB cluster into a single location, you can use the yb_stats tool.
yb_stats is used to obtain YugabyteDB cluster status which can be used for troubleshooting, ad hoc analysis, and support. It can gather all needed facts from every YugabyteDB cluster component and store them in a “snapshot”. A snapshot here represents what has been gathered from a YugabyteDB cluster at a single point in time. The facts only contain metadata—not any actual user data.
A yb_stats snapshot contains a great deal of information, including:
- All masters' and tablet servers' performance counters and gauges.
- All masters' and tablet servers' performance "histograms", including percentile values and the count and sum of occurrences of each histogram's subject.
- YSQL level performance counters.
- YSQL level statement performance data, which is a full overview of pg_stat_statements and their statistics.
- YCQL level performance counters and gauges.
- All masters' and tablet servers' gflags.
- All masters' and tablet servers' mem-trackers page information.
- All masters' and tablet servers' log output (last 1M).
- All masters' and tablet servers' versions.
- All masters' and tablet servers' thread backtraces and runtime statistics.
- All masters', tablet servers', YSQL, and YCQL RPCs (network connections).
- All masters' dump of entities (DocDB databases, objects, tablets, and replicas).
- All masters' detailed status.
- All masters' and tablet servers' dump of the /memz endpoint.
- All masters' and tablet servers' dump of the /pprof/growth endpoint.
- All node_exporter counters and gauges.
yb_stats has two goals. First, it fetches all the available information that can be obtained via HTTP in one go. Second, it eliminates the endless cycle of investigation, more data collection, investigation, more data collection, etc.
yb_stats stores the cluster's data, in its entirety, in a single place as CSV files. This allows you to investigate the files manually or load the data into a database, and it allows security officers to review what has been collected. The most convenient way to use the data is the yb_stats utility itself, which can read the CSV data and, using filters, show the specific data needed for an investigation.
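Because a snapshot is plain CSV, loading it into a database for ad-hoc querying is straightforward. A minimal Python sketch using an in-memory SQLite database; note that the file contents and column layout below are made up for illustration and are not the actual yb_stats on-disk format:

```python
# Sketch: load a yb_stats-style snapshot CSV into SQLite for ad-hoc querying.
# The columns and values here are hypothetical, not the real yb_stats layout.
import csv
import io
import sqlite3

# Stand-in for a snapshot CSV file (in practice: open the file from the
# snapshot directory instead of this io.StringIO).
sample_csv = io.StringIO(
    "hostname_port,metric_type,metric_name,value\n"
    "192.168.66.80:12000,server,cpu_stime,730\n"
    "192.168.66.80:12000,server,voluntary_context_switches,38618\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (hostname_port TEXT, metric_type TEXT,"
    " metric_name TEXT, value INTEGER)"
)
# csv.DictReader yields one dict per row, matching the named placeholders.
conn.executemany(
    "INSERT INTO metrics VALUES (:hostname_port, :metric_type, :metric_name, :value)",
    csv.DictReader(sample_csv),
)

# Filter the data, much like yb_stats does with its own filter switches.
rows = conn.execute(
    "SELECT metric_name, value FROM metrics WHERE metric_name LIKE 'cpu%'"
).fetchall()
print(rows)  # [('cpu_stime', 730)]
```

The same approach scales to any database that can ingest CSV; SQLite is used here only because it needs no setup.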
The normal cycle of use is:
- Choose a node to run yb_stats snapshots from. The node must be able to reach all of the cluster's HTTP endpoints; this is typically a management node or the first server in a YugabyteDB cluster. You also need to be able to log on to that node via SSH.
- Install yb_stats via yum from the RPM package for CentOS, Alma, or any other Red Hat-compatible clone, version 7 or 8. There is also a Homebrew tap available for macOS, or you can build from source.
- Perform the first yb_stats execution, specifying the hostnames or IP addresses and, optionally, the port numbers (if these have been changed from the YugabyteDB defaults). yb_stats then writes a .env file in the current working directory that stores the hostnames or IP addresses, ports, and parallelism for subsequent runs.
- Invoke yb_stats without any arguments for ad-hoc performance query mode (which doesn't store anything), or invoke yb_stats with the --snapshot switch (and optionally the --snapshot-comment switch) to perform a full snapshot that stores all data in CSV files.
When yb_stats is used with the --snapshot switch, it stores the data but produces very little output. Its sole focus is to store the data:
```
% yb_stats --snapshot
snapshot number 6
```
However, if you do not specify --snapshot, you get:
```
% yb_stats
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
```
This tells you a begin snapshot was created (in memory). Press enter to create an end snapshot and calculate and display the difference between the begin and end snapshot, per server. If you do so, you will get:
```
Time between snapshots:  121.299 seconds
192.168.66.80:12000 server   cpu_stime                                730 ms            6.019 /s
192.168.66.80:12000 server   cpu_utime                                442 ms            3.644 /s
192.168.66.80:12000 server   involuntary_context_switches               2 csws          0.016 /s
192.168.66.80:12000 server   server_uptime_ms                      121288 ms          999.992 /s
192.168.66.80:12000 server   threads_started                            4 threads       0.033 /s
192.168.66.80:12000 server   threads_started_thread_pool                4 threads       0.033 /s
192.168.66.80:12000 server   voluntary_context_switches             38618 csws        318.397 /s
192.168.66.80:7000  server   cpu_stime                                115 ms            0.948 /s
...much data...
192.168.66.82:9300  counter  node_vmstat_pgpgout                                       138.000000    1.140 /s
192.168.66.82:9300  counter  node_xfs_block_mapping_extent_list_insertions_total_sdb1    1.000000    0.008 /s
192.168.66.82:9300  counter  node_xfs_block_mapping_extent_list_lookups_total_sdb1       5.000000    0.041 /s
192.168.66.82:9300  counter  node_xfs_block_mapping_reads_total_sdb1                     2.000000    0.017 /s
192.168.66.82:9300  counter  node_xfs_block_mapping_writes_total_sdb1                    1.000000    0.008 /s
192.168.66.82:9300  counter  node_xfs_read_calls_total_sdb1                              3.000000    0.025 /s
```
These are the yb_stats statistics: the counter-based and counter-plus-sum-based statistics per server, together with the node_exporter counter-based statistics, for all specified nodes.
The ad-hoc mode above doesn't store anything. It's typically used during a testing cycle in a controlled environment, where storing every begin and end situation would produce too much data. In most other situations, and especially at client sites, it's important to capture the entire picture. You also don't want to fetch data over and over, especially if that requires asking someone else to do it. In such situations, storing a snapshot gets you all the available information, so you no longer need the cluster to be available to look things up, and other people who don't have access to the cluster can examine the data as well. All the data in a snapshot is persistent: it does not change.
Once two or more snapshots are taken, a difference overview can be obtained using the yb_stats --snapshot-diff switch.
```
% ./target/release/yb_stats --snapshot-diff
0 2022-11-06 14:57:19.801329 +01:00
1 2022-11-06 15:00:08.100975 +01:00
2 2022-11-06 15:02:09.157553 +01:00
3 2022-11-06 17:00:14.007897 +01:00
4 2022-11-07 22:17:35.932471 +01:00
5 2022-11-08 14:51:56.669687 +01:00
6 2022-11-08 14:54:42.636357 +01:00
7 2022-11-08 15:05:27.319409 +01:00
Enter begin snapshot: 6
Enter end snapshot: 7
192.168.66.80:12000 server   cpu_stime                               4236 ms            6.573 /s
192.168.66.80:12000 server   cpu_utime                               2277 ms            3.533 /s
192.168.66.80:12000 server   glog_info_messages                        95 msgs          0.147 /s
192.168.66.80:12000 server   involuntary_context_switches              38 csws          0.059 /s
192.168.66.80:12000 server   server_uptime_ms                      644483 ms          999.983 /s
192.168.66.80:12000 server   threads_started                           32 threads       0.050 /s
192.168.66.80:12000 server   threads_started_thread_pool               32 threads       0.050 /s
192.168.66.80:12000 server   voluntary_context_switches            207444 csws        321.871 /s
192.168.66.80:7000  server   cpu_stime                                698 ms            1.083 /s
...
192.168.66.82:9300  counter  node_xfs_vnode_reclaim_total_sdb1     12.000000            0.019 /s
192.168.66.82:9300  counter  node_xfs_vnode_release_total_sdb1     12.000000            0.019 /s
192.168.66.82:9300  counter  node_xfs_vnode_remove_total_sdb1      12.000000            0.019 /s
192.168.66.82:9300  counter  node_xfs_write_calls_total_sdb1      244.000000            0.379 /s
```
NOTE: The difference between ad-hoc mode and snapshot-diff mode is that ad-hoc mode does not require stored files, while snapshot-diff works on the stored snapshot data.
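In both modes, the per-second figures come from the same simple calculation: for each counter, the delta between the end and begin snapshot, divided by the elapsed time. A minimal Python sketch with made-up counter values (this is an illustration of the idea, not the actual yb_stats implementation):

```python
# Sketch of the difference calculation: per counter, the delta between
# the end and begin snapshot, and that delta per elapsed second as a rate.
# Counter values below are invented for illustration.
begin = {"cpu_stime": 1000, "voluntary_context_switches": 100000}
end = {"cpu_stime": 1730, "voluntary_context_switches": 138618}
elapsed_seconds = 121.299  # time between the begin and end snapshot

diff = {}
for name in begin:
    delta = end[name] - begin[name]
    diff[name] = (delta, delta / elapsed_seconds)

for name, (delta, rate) in sorted(diff.items()):
    print(f"{name:30} {delta:10d} {rate:10.3f} /s")
```

Counters that did not change between the two snapshots are simply omitted from the yb_stats output, which keeps the overview readable.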
When using snapshots, additional data is stored for which a difference overview makes no sense, such as logs or RPC network connections. The complete list of snapshot data is given above.
For example, after upgrading a YugabyteDB cluster, you can use a snapshot to print the version information and validate that all cluster servers show the expected version:
```
% yb_stats --print-version 7
hostname_port      version_number  build_nr build_type build_timestamp          git_hash
192.168.66.82:9000 184.108.40.206  231      RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.82:7000 220.127.116.11  231      RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.81:9000 18.104.22.168   231      RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.81:7000 22.214.171.124  231      RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.80:9000 126.96.36.199   231      RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.80:7000 188.8.131.52    231      RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
```
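A consistency check like this can also be automated over the snapshot's version data. A small Python sketch; the (host, version) pairs below are illustrative, not real cluster data:

```python
# Sketch: verify that every server in a cluster reports the same version,
# which is what the version overview lets you check by eye.
# The (host, version) pairs are invented for illustration.
servers = [
    ("192.168.66.80:7000", "2.15.2.1"),
    ("192.168.66.80:9000", "2.15.2.1"),
    ("192.168.66.81:7000", "2.15.2.1"),
]

versions = {version for _, version in servers}
consistent = len(versions) == 1
if consistent:
    print(f"all {len(servers)} servers report version {next(iter(versions))}")
else:
    print(f"version mismatch across servers: {sorted(versions)}")
```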
Many more print options exist, such as for logs, masters, RPCs, gflags, etc. There are also options to reduce the output by filtering on hostname, statistic name, or table name; to add gauge-type statistics; and to enable details that split statistics out per table and tablet.
If you want to learn more about yb_stats: