An introduction to yb_stats

November 24, 2022

To fully understand your database, it’s important to have runtime information available in a single location. For monolithic databases, this information is available on the database server; however, the classic divide remains between the filesystem information (typically the logfile) and the in-database tables and views.

With distributed databases like YugabyteDB, the database typically runs on multiple machines. If containers are used, the different processes run in their own containers or pods. As a result, the information you need is scattered across different machines, containers, or pods.

A lot of the information about a single YugabyteDB daemon, like a tablet server or a master server, is provided through its web UI. For more cluster-wide management we provide YugabyteDB Anywhere, which is typically used for deployment, creating backups, etc. However, to get the “raw facts” from a YugabyteDB cluster into a single location, you can use the yb_stats tool.

What is yb_stats?

yb_stats is a tool for obtaining YugabyteDB cluster status information for troubleshooting, ad hoc analysis, and support. It can gather all needed facts from every YugabyteDB cluster component and store them in a “snapshot”. Here, a snapshot means everything gathered from a YugabyteDB cluster at a single point in time. The facts contain only metadata, not any actual user data.

A yb_stats snapshot contains a great deal of information, including:

All masters’ and tablet servers’ performance counters and gauges.
All masters’ and tablet servers’ performance “histograms”, including the different percentiles, the number of occurrences, and the total of the measured values for the histogram’s topic.
YSQL-level performance counters.
YSQL-level statement performance data: a full overview of pg_stat_statements and its statistics.
YCQL-level performance counters and gauges.
All masters’ and tablet servers’ gflags.
All masters’ and tablet servers’ mem-trackers page information.
All masters’ and tablet servers’ log output (the last 1 MB).
All masters’ and tablet servers’ versions.
All masters’ and tablet servers’ thread backtraces and runtime statistics.
RPC (network connection) information for all masters, tablet servers, YSQL, and YCQL.
A dump of the entities (DocDB databases, objects, tablets, and replicas) from all masters.
Detailed status of all masters.
A dump of the /memz endpoint of all masters and tablet servers.
A dump of the /pprof/growth endpoint of all masters and tablet servers.
All node_exporter counters and gauges.

yb_stats has two goals. First, it fetches all the available information that can be obtained via HTTP in one go. Second, it eliminates the endless cycle of investigation, more data collection, investigation, more data collection, etc.

yb_stats stores all of the cluster’s data in a single place as CSV files. This lets you investigate the files manually and load the data into a database. It also lets security officers inspect the collected data. The most convenient way to use the data is with the yb_stats utility itself. It can read the CSV data and use filters to show the specific data an investigation needs.
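The following minimal Python sketch illustrates loading snapshot data into a database. It uses only the standard library. The post does not describe the layout of the CSV files, so several details are hypothetical: the columns hostname_port, statistic_name, and value, and the SQLite table name stats. They may not match the files yb_stats actually writes.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content: the real yb_stats snapshot files may use
# different file names and columns.
SAMPLE_CSV = """hostname_port,statistic_name,value
192.168.66.80:12000,cpu_stime,730
192.168.66.80:12000,cpu_utime,442
192.168.66.80:7000,cpu_stime,115
"""


def load_csv_into_sqlite(csv_text: str) -> sqlite3.Connection:
    """Load CSV rows into an in-memory SQLite table named 'stats'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stats (hostname_port TEXT, statistic_name TEXT, value REAL)"
    )
    # Each DictReader row is a mapping, which matches the named placeholders.
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO stats VALUES (:hostname_port, :statistic_name, :value)",
        reader,
    )
    conn.commit()
    return conn


def total_per_statistic(conn: sqlite3.Connection) -> dict:
    """Sum each statistic over all servers."""
    rows = conn.execute(
        "SELECT statistic_name, SUM(value) FROM stats GROUP BY statistic_name"
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    connection = load_csv_into_sqlite(SAMPLE_CSV)
    print(total_per_statistic(connection))
```

Once the data is in SQLite, you can use any SQL query for ad hoc analysis. The aggregation above is only one example.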

How does that work?

The normal cycle of use is:

Choose a node to run yb_stats snapshots from. The node must be able to reach all of the cluster’s HTTP endpoints. This is typically a management node or the first server in a YugabyteDB cluster. You also need to be able to log on to that node via SSH.
Install yb_stats with yum, using the RPM package for CentOS, AlmaLinux, or any other Red Hat compatible clone, version 7 or 8. A Homebrew tap is also available for macOS, or you can build yb_stats from source.
Run yb_stats for the first time, specifying the hostnames or IP addresses of the endpoints. Optionally, also specify the port numbers, if these have been changed from the YugabyteDB defaults. yb_stats then writes a .env file to the current working directory that stores the hostnames or IP addresses, ports, and parallelism.
Invoke yb_stats without any arguments for ad-hoc performance query mode, which doesn’t store anything. Alternatively, invoke yb_stats with the --snapshot switch to perform a full snapshot that stores all data in CSV files. You can optionally add the --snapshot-comment switch.
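The snapshot node must reach every HTTP endpoint, so it can help to check connectivity before the first run. The following minimal Python sketch builds every host:port combination and tries a TCP connection to each one. The port list is an assumption. Ports 7000, 9000, 12000, and 9300 appear in the example output in this post, and 13000 is the YugabyteDB default for the YSQL web server. Adjust the list to your deployment. A successful TCP connection only shows that a port is open, not that the HTTP endpoint serves the expected data.

```python
import socket
from itertools import product

# Assumed ports: master (7000), tablet server (9000), YCQL (12000),
# YSQL (13000), and node_exporter (9300). Adjust to your deployment.
DEFAULT_PORTS = (7000, 9000, 12000, 13000, 9300)


def endpoints(hosts, ports=DEFAULT_PORTS):
    """Return every host:port combination the snapshot node must reach."""
    return [f"{host}:{port}" for host, port in product(hosts, ports)]


def unreachable(hosts, ports=DEFAULT_PORTS, timeout=2.0):
    """Return the host:port endpoints that refuse or time out a TCP connect."""
    failed = []
    for host, port in product(hosts, ports):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # connection succeeded; close it immediately
        except OSError:
            failed.append(f"{host}:{port}")
    return failed
```

For example, `unreachable(["192.168.66.80", "192.168.66.81", "192.168.66.82"])` lists the endpoints that could not be reached within the two-second timeout. An empty list means every port accepted a connection.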

And what do I get?

When yb_stats is invoked with the --snapshot switch, it stores the data but displays very little of it. Its sole focus is storing the data:

% yb_stats --snapshot
snapshot number 6

yb_stats ad-hoc mode

However, if you do not specify --snapshot, you get:

% yb_stats
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.

This tells you that yb_stats has created a begin snapshot, which is kept in memory. Press Enter to create an end snapshot. yb_stats then calculates and displays the difference between the begin and end snapshots, per server. The result looks like this:

Time between snapshots:  121.299 seconds
192.168.66.80:12000  server   cpu_stime                                                                          730 ms               6.019 /s
192.168.66.80:12000  server   cpu_utime                                                                          442 ms               3.644 /s
192.168.66.80:12000  server   involuntary_context_switches                                                         2 csws             0.016 /s
192.168.66.80:12000  server   server_uptime_ms                                                                121288 ms             999.992 /s
192.168.66.80:12000  server   threads_started                                                                      4 threads           0.033 /s
192.168.66.80:12000  server   threads_started_thread_pool                                                          4 threads           0.033 /s
192.168.66.80:12000  server   voluntary_context_switches                                                       38618 csws           318.397 /s
192.168.66.80:7000   server   cpu_stime                                                                          115 ms               0.948 /s
...much data...
192.168.66.82:9300   counter  node_vmstat_pgpgout                                                                138.000000           1.140 /s
192.168.66.82:9300   counter  node_xfs_block_mapping_extent_list_insertions_total_sdb1                             1.000000           0.008 /s
192.168.66.82:9300   counter  node_xfs_block_mapping_extent_list_lookups_total_sdb1                                5.000000           0.041 /s
192.168.66.82:9300   counter  node_xfs_block_mapping_reads_total_sdb1                                              2.000000           0.017 /s
192.168.66.82:9300   counter  node_xfs_block_mapping_writes_total_sdb1                                             1.000000           0.008 /s
192.168.66.82:9300   counter  node_xfs_read_calls_total_sdb1                                                       3.000000           0.025 /s

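This output contains all yb_stats statistics: the counter-based and the counter- and sum-based statistics per server, plus the node_exporter counter-based statistics for all specified nodes. Each line shows several fields:
The server endpoint (host:port).
The statistic category.
The statistic name.
The difference between the end and begin values, with its unit.
That difference per second.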
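The per-second column is the difference between the end and begin values divided by the elapsed time. The following minimal Python sketch shows that calculation over two in-memory snapshots. Each snapshot is modeled as a hypothetical mapping of (hostname_port, statistic_name) to value. This is not yb_stats’ own code.

The sketch takes the elapsed time as a parameter for a reason. Dividing the cpu_stime difference of 730 ms by the printed 121.299 seconds gives 6.018/s rather than the 6.019/s shown. However, the server_uptime_ms line (121288 ms at 999.992/s) implies that this server measured about 121.289 seconds, and 730 / 121.289 ≈ 6.019. This suggests that yb_stats uses each server’s own interval. Dropping unchanged statistics is a simplification of this sketch.

```python
from typing import Dict, Tuple

# Hypothetical in-memory snapshot: (hostname_port, statistic_name) -> value.
Snapshot = Dict[Tuple[str, str], float]


def diff_rates(begin: Snapshot, end: Snapshot, elapsed_s: float) -> dict:
    """Return {(host, statistic): (difference, difference per second)}.

    Statistics missing from the begin snapshot, or unchanged between the
    two snapshots, are left out (a choice made for this sketch).
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    result = {}
    for key, end_value in end.items():
        if key not in begin:
            continue
        difference = end_value - begin[key]
        if difference != 0:
            result[key] = (difference, difference / elapsed_s)
    return result


if __name__ == "__main__":
    # Hypothetical absolute cpu_stime values; only their difference (730 ms)
    # comes from the ad-hoc output above.
    key = ("192.168.66.80:12000", "cpu_stime")
    rates = diff_rates({key: 10_000}, {key: 10_730}, elapsed_s=121.289)
    difference, per_second = rates[key]
    print(f"{difference} ms  {per_second:.3f} /s")
```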

yb_stats snapshot-diff

The ad-hoc mode described above doesn’t store anything. It’s typically used during a testing cycle in a controlled environment, where storing the begin and end state of every test run would produce too much data. In most other situations, and especially with customers, it’s important to understand the entire picture. You also don’t want to keep fetching data, especially if that means asking someone else to do it. In such situations, a stored snapshot gives you all the available information, so you don’t need the cluster to be available to look something up. You might also want people who are not available at the time, or who don’t have access to the cluster, to look at the data. The data in a snapshot is persistent and does not change.

Once two or more snapshots have been taken, you can obtain a difference overview with the yb_stats --snapshot-diff switch. yb_stats lists the stored snapshots and asks for the begin and end snapshot numbers:

% ./target/release/yb_stats --snapshot-diff
  0 2022-11-06 14:57:19.801329 +01:00
  1 2022-11-06 15:00:08.100975 +01:00
  2 2022-11-06 15:02:09.157553 +01:00
  3 2022-11-06 17:00:14.007897 +01:00
  4 2022-11-07 22:17:35.932471 +01:00
  5 2022-11-08 14:51:56.669687 +01:00
  6 2022-11-08 14:54:42.636357 +01:00
  7 2022-11-08 15:05:27.319409 +01:00
Enter begin snapshot: 6
Enter end snapshot: 7
192.168.66.80:12000  server   cpu_stime                                                                         4236 ms               6.573 /s
192.168.66.80:12000  server   cpu_utime                                                                         2277 ms               3.533 /s
192.168.66.80:12000  server   glog_info_messages                                                                  95 msgs             0.147 /s
192.168.66.80:12000  server   involuntary_context_switches                                                        38 csws             0.059 /s
192.168.66.80:12000  server   server_uptime_ms                                                                644483 ms             999.983 /s
192.168.66.80:12000  server   threads_started                                                                     32 threads           0.050 /s
192.168.66.80:12000  server   threads_started_thread_pool                                                         32 threads           0.050 /s
192.168.66.80:12000  server   voluntary_context_switches                                                      207444 csws           321.871 /s
192.168.66.80:7000   server   cpu_stime                                                                          698 ms               1.083 /s
...
192.168.66.82:9300   counter  node_xfs_vnode_reclaim_total_sdb1                                                   12.000000           0.019 /s
192.168.66.82:9300   counter  node_xfs_vnode_release_total_sdb1                                                   12.000000           0.019 /s
192.168.66.82:9300   counter  node_xfs_vnode_remove_total_sdb1                                                    12.000000           0.019 /s
192.168.66.82:9300   counter  node_xfs_write_calls_total_sdb1                                                    244.000000           0.379 /s

NOTE: The difference between ad-hoc mode and snapshot-diff mode is that ad-hoc mode does not require stored files. It keeps its begin snapshot in memory.
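The timestamps in the snapshot list tell you which time window a snapshot-diff covers. The following minimal Python sketch computes the wall-clock time between two listed snapshots. It assumes the timestamp format shown in the listing and Python 3.7 or later, whose %z directive accepts offsets such as +01:00. For snapshots 6 and 7 it gives about 644.68 seconds. The per-second rates in the diff output may be based on slightly different per-server intervals.

```python
from datetime import datetime

# Format of the timestamps in the snapshot list, e.g.
# "2022-11-08 14:54:42.636357 +01:00" (%z accepts "+01:00" from Python 3.7).
SNAPSHOT_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f %z"


def seconds_between(begin: str, end: str) -> float:
    """Wall-clock seconds between two snapshot timestamps."""
    start = datetime.strptime(begin, SNAPSHOT_TS_FORMAT)
    stop = datetime.strptime(end, SNAPSHOT_TS_FORMAT)
    return (stop - start).total_seconds()


if __name__ == "__main__":
    # Snapshots 6 and 7 from the snapshot list in the snapshot-diff example.
    elapsed = seconds_between("2022-11-08 14:54:42.636357 +01:00",
                              "2022-11-08 15:05:27.319409 +01:00")
    print(f"{elapsed:.3f} seconds")
```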

Further benefits of yb_stats snapshots

A snapshot stores more data than the statistics shown in the difference overview, such as logs and RPC network connections. A difference overview doesn’t make sense for that data. The complete list of snapshot data is given in the “What is yb_stats?” section.

For example, after upgrading a YugabyteDB cluster, you can print the version information stored in a snapshot. This lets you validate that all components were upgraded successfully and that every cluster server reports the expected version:

% yb_stats --print-version 7
hostname_port        version_number  build_nr   build_type build_timestamp          git_hash
192.168.66.82:9000   2.15.3.0        231        RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.82:7000   2.15.3.0        231        RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.81:9000   2.15.3.0        231        RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.81:7000   2.15.3.0        231        RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.80:9000   2.15.3.0        231        RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.80:7000   2.15.3.0        231        RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
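Checking the versions by eye works for six servers, but a small script can do the comparison for larger clusters. The following minimal Python sketch parses output in the format shown above, assuming it was saved as text (for example, by redirecting it to a file). It uses only the first two whitespace-separated columns and reports every server whose version differs from the expected one. The function names are hypothetical.

```python
SAMPLE = """\
hostname_port        version_number  build_nr   build_type build_timestamp          git_hash
192.168.66.82:9000   2.15.3.0        231        RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
192.168.66.82:7000   2.15.3.0        231        RELEASE    22 Oct 2022 19:21:11 UTC 981ed35d6a47730ee663d8b14d541ba264dd3bc8
"""


def parse_versions(table: str) -> dict:
    """Map hostname_port to version_number, skipping the header line."""
    versions = {}
    for line in table.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2:
            versions[fields[0]] = fields[1]
    return versions


def version_mismatches(versions: dict, expected: str) -> dict:
    """Return only the servers that do not report the expected version."""
    return {host: v for host, v in versions.items() if v != expected}


if __name__ == "__main__":
    found = version_mismatches(parse_versions(SAMPLE), "2.15.3.0")
    print(found or "all servers report 2.15.3.0")
```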

Many more print options exist, for example for logs, masters, RPC ports, and gflags.

Learn more about yb_stats

There are many more options. You can reduce the output by filtering on hostname, statistic name, or table name, and you can add gauge-type statistics. You can also enable details to split the per-table and per-tablet statistics out into their individual tables and tablets. Many more --print options are available as well.
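The following minimal Python sketch illustrates filtering on hostname and statistic name. It keeps only the rows whose hostname and statistic name match regular expressions. Two details are assumptions for this sketch: treating the filters as regular expressions, and the parameter names hostname_match and stat_name_match. Check the yb_stats help output for the real options. The sample rows are taken from the snapshot-diff output above.

```python
import re

# (hostname_port, statistic_name, difference) rows from the snapshot-diff output.
ROWS = [
    ("192.168.66.80:12000", "cpu_stime", 4236),
    ("192.168.66.80:12000", "voluntary_context_switches", 207444),
    ("192.168.66.80:7000", "cpu_stime", 698),
    ("192.168.66.82:9300", "node_xfs_write_calls_total_sdb1", 244),
]


def filter_rows(rows, hostname_match=".*", stat_name_match=".*"):
    """Keep rows whose hostname and statistic name both match the patterns."""
    host_re = re.compile(hostname_match)
    stat_re = re.compile(stat_name_match)
    return [
        row for row in rows
        if host_re.search(row[0]) and stat_re.search(row[1])
    ]


if __name__ == "__main__":
    # Only CPU statistics of the servers on 192.168.66.80.
    for row in filter_rows(ROWS, r"^192\.168\.66\.80:", r"^cpu_"):
        print(row)
```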

If you want to learn more about yb_stats:

GitHub yb_stats repository
