Navigating IO Issues in Cloud Infrastructure: Insights and Solutions
As an open-source, cloud native database, YugabyteDB is frequently run in cloud environments. Despite the many benefits, running in the cloud is not without issues and risks. In this blog post, we will share our experience with IO issues, our analysis of those issues, and how we resolved them.
Note that these IO issues were caused by the cloud infrastructure, mostly out of sight of the cloud virtual machine, again highlighting that a move to the cloud is not risk-free, and that your issues may not be immediately found and fixed by your cloud provider.
This story started when I was asked to look into a high latency IO issue and see if I could reproduce it. This sounded simple. It should be easy to reproduce, because all public clouds impose IO limits at both the host level and the disk level. So, once you know and understand these limits, all you need to do to increase IO latency is oversubscribe them. These limits are generally expressed as IOPS (disk IOs per second) and MBPS (bandwidth), whichever limit is reached first.
To replicate the issue accurately, I created an environment with the same cloud machine size and disk size as the one experiencing high IO latencies. I set up my load generator and configured it to generate more YugabyteDB requests than the cloud machines could handle. If you oversubscribe on IO limits, the requests will be throttled by “the cloud.” These concurrently active requests become visible as multiple IO requests in flight, also known as IO queueing, or “queue depth”. Voila – easy!
But, it turned out this was not the problem. Yes, there was queue depth with the original high latency issue (which is what triggered the cloud provider's first line support response), but it didn't seem to be caused by IO pressure. This looked weird to me. I needed more information, and set out to gather it with a helpful colleague.
It turns out that the issue was triggered by the client noticing higher (or too high) latencies in the application using the YugabyteDB database. These higher latencies were filed as an issue with Yugabyte support, and our support team investigated the issue. They found messages in the logfile of our tablet server, which is the principal distributed database layer of YugabyteDB. These messages were warnings saying:
W0405 13:36:25.756516 23075 log.cc:933] Time spent Fsync log took a long time: real 3.647s user 0.000s sys 0.000s
It seemed that we were not doing a very high amount of IOs, so our support team logged a ticket with the cloud provider. The cloud provider’s first line of support looked at their statistics and found “queue depth”. This generally means too many IO requests (IOPS) or too much bandwidth is being used (MBPS). The cloud provider therefore bounced back the request saying to “fix the application”. This was when I realized that we needed to drill deeper into the issue.
At the same time, various YugabyteDB experts were concluding that increased latencies simply come with the disk tier in use, because clouds share a lot of infrastructure between all the virtual machines they run. During a conversation with senior infrastructure support people from the same cloud vendor, they mentioned that since we were not using the highest tier disks, there could be some incidental high (or higher) IO latency outliers.
So, we collected more data about what was actually going on. The logfile of the YugabyteDB tablet server (discussed above) was crucial to identifying the eventual issue. After carefully investigating the logs, we discovered sporadic occurrences of the fsync message (see above), on a single machine within a cluster at a time. These messages persisted for approximately one minute before disappearing, after which the problem resolved itself.

By investigating the fsync messages of multiple individual incidents, we found that when the problem occurred, it resulted in higher latency for the client application, and the warning messages appeared during a timespan of approximately one minute. The latencies in the fsync messages consistently fell in a similar range, from around 600 milliseconds to 60 seconds, across the separate incidents. The reported fsync latencies of over a second significantly exceeded the latencies our cloud provider told us could occur. And having groups of 20 to 80 messages was not accidental; it pointed to a structural issue during that roughly one-minute period.
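Quantifying these incidents means pulling the wall-clock ("real") latency out of each warning. A minimal sketch of such a log scan follows; the regular expression is based on the warning format shown above, and the helper name is mine:

```python
import re

# Matches the glog-style tablet server warning shown earlier and captures
# the wall-clock ("real") duration of the slow fsync.
FSYNC_WARNING = re.compile(
    r"Time spent Fsync log took a long time: real (?P<real>[\d.]+)s"
)

def fsync_latency_seconds(line: str):
    """Return the 'real' fsync latency in seconds, or None for other lines."""
    m = FSYNC_WARNING.search(line)
    return float(m.group("real")) if m else None

sample = ("W0405 13:36:25.756516 23075 log.cc:933] Time spent Fsync log "
          "took a long time: real 3.647s user 0.000s sys 0.000s")
print(fsync_latency_seconds(sample))  # 3.647
```

Feeding every tablet server log line through a helper like this, and bucketing the matches by timestamp, is enough to see the one-minute clusters and the 600 ms to 60 s latency range described above.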
Next, we collected the actual Prometheus data from the affected machine during the specific timeframe. Our YugabyteDB management suite leverages Prometheus as it offers consistent historical statistics across various cloud and on-premises installations. Our Prometheus installation retrieves statistics every 15 seconds, providing a more detailed level of granularity compared to sar/sysstat (10 minutes) and most cloud-based statistics gathering options (typically at one-minute intervals).
The statistics that Prometheus fetches come from the node exporter for the operating system data, and the operating system data it exposes are statistics from the Linux /proc filesystem. Many of these statistics are counters. We looked at the statistics to try and understand the IOPS and MBPS figures. To calculate IOPS and MBPS figures from counters, you have to obtain the counter values at two points in time. You then subtract the values to get the difference over that interval, and divide by the elapsed time to get the average IOPS or MBPS figure.

The reason for explicitly mentioning this is that, for the interval over which an IOPS or MBPS figure is calculated, whenever the number of requests is higher than one, the figure you get is an average, not a figure for any single request. This is important to realize if you build a theory on Prometheus data.
However, by default, we fetch statistics with a reasonably high frequency: 15 seconds. This should at least give us an indication of whether there was any IO pressure from the YugabyteDB database—even with the caveats mentioned before. By looking at the actual machine IOPS and MBPS figures and comparing them with the machine and individual disk limits, we found we were way below any of the MBPS and IOPS limits.
Armed with the fsync logging from YugabyteDB, plus detailed statistics from Prometheus for the same machine, we presented our case to the cloud vendor's first line support again. We informed them about the incidents occurring on single machines/virtual machines, where the database experienced latencies ranging from half a second to one minute, for approximately one minute. Additionally, the operating system data indicated that we were far from reaching the IO limits.
Their initial response was that while true, there was still the queue depth, which according to the protocol of the cloud provider meant that too much IO was submitted.
This is where it gets interesting!
Based on the data, we were not running into any of those limits at all. So, my hypothesis at this point was that outside of the Linux virtual machine, IO requests were being held (or blocked) and only serviced at a very low rate. This means a lot of requests piled up, despite IO not being submitted at a high rate. Once the issue (for reasons unknown) resolved, all these blocked requests suddenly ran, and therefore queued.
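The hypothesis can be illustrated with a toy model (entirely made-up numbers): IOs are submitted at a steady, modest rate, completions stop for a stall window, and the number of requests in flight at release time is huge even though the submit rate never came near any limit:

```python
# Toy model of the hypothesis: IOs submitted at a modest, steady rate are
# blocked by the infrastructure for a stall window, then all complete at
# once. The in-flight count (queue depth) spikes, even though the submit
# rate alone would never raise an eyebrow.
SUBMIT_RATE = 50                   # IOs submitted per second (well under limits)
STALL_START, STALL_END = 10, 70    # seconds during which completions stop

in_flight = 0
max_queue_depth = 0
for second in range(120):
    in_flight += SUBMIT_RATE
    if not (STALL_START <= second < STALL_END):
        in_flight = 0              # storage healthy: everything completes promptly
    max_queue_depth = max(max_queue_depth, in_flight)

print(max_queue_depth)  # 3000: a huge queue despite a low, steady submit rate
```

This is why the cloud provider's telemetry legitimately showed "queue depth" while our own figures showed no IO pressure: both observations were true at the same time.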
This was a theory, but also the only logical explanation I could think of that fit the actual figures. As a result, alongside some escalations, we moved to the higher support tiers.
First, we explained our case again with the new support team. They stated that they knew of no such issues, but they started discussing and looking into it with us. It is admittedly a difficult position to be in: identifying issues that you cannot see!
Quite logically, to understand and validate our findings, the cloud vendor’s support team wanted to have thread stack dumps of the actual issue (via /proc/sysrq-trigger). However, this was a significant challenge since the issues occurred on random machines, for periods too short to react to. Not to mention, this was a client-owned environment, so either the client had to perform the specialist task of generating stacks, or ask us to do it. Long story short: this was practically impossible to execute.
Then I remembered that Tanel Poder had created a light-weight, always-on suite of tools called 0xtools. This suite contains a tool called xcapture. What xcapture does by default is read the /proc filesystem, gather some information from every active task, and write that to a file.
Linux works based on ‘tasks,’ which can be either processes or threads. The YugabyteDB database tablet server is a threaded application, but given that xcapture works for tasks, it can also capture the state of threads. It captures multiple interesting specifics of a task, including a kernel stack backtrace.
Using a kernel stack backtrace, it's possible to see what function a task is executing, provided it's in system (kernel) mode. fsync is a system call for doing IO, so we should be able to detect tasks waiting in fsync. That is not all: because it shows a stack, it also lets us see the actual kernel function a task is currently running or waiting in.

Therefore, if we inspect the Linux kernel source code, we can determine whether that function is actively doing something, or just sitting there waiting for the IO subsystem to respond. If it's waiting for a response, that would prove my hypothesis that we were waiting due to a phenomenon outside the cloud virtual machine.
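Stripped down, the core idea looks something like this. This is a minimal sketch, not xcapture itself: reading /proc/&lt;pid&gt;/task/&lt;tid&gt;/stack generally requires root, and the helper name is mine:

```python
import os

def kernel_stacks(pid: int) -> dict:
    """Return {tid: kernel stack text} for every task (thread) of a process."""
    stacks = {}
    task_dir = f"/proc/{pid}/task"
    if not os.path.isdir(task_dir):
        return stacks                      # not Linux, or process already gone
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stack") as f:   # root-only file
                stacks[int(tid)] = f.read()
        except (PermissionError, FileNotFoundError):
            pass                           # task exited, or we lack privileges
    return stacks

# A thread blocked in fsync would show write-back path functions in its
# kernel stack; sampling all tasks once per second is enough to catch a
# one-minute stall on whichever machine it strikes.
print(len(kernel_stacks(os.getpid())))
```

Because it walks tasks rather than processes, a sampler like this sees each thread of a threaded application such as the YugabyteDB tablet server individually.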
To achieve this, we needed to install 0xtools on each system and run the xcapture process on every machine, so we could catch a virtual machine while it was malfunctioning for that minute. But this was production. So we first had to run xcapture on test systems running a cluster, to measure any performance impact and make sure it didn't cause issues. Luckily, we didn't encounter any issues when running xcapture on the test systems, or later in production.
It should be acknowledged that the cloud vendor support team saw the value in this approach, agreeing that this would prove the problem was outside of the virtual machine and that the problem exists in the cloud infrastructure. Additionally, the cloud vendor support team wanted iostat to be run at a high rate to see IO specifics at high granularity. Once this was run in a test environment, where the issue was not seen, these tools were used in production. We just had to wait for the problem to reoccur.
Given that no other changes were made besides the introduction of the new tools, it was only a matter of time before the problem reoccurred. And indeed, it did. As a result, everyone was eager to look at the results!
The first thing we found was that the issue was not unique to fsync, which waits for writes to finish. It was also happening to reads that needed to perform physical IO. The reason we hadn't spotted this is that we don't keep outlier timing for reads, so a read will not generate a warning message even if it waits for more than 50 milliseconds (the warning threshold for fsync).

We then looked at the specific functions inside the read and write (fsync) kernel stacks. These showed clearly that the tasks were at the point where a read or write waits for a storage response. That proved the problem lay outside the virtual machine; it lay somewhere inside the cloud infrastructure.
The one-second iostat results also upheld my hypothesis that when the problem resolved itself, all the IOs that had stalled were released, providing a single huge surge of requests which caused the queue depth to go high. That was what had led to the initial first line support request being bounced as an application issue.
Our evidence, demonstrating that the problem lay outside the virtual machine, compelled the cloud vendor to scrutinize the components serving the problematic machine during the incidents. Although they did not disclose any details, the vendor acknowledged identifying an unpatched storage unit and another issue, for which a fix was implemented. They also discovered issues on other virtual machines that needed resolution. None of this would have happened had we not demonstrated that the issues did not originate on our side, which compelled them to closely examine their own machines.
The takeaway is crystal clear: by diligently examining the issue and gathering accurate, relevant data, we were able to prove that there were no issues with our database or the virtual machine, and to identify and resolve the problem our customer was facing. This serves as a reminder to consistently monitor your resources, even in the cloud: complacency can lead to the false assumption that everything is under control and no issues can arise.