The Evolution of Database Operations: From Traditional Databases to Distributed SQL
Marc Hansen, a systems engineer from Fiserv, recently sat down with Karthik Ranganathan, CTO and Co-Founder of Yugabyte, to discuss Marc’s role at this top global fintech and payments company.
During their conversation, they chatted about the realities of working with both traditional databases and DBaaS to support business goals and objectives. They also discussed how cloud technologies, PostgreSQL compatibility, and automation affect DBAs’ approach to disaster recovery, observability, and database deployment.
Read the excerpt below for the main discussion takeaways, or watch the full interview.
I work in systems engineering at Fiserv, a global fintech firm, as part of their Hardware Services division, where we handle the shipment of point-of-sale equipment and supplies to merchants. This includes paper and other necessary supplies to print receipts for customer transactions. My current role involves overseeing the operations and implementation of database-as-a-service (DBaaS), where I work with different providers to identify ways to use these services within Fiserv.
Q: Let’s begin by discussing your work with DBaaS. Could you share your perspective on what this entails and what it means to you? In other words, how do you define database-as-a-service, and why is it important?
Database-as-a-service (DBaaS) is a way of providing databases using a set of point-and-click operations, allowing business units or development teams to quickly and easily obtain the necessary databases to support their applications.
This is in contrast to traditional methods of building and deploying a new database, which can take months to complete. DBaaS can provide a database within as little as 10-15 minutes, which helps to accelerate the development process and ensures that teams can quickly obtain the resources they need to support their applications.
First, you had to find suitable hardware, which involved determining where to place it within the data center and ensuring adequate space, power, network, cooling capabilities, etc. Then you had to acquire the necessary software and licensing to install it. Finally, the network had to be configured to allow for access by the relevant business units, and it had to be built out to accommodate maximum predicted usage.
DBaaS abstracts many of these complexities. Cloud technology has been instrumental in getting us to this point, removing the need for businesses to build out their own data centers, further streamlining the process. Ultimately, DBaaS aims to eliminate the need to wait for a dedicated database team to build and deploy a database, allowing users to obtain the resources they need quickly and efficiently.
Q: One of the benefits of the cloud is that it can run on commodity hardware. How can operational efficiency be achieved with commodity hardware in the cloud despite its tendency to fail more often?
Automation is the answer to improving operational efficiency for commodity hardware. With the potential for increased failure rates, automating the recovery process is critical. I remember—way back in the day—whenever something happened to your laptop, someone came in with a floppy disk or a CD-ROM. No matter the problem with your hardware, the technician threw in a disk and rebooted your machine. When the reboot was done, your workstation was completely remastered. This concept can be applied to the realm of servers. Rather than constantly patching a system, it may be more efficient to simply (and automatically) rebuild it.
The value of distributed SQL lies in its ability to deploy databases quickly on commodity hardware and scale out to handle the same workloads as larger systems. The Oracle databases I’ve worked on were massive, running on AIX systems with anywhere from 64 to 128 CPUs. That is not commodity hardware. Commodity hardware is something simple, like a couple of PCs or a four-CPU machine.
Distributed computing can spread the load across multiple workstations. It solves fault tolerance and disaster recovery issues, which are a DBA’s nightmare. Distributed SQL also provides a solution to the issue of maintaining uptime while still being able to patch and upgrade systems.
Q: Do you anticipate an increase in the number of systems and databases to handle the loads as the number of microservices or applications increases?
The growth in the number of microservices or applications does result in increases across the board, including the number of databases and the load. With the widespread use of mobile devices, systems are expected to be up 24/7. Back in the day, everybody did their nine-to-five job. So what if a system went down at 7:00 PM? Nobody was there to use it. Now people are on their phones 24/7, so your system is expected to be up no matter the time of day because somebody could be using it. So, when our load is 24/7, this is where we need automation to help us achieve zero downtime.
Streamlining some of the more operational parts of a DBA’s job is also important. For example, I usually have more than one database to monitor. I could have 200. Looking at every system’s logs quickly becomes infeasible at that scale. Instead, a central point is needed to monitor multiple databases and identify the ones that require attention.
Q: How has your approach to observability changed in the world of distributed SQL compared to the traditional database world?
It really changes when you have to begin troubleshooting a problem your database is having. With distributed systems, logging into each node and running performance queries is no longer feasible. There are too many nodes. Instead, again, a central point is needed to monitor the overall health of the nodes and identify any unhealthy ones. From there, drilling down into a specific log may be necessary to investigate errors further. However, having a control point to oversee the entire system is critical for effective observability.
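As a rough illustration of the central monitoring point described above, the following Python sketch takes one status record per node and returns only the nodes that need attention. The NodeStatus fields, the threshold values, and the node names are hypothetical and chosen for illustration; they are not taken from YugabyteDB or any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """One status record per node, as a central monitor might collect it (hypothetical fields)."""
    name: str
    reachable: bool
    replication_lag_ms: float
    error_count: int


def unhealthy_nodes(statuses, max_lag_ms=1000.0, max_errors=0):
    """Return the sorted names of nodes that are down, lagging, or logging errors."""
    flagged = []
    for status in statuses:
        if (
            not status.reachable
            or status.replication_lag_ms > max_lag_ms
            or status.error_count > max_errors
        ):
            flagged.append(status.name)
    return sorted(flagged)
```

Only the nodes this function returns would then warrant drilling into individual logs.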
Q: Does the PostgreSQL compatibility of distributed SQL, specifically with YugabyteDB, simplify or complicate things for someone with Oracle experience?
It’s helpful. The compatibility makes it easier for DBAs and app dev teams who have worked with Postgres before to hit the ground running. It’s a small paradigm shift, so you can carry that knowledge forward and easily think of YugabyteDB as a Postgres database overall. You know what works well, and you also know what can hurt a Postgres database. You can apply that knowledge to YugabyteDB. This is especially helpful when management gives you a tight deadline, and you need to deliver as fast as possible with as little effort as possible. I’ve seen Oracle developers say, “Oh yeah, I’ve got this. I can go,” because Postgres and Oracle are similar enough. So it takes only a small amount of learning time to get up to speed.
Q: As a DBA and DBaaS professional, how do you handle Day 2 operations—tasks such as upgrades, security patching, and downtime notices—in the traditional world? How does it differ from distributed SQL?
In the traditional world, you need a window for upgrades or maintenance. Depending on the size of the application and the communication needs, you may have to notify your customer months in advance regarding the upgrade.
Then you could run into problems. You might have to postpone or cancel the start. So you are constantly working to determine your outage window.
With a distributed system, you can patch a cluster one node at a time. You can take one node down, bring it back up, and see if it remains healthy throughout the process. Then you move on to the next. You’re able to work with the application team to identify and handle noticeable errors and keep the application running while upgrading the system. A change window is still necessary, and people must be notified before the work is done, but it’s much less expensive. The app will continue to run, even if there are errors during this time, which is always preferable to having it completely shut down.
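The node-by-node procedure described above can be sketched as a simple loop. This is a minimal Python illustration, not a vendor procedure: the patch and is_healthy callables are hypothetical stand-ins for whatever tooling actually applies the patch and checks node health.

```python
def rolling_patch(nodes, patch, is_healthy):
    """Patch nodes one at a time, stopping at the first node that fails its health check.

    Returns (patched, failed): the nodes patched successfully, and the node that
    failed its health check (or None if every node came back healthy).
    """
    patched = []
    for node in nodes:
        patch(node)  # take the node down, apply the patch, bring it back up
        if not is_healthy(node):
            # Stop here so the remaining nodes keep serving the application.
            return patched, node
        patched.append(node)
    return patched, None
```

Stopping at the first unhealthy node leaves the rest of the cluster untouched, which is what keeps the application running while the problem is investigated.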
Q: Managing maintenance jobs can be challenging, especially with the increasing number of applications and the pressure to complete the task quickly. And often, those two goals struggle to co-exist.
True. With Oracle databases, quick patches or PSU updates could be done in as little as 15-30 minutes. However, major upgrades, RMAN replication, etc., take much longer. During this time, there is a risk that something could go wrong and impact the system, handing you a disaster recovery scenario that you now have to deal with.
Distributed systems, on the other hand, keep the application up and running throughout the maintenance process. As a result, you can work through any issues that arise without causing significant downtime. This is much more efficient than the old approach and helps avoid significant problems.
Oh yes. From a DBA’s perspective, I can—depending on the app—do something during business hours instead of having to do it at two o’clock in the morning. And that is always great.
Q: What are your thoughts on the future of distributed SQL? Where do you see things going, and do you have any specific requests?
Let’s revisit the scenario of patching. One suggestion would be to enhance the application’s ability to reconnect to a node dynamically in the event of an outage. Specifically, if one node goes down during a transaction, another node should be able to take over seamlessly. While some platforms claim to offer this capability, they need to prove that they can handle this kind of distribution throughout the entire stack.
It’s not simply a matter of presenting a unified database at the front end, where a single choke point handles the distribution of workloads. Instead, each node should be able to function independently but communicate effectively with others in its cluster throughout the transaction.
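One way to picture the client side of this request is a retry loop that moves to the next node when a connection fails. The Python sketch below is illustrative only: run_transaction is a hypothetical callable supplied by the application, and the sketch assumes the transaction can safely be rerun from the start on a different node. It does not show the harder server-side capability of a second node continuing a transaction that is already in flight.

```python
def run_with_failover(nodes, run_transaction):
    """Try the transaction on each node in turn until one succeeds.

    Assumes run_transaction(node) raises ConnectionError when the node is
    unavailable and that rerunning the whole transaction is safe.
    """
    last_error = None
    for node in nodes:
        try:
            return run_transaction(node)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next node
    raise RuntimeError("all nodes failed") from last_error
```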
Q: As a DBA, what are your thoughts about things like auto-scaling, multi-tenancy, and the consolidation of workloads? Is there anything that you find exciting?
Just all of it, but let’s look at scaling.
From a cost-cutting perspective, scaling up or down during specific periods of the day or week when massive compute is not needed can provide significant savings. For instance, I used to manage a warehouse system. We were not Amazon. We were not shipping 24/7, so we only required massive compute between 5 AM and 7 PM (on the outside). Beyond these hours, we only needed compute to run some reports and billing, which was completed by 9–10 PM.
Scaling back during non-peak hours would have been ideal, and scaling up again at 3–4 AM to handle the load would have been beneficial. Auto-scaling becomes particularly intriguing when there’s no way to predict when a sudden influx of traffic will occur.
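The warehouse schedule described above can be expressed as a small lookup from hour of day to cluster size. This Python sketch is an assumption-laden illustration: the node counts (6, 3, and 1) are invented, and the boundaries (scale up at 4 AM, peak until 7 PM, reporting until 10 PM) are one reading of the hours mentioned in the interview.

```python
def desired_nodes(hour, peak=6, reporting=3, idle=1):
    """Return the target node count for a given hour of the day (0-23).

    Hypothetical schedule: scale up at 4 AM ahead of the shipping day,
    stay at peak until 7 PM, keep a smaller cluster for reports and
    billing until 10 PM, then drop to the minimum overnight.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if 4 <= hour < 19:
        return peak
    if 19 <= hour < 22:
        return reporting
    return idle
```

A fixed schedule like this only covers predictable load; unpredictable spikes would need scaling driven by observed performance instead.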
So, how do I scale up to handle a “crazy” load already happening? I can see, performance-wise, where my database is going; I just need to scale to match that.
It’s got to be there. I mean, we are in financial services. Security must be at every level, whether at rest or in transit. Additionally, it’s crucial to be able to provide your own encryption key rather than having to trust someone else to manage it. That way, in case of an emergency, I can pull my key, and the data remains encrypted.
But it’s more than just having encryption at all levels. It’s also essential to have strong authentication measures in place, such as multi-factor authentication for DBAs and authentication for both the database and control plane. It’s just everything.
Distributed SQL definitely has to be the future. Everybody’s moving in that direction on the applications side, and they’re asking for the database to catch up. So that’s where everything is going.