“YugabyteDB Managed” is now called “YugabyteDB Aeon”. To find out more, visit our launch blog.

Catching Up and a Conversation About YugabyteDB Managed

Juan Tellez

February 10, 2023

It’s been over a year since YugabyteDB Managed, Yugabyte’s DBaaS, went live, and I realized that I hadn’t spoken to my friend Daniel since then. Daniel works for an online food delivery company that uses the open source version of YugabyteDB. I’ve been meaning to catch up with him and get his thoughts on both our product and on DBaaS in general.

After a few failed attempts, we finally managed to put a date on the calendar to meet at a local bar. Over drinks we engaged in the usual engineer banter: what’s up with Burning Man, plus jokes about each other’s technology stack. You know, Python vs Go, AWS vs GCP. All good fun.

Then we addressed the elephant in the room, and Daniel started peppering me with questions. He started with a very direct one.

What’s in a DBaaS?

“So,” he began, “What’s really in a DBaaS? We are running five instances of YugabyteDB, a bunch of PostgreSQL, and a few other databases. I have some Ansible, Puppet, some Bash, a bit of Python, and I’ve got the whole thing glued up. I can create one of those bad boys fairly easily without too much work. So, what’s the deal with you guys and your DBaaS?”

Wow! Just one beer in, and he was coming in hot! I signaled the bartender for some help – two more pints please?!

I decided to start with the basics, “First we need a highly available control plane. The control plane is the thing we call every time someone goes in and creates a database—or changes the database in any way. Your scripts don’t need to be highly available. Ours do. They can never go down, and if we do have any planned outages, that’s in our SLA contract. In fact if we fail, it costs Yugabyte money.”

My friend got that point, “Oh!” he said, taking a gulp of his beer, while he considered how to come at me again. I knew I needed to stay on the offensive, so I kept going!

“In fact we run multiple instances of our API service, and we integrate with a cloud Pub/Sub to deliver tasks reliably. When we upgrade our API Service Java instances we roll them one after the other, always keeping the versions compatible with each other, to guarantee that someone is always listening.”
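For readers who want to picture the rolling part, here is a minimal sketch in Python. It is not our actual deployment code; the function name `rolling_upgrade` and the version labels are made up for illustration. It only shows the invariant: one instance changes at a time, so the rest keep listening.

```python
from typing import List


def rolling_upgrade(fleet: List[str], new_version: str) -> List[List[str]]:
    """Upgrade one instance at a time and return the fleet after each step.

    Hypothetical helper: `fleet` is a list of version labels, one per instance.
    """
    if len(fleet) < 2:
        # With a single instance nobody would be listening during its restart.
        raise ValueError("need at least two instances to roll safely")

    current = list(fleet)
    snapshots = []
    for i in range(len(current)):
        # While instance i restarts, the other len(current) - 1 instances keep serving.
        current[i] = new_version
        snapshots.append(list(current))
    return snapshots
```

Every intermediate snapshot mixes old and new versions, which is why the two versions have to stay compatible with each other for the duration of the roll.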

Cluster Creation and Edits

He jumped in quickly, “So, how do you make sure your cluster creations and edits don’t fail during upgrades?”

“We have a two-level control plane architecture. We run YugabyteDB Anywhere in a headless mode under the hood. YugabyteDB Anywhere is structured to make operations idempotent. So, if anything fails along the way it can be restarted or resumed. It also makes our failure domains smaller. A failure in a single zone on a single cloud only has a partial effect on our mesh of global control plane agents. We use multi-zone Kubernetes clusters at each level, taking advantage of multiple replicas and easy K8s deployment to help with the components of our control engine.”
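To make "idempotent and resumable" concrete, here is a small Python sketch. It assumes a task is a list of named steps and that completed step names are checkpointed somewhere durable; the step names, the `done` set, and the simulated failure are all hypothetical and are not how YugabyteDB Anywhere is implemented.

```python
from typing import Callable, Dict, List, Set, Tuple

Step = Tuple[str, Callable[[Dict[str, str]], None]]


def run_task(steps: List[Step], state: Dict[str, str], done: Set[str]) -> None:
    """Run steps in order, skipping any already recorded as done."""
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier attempt
        action(state)  # may raise; a failed step is not recorded
        done.add(name)  # checkpoint only after the step succeeds


def demo() -> Tuple[Dict[str, str], Set[str]]:
    attempts = {"configure": 0}

    # Each step is written so that repeating it leaves the same result.
    def create_vpc(state: Dict[str, str]) -> None:
        state.setdefault("vpc", "vpc-1")

    def create_nodes(state: Dict[str, str]) -> None:
        state.setdefault("nodes", "3")

    def configure(state: Dict[str, str]) -> None:
        attempts["configure"] += 1
        if attempts["configure"] == 1:
            raise RuntimeError("simulated failure on the first attempt")
        state["configured"] = "yes"

    steps: List[Step] = [
        ("create_vpc", create_vpc),
        ("create_nodes", create_nodes),
        ("configure", configure),
    ]
    state: Dict[str, str] = {}
    done: Set[str] = set()
    try:
        run_task(steps, state, done)
    except RuntimeError:
        pass  # the task stops here; completed steps stay checkpointed
    run_task(steps, state, done)  # resumes at "configure"
    return state, done
```

The second call to `run_task` does not recreate the VPC or the nodes. It picks up at the step that failed, which is the property that makes a restart during an upgrade safe.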

My friend didn’t wait long to resume his questioning.

Change Control and Quality Assurance

“But that’s not enough, is it? Whenever you make changes, it has the potential to be a royal mess. How do you change all that infrastructure without messing up the whole thing?” he asked.

“Infrastructure as code!” I jumped in. “The whole thing is anchored on GitHub, uses GitHub agents, Terraform, and Python. If we need to make a change, it’s a pull request!”

“Ok, ok, ok, so that helps you with change control, and makes your changes idempotent and reliable, I get it. And then an intern merges in some code and the whole thing comes down in one merge!!! Hahaha!”

I laughed and agreed, “In theory that could happen, but we don’t have that problem. We have great interns, and when they are not doing amazing things in our hackathons, they are merging code that goes into our DEV cloud. There, our QA processes (short, long, unit, UI, and long-running cluster tests) run, backed by a QA team anchored in Sunnyvale and Bangalore. We then promote the code to a STANDBY cloud where it goes through a whole set of QA testing, and finally to our Production cloud. We do this twice a week, keeping the machinery in good working order. When we do make a mistake, we hotfix PROD quickly and safely, but that is a rare occurrence.”

My friend then spun up his next question.

Security

“Your networking must be a royal mess, all those databases, in multiple clouds, I’m sure if an intruder came in, they would be able to make off with data from all those customers.”

I paused for a moment. “Our networking architecture is clean. We use redundant VPNs between the levels in the control plane. We use VPCs to separate customer databases from each other. We separate the control plane from the data plane, and we don’t route packets between them. We use VPC Peering and Private links to connect with customer applications running in the cloud and load balancers using firewalls with whitelisting to prevent unwanted IP addresses from reaching out to any database that has open ports. So, the networking is clean. We then have a world-class security team that contracts external pen testing three times a year, using different companies to ensure we get different types of whitehat attacks. We have a team working hard to certify our services and we are SOC2 and ISO 27001 compliant, and we are working on two other certifications while making fast progress.

“Moreover, if anyone got access to one layer of our system they would have a hard time getting to other layers as everything is protected by firewalls. All systems are protected against escapes, and we are continuously adding improved monitoring and alerting systems to defend against attacks. With security you can never stand still.”
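The whitelisting mentioned above comes down to a membership test against a set of address ranges. Here is a minimal Python sketch using the standard `ipaddress` module. The function name `is_allowed` and the sample ranges (taken from the blocks reserved for documentation) are assumptions for illustration, not our firewall configuration.

```python
import ipaddress
from typing import Iterable

# Hypothetical allow list: one network range and one single host.
ALLOW_LIST = ["203.0.113.0/24", "198.51.100.7/32"]


def is_allowed(client_ip: str, allow_list: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any CIDR range in allow_list."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allow_list)
```

Anything that is not explicitly listed is rejected, so an empty allow list lets nobody in.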

Daniel was ready, “But you can’t defend yourself against inside attacks!”

“We use a system called Teleport to protect all resources in our network. If you want access to anything you need to request permission from someone on call. That makes it less likely for an inside attack to succeed. If it does succeed everything is not only audited, but recorded.”

My friend thought about it and then prepared his next line of questioning.

A DBA vs. DBaaS

“I don’t know man, I think I’d rather have a sharp DBA on my team. Someone with a furrowed brow and decades of experience.”

“Well, that’s often what our customers are paying us for,” I said. “As we are already inside YugabyteDB, we have insight into which versions have the fixes you need and the best time to start any upgrades. With YugabyteDB Managed, your DBA is actually a team composed of database developers, support engineers, QA leads, and SREs. They decide when to tell the BOT to start a fleet upgrade of our databases, and together they make a heck of a DBA!”

“What about OS’s, Certs, configuration files, and other annoying things like that?” Daniel asked.

“Our fleet upgrade BOT knows how to upgrade all those things, and we work hard to stay on the ball and keep the databases and the control planes up-to-date,” I responded. “Our CERT bot doesn’t even have to ask us; it has simple logic to see if you need a new Cert and just does it.”
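The "simple logic" for certificates can be as small as a date comparison. The Python sketch below is one plausible version of that check, not the bot’s real code. The 30-day renewal window and the name `needs_renewal` are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: renew once fewer than 30 days of validity remain.
RENEW_WINDOW = timedelta(days=30)


def needs_renewal(
    not_after: datetime, now: datetime, window: timedelta = RENEW_WINDOW
) -> bool:
    """Return True if the certificate expires within `window` or has expired."""
    return not_after - now <= window


def check_today(not_after: datetime) -> bool:
    """Convenience wrapper that compares against the current UTC time."""
    return needs_renewal(not_after, datetime.now(timezone.utc))
```

Passing `now` explicitly keeps the check easy to test. A certificate that has already expired gives a negative remaining time, so it is flagged for renewal as well.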

I could tell my friend was looking for more objections, but this is all part of the fun. Software engineers are contrarians; they love to guess what things will break!

Support

Daniel ordered another drink and looked at me for a few seconds wondering how to demonstrate why what we were doing with our managed DBaaS offering could end in disaster. “Well, if something goes wrong with my database, I will probably not get any human support. You’ll just send me to a chatbot and good luck getting any real answers!”

“Actually, this is an area where I’m really proud of what we are doing!” I replied.

“We combine our support organization with a global SRE core, using skills from both teams and access to our database and QA teams to solve problems.

“Our database generally doesn’t crash, so an alert is almost never a do or die emergency. If we are alerted, it usually means one node is down, or the database is not performing optimally, or you can’t connect to it. We usually have time to correct things.

“If the system doesn’t immediately restart that misbehaving node, or load balancer, a human may issue a command to start it up more forcefully. Perhaps we’ll recognize some bug that has already been identified, or give log access to an engineer who can look deeply into the patterns and identify the problem.

“It is great to have our experienced support core available on call, combining with our skilled SRE team to try and figure out why the database has slowed down and how to speed it back up.

“Sometimes we will add nodes on our own if the software is not behaving as expected, in order to compensate for a potential software problem. Once we have resolved the issue we will restore the original configuration.

“The point is not to automate everything, just the things that benefit from automation. We will add AI in the right places as we see opportunities, but for now humans are driving the BOTs, not the other way around.”

Daniel wasn’t done.

Cost

“Well, you must be super expensive then!”

“Lucky for you, no,” I laughed, “We work really hard to get a configuration that works and to find cost-effective hardware combinations.

“Because we support lots of databases, we prioritize bulk deals and prepaid arrangements. So our prices are very competitive for a world-class distributed SQL database.”

I then took a step back, because I’m an engineer, but I was starting to sound like a sales guy! I looked around the empty bar. It had gotten late, and the bartender was obviously hoping we were finished so he could close up!

My friend was thoughtful for a moment: “So it is clear you guys did a nice job, but how does all that help me? We have some apps running on AWS, a few others on GCP, yet others on bare metal in the closet.”

“Well, we support both GCP and AWS, and more regions than I care to mention. You can place your database in zones near you and use VPC peering to connect to it. For your bare metal applications you can use the DNS address of your database, add your applications’ addresses to the allow list, and connect to it over the internet. Easy.”

I could see that my friend was beginning to reevaluate, so I pressed my advantage… “Look man, I don’t know what you guys need, but you should take a look at YugabyteDB Managed,” I said, waving at the bartender for the bill.

“Well,” he said, “I really need to deploy a geo-distributed cluster.” He looked at me. I nodded. A moment passed. “That too?!”

“Yup. That too.”

“Ok man. You win,” he laughed, and walked out of the bar with a smile, and I guess, I had to take care of the bill.

Blast! Sometimes it doesn’t pay to win the argument!

NOTE: The SREs, engineers, sales people, QA engineers, support engineers, and bartenders in this story may have a resemblance to real people, but they are entirely fictional. Any similarity is coincidental. The code is all real.
