Reining in Peak Events and Freak Events With Distributed SQL
Technology has woven itself into the fabric of our daily lives in ways we could not have envisioned a decade ago. The best technology is invisible, quietly powering the data-driven products and services we have come to rely on. It streamlines our lives, reaching us where we are, granting us immediate access to the services we need, and accommodating traffic surges without a hitch to the user experience.
Until it doesn’t.
Unprecedented peak events and unpredictable freak events are increasingly pushing the limits of system scalability and resilience. Taylor Swift’s epic Eras Tour broke five records this year, becoming one of the highest-grossing concert tours of all time, but along the way, it also broke Ticketmaster’s systems. Super Bowl LVII in February 2023 was the most-watched US-based telecast ever, drawing 115.1 million viewers, despite concerns around streaming lags and service unavailability. Online sales during Black Friday have continued to climb annually, topping a record $9B in 2022. Every year there are reports of retailer sites going down, shopping carts that empty themselves, and payment processes that fail.
These peak events are now becoming more frequent, perhaps even expected, as consumers wholeheartedly embrace digital experiences. Additionally, freak events—whether they’re caused by adverse weather, operator-induced cloud outages, or facility fires—have increased not only in frequency but also in their disruptive capacity. Both peak events and freak events significantly challenge the limits of the technological stack responsible for delivering services.
Many organizations are doubling down on their business continuity and disaster recovery (BC/DR) planning and scalability testing to ensure they can handle peak and freak events and avoid having their moment of triumph turn into a PR disaster.
To deliver services that scale and never fail, enterprises have drawn inspiration from web-scale companies like Amazon and Google. They have embraced cloud native technologies, microservices, DevOps, site reliability engineering, and chaos engineering to weather the storms, both natural and manmade.
Every business-critical application that customers interact with talks to a transactional database that offers low-latency, highly concurrent access to data. However, traditional database management systems like Oracle, DB2, and SQL Server face significant challenges because they were not designed for the resilience and scaling needs of modern applications operating in dynamic environments. These needs are now spurring a seismic shift in companies as application developers move their workloads off traditional single-server RDBMSs to cloud native databases designed for modern application workloads.
Distributed SQL has emerged as a leading solution for businesses building modern cloud native applications. This category of cloud native distributed databases uniquely combines the capabilities of relational data modeling, ACID transactions, and strong data consistency that business applications rely on, with the resilience and scalability that the modern world demands.
So, how can cloud native, distributed SQL databases like YugabyteDB help organizations fully prepare for peak events and freak events? Let’s take a look.
Peak events are not new. Successful ecommerce platforms and online retailers have years of experience putting plans in place for Thanksgiving and holiday shopping seasons. Streaming and broadcast vendors have learned to handle the peak demands of movie releases and sporting events. How do they do it? With extensive planning, rigorous testing, and rapid response when needed. However, in recent years, the peaks have gotten taller (“peakier” if you will), pushing systems beyond what they were originally built and tested for.
There is a limit to the number of requests that transactional databases can handle (throughput), the number of concurrent connections, and the total amount of data stored. Engineering teams have to work around these limits using techniques like connection pooling, smart data modeling, caching, and sharding (the process of breaking up database tables into smaller tables, each of which runs on a separate database). A coordinator routes incoming requests to the database that has the data needed to process the request.
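The coordinator-based routing described above can be sketched in a few lines. This is a hypothetical illustration, not any vendor's implementation; the shard names and the hashing scheme are invented for the example:

```python
# A minimal sketch of manual (application-level) sharding: a coordinator
# hashes each customer ID to decide which database shard owns that
# customer's rows, then routes the request there.
import hashlib

# Hypothetical shard names; a real deployment would map these to
# separate database instances.
SHARDS = ["orders_db_0", "orders_db_1", "orders_db_2", "orders_db_3"]

def shard_for(customer_id: str) -> str:
    """Deterministically route a customer's requests to one shard."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the hash is deterministic, every request for the same customer lands on the same shard; the downside, as the next paragraph notes, is that the application now owns this routing logic and must maintain it as shards are added or become hot.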
While these engineered solutions work in the short term, deploying and maintaining these workarounds becomes labor-intensive and cumbersome over time, increasing the fragility of the overall system and diverting engineering time away from other, revenue-producing activities. Caches work well for data that rarely changes but lead to consistency issues when data is frequently updated. Similarly, manually sharding data becomes ineffective if data in one of the shards becomes hot.
Distributed SQL databases are specifically designed for seamless scaling. They’re deployed on a cluster of servers that coordinate to present a single logical database to the applications. Behind the scenes, these databases automatically shard the data across multiple nodes or servers. This sharding is transparent to applications. The shards are automatically replicated and distributed across the cluster for resilience, and they can easily be split if they become hot.
Distributed SQL databases can also seamlessly scale when needed, with no disruption and without having to overprovision resources upfront. Database teams can simply add nodes to their database cluster to scale horizontally or replace their current servers with more powerful servers to scale vertically.
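To see why adding nodes scales a cluster horizontally, consider a toy model of shard placement. The names (`tablet-0`, `node-a`) and the round-robin assignment are illustrative only; real distributed SQL databases use more sophisticated placement and move data incrementally:

```python
# Illustrative sketch of horizontal scaling: shards are spread evenly
# across the cluster's nodes, and adding a node triggers a rebalance
# that shifts some shards onto the new node.
from collections import defaultdict

def rebalance(shards: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Assign shards to nodes round-robin so each node carries an even share."""
    placement = defaultdict(list)
    for i, shard in enumerate(shards):
        placement[nodes[i % len(nodes)]].append(shard)
    return placement

shards = [f"tablet-{i}" for i in range(12)]
three_nodes = rebalance(shards, ["node-a", "node-b", "node-c"])
four_nodes = rebalance(shards, ["node-a", "node-b", "node-c", "node-d"])
```

With three nodes each node carries four shards; adding a fourth node drops the per-node load to three shards, with no change visible to the application, which still sees one logical database.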
For example, a leading global retailer uses a distributed SQL database to run its massive product catalog, which comprises billions of mappings for over 100 million items and handles 100,000+ queries per second. During Black Friday, the retailer’s website experiences a tenfold increase in activity. Despite the demanding conditions of these peak events, the retailer not only copes but thrives, establishing itself as the top destination for shoppers seeking Black Friday deals.
Peak events take months of planning and preparation, but freak events, like unprecedented ice storms, tornadoes, hurricanes, and wildfires, don’t give engineering teams much (if any) warning. BC/DR requires continuous, ongoing readiness for incidents that can occur at any time. Scientists studying climate patterns have confirmed that extreme weather events are becoming more severe. As organizations ramp up their use of public cloud infrastructure, they are discovering that cloud outages at the availability zone level (and even at the region level) are not that uncommon.
When it comes to traditional transactional database resiliency, engineering teams have largely relied on replication. Active databases are replicated to one or more standbys in other data centers or availability zones. If the active database fails, the standbys can quickly take over without any data loss.
Database replication solutions have evolved to offer sophisticated features and granular control over recovery point objectives (RPO) and recovery time objectives (RTO) in disaster recovery scenarios. However, database teams are tasked with the deployment, configuration, management, and ongoing testing of these standalone replication products. The operational complexity of maintaining a disaster recovery posture can overwhelm IT teams, leaving them unsure about whether their data will remain safe and available when (not if) there’s a cloud or data center outage.
Distributed SQL databases are engineered with built-in resilience. Database clusters can be deployed across data centers, availability zones, and even cloud regions. Data in the cluster is automatically and synchronously replicated across at least three servers so that if an availability zone or region fails, the data is still available with no data loss.
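The synchronous, majority-based replication described above can be modeled with a toy quorum write. The `Replica` class and its failure behavior are simplified stand-ins for real network round trips to database servers in different zones; the zone names are invented for the example:

```python
# Hedged sketch of synchronous quorum replication: a write is committed
# only after a majority of the (at least three) replicas persist it, so
# losing one availability zone loses no committed data.
class Replica:
    """A stand-in for a database server in one availability zone."""
    def __init__(self, name: str, available: bool = True):
        self.name, self.available, self.log = name, available, []

    def append(self, record):
        if not self.available:
            raise ConnectionError(f"{self.name} unreachable")
        self.log.append(record)

def quorum_write(replicas: list[Replica], record) -> bool:
    """Commit the write only if a majority of replicas acknowledge it."""
    acks = 0
    for replica in replicas:
        try:
            replica.append(record)
            acks += 1
        except ConnectionError:
            pass  # an unreachable zone simply fails to acknowledge
    return acks >= len(replicas) // 2 + 1

zones = [Replica("zone-a"), Replica("zone-b"), Replica("zone-c", available=False)]
committed = quorum_write(zones, {"order": 1001})  # 2 of 3 acks: committed
```

With one of three zones down, two acknowledgments still form a majority, so the write commits with zero data loss; only losing a majority of zones blocks writes.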
Some advanced distributed SQL databases go a step further, offering built-in asynchronous replication between two database clusters. Both synchronous and asynchronous replication are built into the RDBMS software, eliminating the need for bolt-on, third-party solutions. This gives engineering teams complete control over their DR solution.
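Asynchronous cross-cluster replication trades a small window of replication lag for lower write latency on the primary. Below is a minimal sketch assuming a simple change-log model; the `Cluster` class and `replicate` function are illustrative, not a real database API:

```python
# Illustrative sketch of asynchronous replication between two clusters:
# the primary acknowledges writes immediately and ships its change log
# to the standby in the background, so the standby can lag behind
# (a nonzero recovery point objective, or RPO).
class Cluster:
    def __init__(self):
        self.log = []

    def write(self, record):
        self.log.append(record)  # acknowledged without waiting on the standby

def replicate(primary: Cluster, standby: Cluster) -> int:
    """Ship changes the standby has not yet seen; return how many lagged."""
    missing = primary.log[len(standby.log):]
    standby.log.extend(missing)
    return len(missing)

primary, standby = Cluster(), Cluster()
for n in range(5):
    primary.write({"txn": n})
lag = replicate(primary, standby)  # records at risk before this catch-up
```

The records returned by `replicate` are exactly the ones that would be lost if the primary cluster failed before the catch-up ran, which is why asynchronous replication is typically paired with the synchronous, quorum-based replication inside each cluster.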
The global retailer mentioned earlier was impacted by an unexpected ice storm, dubbed the Great Texas Freeze, which knocked out power for over 4.5 million homes and businesses in the region. Their public cloud data center was offline for four days. Even their backup generators failed. This retailer, which runs wholesale, discount department, and grocery stores, plus an active ecommerce site, had no application downtime because it deployed its distributed SQL database cluster across three regions with synchronous replication across the regions.
There will be more frequent and extreme peak events and freak events in the years to come. There’s no wishing the genie back into this particular bottle and no chance of our dependence on technology slowing down. Engineering teams must not only be diligent and thorough in their planning, preparation, and testing, but must also choose a database technology that aligns with their business needs. The ideal solution should simplify readiness while providing seamless scalability and inherent resilience. In the realm of transactional databases, distributed SQL emerges as the optimal choice.
NOTE: This article first appeared on VMBlog.com on October 24, 2023.
- Top Global Retailer Modernizes their Online Shopping Experience (Customer Story)
- YugabyteDB is the Retailers’ Choice for a Modern Transactional Database (Solution Brief)
- Why eCommerce Websites Crash (Blog)