Nothing’s better than the real thing: true cloud-native
Telco CTOs and CIOs, use these questions to figure out which vendors are doing cloud native right.
Every time I see news of a public cloud outage, a little part of me groans on the inside, because it means a slew of articles is about to be written about how you can’t trust telco workloads to the public cloud.
For example*:
Reading through articles like these, I invariably find myself shaking my head at the lack of understanding about the public cloud. The authors like to write about “the public cloud” being “down,” but they fail to make the key distinction that it’s not the *entire* public cloud (or even all of its services) suffering a loss of service. Readers need to understand that during a “public cloud outage,” other public cloud regions are still up and running like nothing’s happened. So it’s time to set the record straight about outages in the public cloud.
In the early days of the public cloud (I’m talking 2006–2012), there were significant outages affecting the services of global companies like Netflix and Apple. Remember the 2011 Amazon Elastic Compute Cloud (EC2) outage? Or the day that Google Talk, Twitter, and Azure all went down at once? Back then, I would have agreed that the public cloud wasn’t ready for telco workloads. Since then, however, the public cloud has become ready for carrier-grade workloads. So, what’s changed?
In short: the hyperscalers. They have been working hard and spending BIG BUCKS to improve the resiliency options they offer to customers. Because they serve a huge variety of customers with an equally wide variety of workload needs, they let each customer configure resiliency to match their workload. Which means resiliency is now YOUR problem.
The hyperscalers employ what they call a shared responsibility model for resiliency. AWS, for example, commits to the resiliency of its infrastructure—the hardware, software, networking, and facilities that run its services—and makes commercially reasonable efforts to ensure those services meet or exceed its contractual service level agreements (SLAs). Everything else becomes the responsibility of the customer. A service such as Amazon EC2 requires the customer to perform all of the necessary resiliency configuration and management tasks, and customers are responsible for managing the resiliency of their data, including backup, versioning, and replication strategies—not AWS. Azure takes a similar approach, as does Google Cloud.
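To make the customer side of that line concrete, here’s a minimal sketch (an illustration, not AWS’s prescribed method) of one task that falls entirely on you: taking your own EBS volume backups with boto3. The region, volume ID, and tags below are placeholders I’ve made up for the example.

```python
# Illustrative sketch: customer-managed EBS backups with boto3.
# Assumes AWS credentials are already configured; the volume ID
# is a placeholder, not a real resource.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def backup_volume(volume_id: str) -> str:
    """Create a point-in-time snapshot of an EBS volume."""
    response = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="Scheduled backup - customer-managed resiliency",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "purpose", "Value": "dr-backup"}],
        }],
    )
    return response["SnapshotId"]

snapshot_id = backup_volume("vol-0123456789abcdef0")  # placeholder volume ID
print(f"Started snapshot {snapshot_id}")
```

Nothing in AWS runs this for you by default; if you want scheduled backups, you schedule them yourself (or adopt a managed service that does it on your behalf).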
Net net: you are in control of your resiliency strategy. You decide whether your workloads need to run across multiple Availability Zones (AZs) in a single region as part of a high availability strategy. You can design a multi-AZ architecture if you need to protect workloads from issues like power outages, lightning strikes, tornadoes, earthquakes, or other disasters. Depending on a workload’s criticality, you can use more or fewer of the resiliency options. The benefit of this approach is that all of it is built in and ready to use at a moment’s notice as a service of the public cloud. Use more options and your workload will be more resilient, but you’ll spend more. On the other hand, if you don’t need it, you don’t pay for it. As they say—Your Mileage May Vary (YMMV).
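As a sketch of what “deciding to go multi-AZ” can look like in practice, here’s an illustrative boto3 snippet that discovers a region’s Availability Zones and spreads instances across them. The AMI ID and instance type are placeholders, and a real deployment would normally put these behind a load balancer and an Auto Scaling group rather than making raw run_instances calls.

```python
# Illustrative sketch: spreading instances across Availability Zones.
# The AMI ID is a placeholder; assumes default subnets exist per zone.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Discover the zones available to this account in the region.
zones = [
    z["ZoneName"]
    for z in ec2.describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )["AvailabilityZones"]
]

# Launch one instance per zone so a single-AZ failure
# doesn't take the whole workload down.
for zone in zones[:3]:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    print(f"Launched instance in {zone}")
```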
Figuring out your approach for each workload will be key to your move to the cloud and your costs once you’re there (see Table 1, and the failover sketch after it):
Table 1: Cost vs. difficulty of DR strategies

| Disaster Recovery (DR) Strategy | Cost | Difficulty level |
| --- | --- | --- |
| One Availability Zone, one region | 💰 | 🍰 |
| Multiple Availability Zones, one region | 💰💰 | 🛠️ |
| Multiple regions | 💰💰💰💰 | 🛠️🛠️ |
| Multiple hyperscalers | 🏦 | 😩 |
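To give a flavor of what the “multiple regions” row involves, here’s a minimal, illustrative sketch of DNS failover between two regions using Route 53 health checks and failover routing. The hosted zone ID, domain names, and IPs are placeholders, and this deliberately ignores the hard part of multi-region DR: replicating your data and state to the standby region.

```python
# Illustrative sketch: DNS failover between two regions via Route 53.
# Hosted zone ID, record names, and endpoint IPs are placeholders.
import boto3

route53 = boto3.client("route53")

# Health check that probes the primary region's endpoint.
health_check_id = route53.create_health_check(
    CallerReference="primary-endpoint-check-001",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "Port": 443,
        "ResourcePath": "/healthz",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

# PRIMARY record answers while healthy; SECONDARY takes over on failure.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000000000000000",  # placeholder zone ID
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "A",
            "SetIdentifier": "primary-us-east-1",
            "Failover": "PRIMARY",
            "TTL": 60,
            "HealthCheckId": health_check_id,
            "ResourceRecords": [{"Value": "198.51.100.10"}],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "A",
            "SetIdentifier": "secondary-eu-west-1",
            "Failover": "SECONDARY",
            "TTL": 60,
            "ResourceRecords": [{"Value": "203.0.113.10"}],
        }},
    ]},
)
```

Note that the cost jump in the table comes from running (and keeping in sync) a second full stack in the standby region, not from the DNS plumbing itself.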
The ability to avoid outages is available to every public cloud customer. Next time you read a write-up about an internet service going down that blames the public cloud provider, remind yourself what it really means: the affected customers chose not to spend money on resiliency, or had an operational snafu and didn’t set things up right. Don’t blame the cloud vendors. They’ve invested hundreds of billions of dollars building out 69 regions in 39 countries to give you the power to avoid outages.
So, if you’re putting workloads on the public cloud (as you should be), then avoiding outages is your problem. Luckily for you, public cloud technology continues to become more and more resilient, and outages are becoming less significant. This shit is more than battle-tested, so use it to your advantage and GO CLOUD!
* Where are the articles about the on-premises outages — Rogers and KDDI — where entire systems went down for most of a day to several days? (Service restoration for Rogers customers took from 19 hours to several days; KDDI’s outage lasted two days.) Had those services been in the public cloud, the telcos might have been able to fail over to another Availability Zone or region in seconds. Tell me again: which approach is riskier?