Self-hosting and multi-cloud are knee-jerk reactions to every cloud outage. Start with multi-region, if you must!


The cloud is unreliable! Unreliable! The press, and the pundits of LinkedIn, jumped to that conclusion not even 12 hours after the Amazon Web Services outage of October 20, 2025, but their alternatives are no better.

The Best Fallback Is Pre-Technological

My dentist and his amazing team computerized their office long ago. Yet, before leaving for the day, the office manager still prints the next day's schedule. If the worst happens, patients know the times of their own appointments, and staff can follow the printed schedule.

At first glance, this idea might not seem scalable or relevant to a global business. It is. If an airline exported passenger manifests 24 hours before each flight and printed them at each local "station" (airport), the vast majority of passengers could board even if the reservation system went down, the network link failed, or a software update rendered the computers at the gate unusable (hello, CrowdStrike!). Only rebooking and last-minute ticket purchases, few in number, would require heroic intervention.

For a call center, an interactive Web site, or a mobile application, anticipate the need for an externally updatable status recording or "banner" message. It should be the first feature you implement; at least you'll be able to reassure your customers when something isn't working.

For managers and employees, have alternative tasks in mind so that the day isn't lost.

You Lack the Staff, Experience, and Extra Equipment to Host It Better Yourself

Self-hosting is the comforting, knee-jerk response to any cloud computing outage. Is your engineering department as big, let alone as highly specialized, as AWS's (or Microsoft Azure's or Google Cloud's, for that matter)? Do you have umpteen spares for each piece of equipment, and are they spread around the country, or the world? Probably not.

Without getting into physical dependencies like redundant power feeds, consider the knowledge and experience of your software engineers. Spending an hour in the Amazon Builders' Library ought to convince you that reliability is a full-time job, quite apart from running your primary business.

Learn to Steer the Ship You're On

Multi-cloud is another knee-jerk response when any one provider goes down. Is there empirical evidence that one cloud provider is more reliable than another? Does your engineering department have the staff and experience necessary to get the full benefit of one cloud provider, let alone two?

On February 28, 2017, a major outage affected AWS's us-east-1 region. At the time, services in other regions were still heavily dependent on us-east-1. I remember not being able to create new Relational Database Service databases from snapshots in us-west-2, even though the snapshots had been archived there and RDS's compute and disk resources were located there. It turned out that AWS was keeping critical metadata in us-east-1.

During the October 20, 2025 AWS outage, which again affected the us-east-1 region, I happened to be up late working. My workload was in us-west-2. I was able to create, update and delete containers in Elastic Container Service, push images to and pull images from Elastic Container Registry, and create, update and delete entire private networks (Virtual Private Clouds). My application worked normally.

The one thing I could not do was call iam:GetPolicy. I resorted to Terraform resource targeting to avoid referencing infrastructure I wasn't changing. This worked beautifully. Identity and Access Management (IAM) roles and policies are non-regional. Though it's likely that IAM still depends on us-east-1, the AWS ecosystem has become much better isolated over the years.
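For the record, the resource targeting was nothing exotic; Terraform's -target flag limits a plan to the listed resource addresses. A minimal sketch, with a hypothetical resource address standing in for my real one:

```
# Hypothetical resource address: plan and apply only the ECS service,
# so Terraform skips refreshing unrelated parts of the configuration.
# (Dependencies of the target are still included.)
terraform plan -target=aws_ecs_service.app -out=tfplan
terraform apply tfplan
```

Terraform itself warns that targeting is meant for exceptional circumstances, and a regional outage certainly qualifies.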

Before you reach for multi-cloud, consider a thorough multi-region design on AWS or your cloud provider of choice; they all offer a similar idiom. That said, keep in mind that multi-anything is hard! It's a true distributed computing problem.

Multi-Region Challenge 1: Duplicate Your Infrastructure

The first challenge is duplicating your infrastructure. It wasn't until June 2025 that the Terraform AWS Provider gained enhanced multi-region support. Before then, writing multi-region HashiCorp Terraform meant repeating nearly identical syntax for each region, leaving templates and configurations quite brittle.
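To make the repetition concrete, here is roughly what pre-6.0 multi-region Terraform looked like: one provider alias per additional region, and one copy of every resource per region (hypothetical names):

```hcl
# One provider alias per additional region ...
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

# ... and one copy of each resource per region, because the provider
# meta-argument must be a static reference, not a loop variable.
resource "aws_kms_key" "east" {
  description = "workload key (us-east-1)"
}

resource "aws_kms_key" "west" {
  provider    = aws.west
  description = "workload key (us-west-2)"
}
```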

AWS CloudFormation has always facilitated region-independent infrastructure-as-code templates. Those of us who learn and use AWS idioms from the start have the advantage.

For example, Terraform cheerfully invites you to write for_each loops. You would loop over regions, defining a separate Key Management Service (KMS) encryption key in each region, and referencing specific source and destination keys as you re-encrypt data crossing a region boundary.
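With the enhanced region support, that loop might look something like the following sketch (hypothetical names; it assumes the AWS provider's newer per-resource region argument):

```hcl
variable "regions" {
  type    = set(string)
  default = ["us-east-1", "us-west-2"]
}

# One independent KMS key per region. Every cross-region copy must then
# name a specific source key and a specific destination key.
resource "aws_kms_key" "workload" {
  for_each    = var.regions
  region      = each.value # per-resource override from the enhanced region support
  description = "workload key (${each.value})"
}

output "workload_key_arns" {
  value = { for region, key in aws_kms_key.workload : region => key.arn }
}
```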

CloudFormation, less sophisticated on the surface, compels you to adopt multi-region encryption keys. Define the "child" (replica) keys of your multi-region encryption key in a CloudFormation StackSet, which can target multiple regions without a loop. In the CloudFormation template for your workload, format the encryption key parameter as ACCOUNT:key/mrk-ID instead of a fully-qualified Amazon Resource Name like arn:aws:kms:us-east-1:001122334455:key/01234567-89ab-cdef-0123-456789abcdef. You can deploy your workload in whatever region(s) you like, with no need to track different encryption key identifiers.
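To be fair, the multi-region-key idiom itself is not exclusive to CloudFormation. A hypothetical Terraform sketch of the same idea (again assuming the per-resource region argument):

```hcl
# Primary multi-region key; its key ID begins with "mrk-" and stays the
# same in every region where a replica exists.
resource "aws_kms_key" "primary" {
  description  = "workload key (primary)"
  multi_region = true
}

# Replica in a second region, derived from the primary.
resource "aws_kms_replica_key" "west" {
  region          = "us-west-2"
  primary_key_arn = aws_kms_key.primary.arn
  description     = "workload key (replica)"
}
```

Either way, the point stands: one logical key, one identifier, any region.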

Multi-Region Challenge 2: Make the Regions Independent

The second challenge is making resources in different regions truly independent.

Can you build your application to use an independent database in every region? Can you minimize the information that must be stored centrally? Can you continue some processing, perhaps with cached data, while access to the central database is cut off? What's the reconciliation process, once access to the central database has been restored?

Multi-Region Challenge 3: Shift Traffic Between Regions

The third and final multi-region challenge is being able to shift traffic. Being certain that a region has failed is hard. Re-routing traffic is hard. Maintaining surplus capacity ("warm" or "hot" standby) is expensive, and scaling one region fast enough to handle two regions' worth of traffic is hard.
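On AWS, for instance, one common mechanism is DNS failover in Route 53: a health check on the primary region and a secondary record that takes over when the check fails. A hypothetical sketch (the hostnames and hosted zone are stand-ins):

```hcl
variable "zone_id" {
  description = "Route 53 hosted zone for example.com (hypothetical)"
  type        = string
}

# Health check against the primary region's endpoint.
resource "aws_route53_health_check" "primary" {
  fqdn              = "app-us-east-1.example.com"
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

# Primary record: served while the health check passes.
resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
  records         = ["app-us-east-1.example.com"]

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# Secondary record: served only when the primary is deemed unhealthy.
resource "aws_route53_record" "secondary" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "secondary"
  records        = ["app-us-west-2.example.com"]

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

Records like these are only the start: you still have to tune the health check so it neither flaps nor lingers, and the secondary region still has to have the capacity to absorb the traffic.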

If uptime is essential to your business (that is, if the money saved or the revenue sustained by continuing to operate through occasional cloud outages exceeds the cost of the extra engineering work and the extra infrastructure), then start with a full multi-region architecture in your existing cloud.

Multi-region architecture is part of AWS's Well-Architected Framework, for example. "REL10-BP01 Deploy the workload to multiple locations" provides, "Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions [emphasis added]."

My claim is that you can address the first multi-region challenge quite cheaply, at the beginning, long before you need to instantiate your infrastructure templates in multiple regions. Use CloudFormation, or update every Terraform resource definition for enhanced region support and insist that your Terraform module suppliers do the same. (Outdated Terraform modules demonstrate that "free" software isn't necessarily free.)

Self-hosting isn't likely to be more reliable than the cloud. Every technology and every provider fails occasionally. And there is little reliability to be gained from implementing multi-cloud before you've figured out multi-region in your existing cloud.

Feel free to leave a comment on the LinkedIn version of this article!