Looking back, July 19, 2024, was a challenging day for customers using Microsoft Azure or Windows machines. Two major outages hit at roughly the same time: one affecting machines running CrowdStrike Falcon and one affecting Microsoft Azure compute resources in the Central US region. Together, they disrupted many people and put many businesses on pause for hours or even days.
The overlap of these two issues was a nightmare for travellers. In addition to the blue screens on airport terminals, they could not get information from airport websites, airline personnel, or support lines, because those were themselves affected by either the Central US outage or the CrowdStrike outage.
But what happened in reality?
A faulty CrowdStrike update crashed Windows computers globally, from airports and healthcare to small businesses, hitting over 8.5 million machines. Even though the Falcon Sensor defect was identified and a fix was deployed shortly after, recovery took much longer. In parallel with CrowdStrike, Microsoft provided a recovery tool that helped customers fix affected machines.
Around the same time, customers running their businesses in the Central US region started to notice availability issues, connectivity issues, or service management failures. The outage affected compute and database services, and even Microsoft 365, Dynamics 365, and Microsoft Entra ID, because of cross-dependencies. The cause of the Microsoft Azure Central US outage was a security component of Azure Storage that created a domino effect.
The Microsoft team's reaction was fast, and customers who followed the reliability best practices recommended by the Azure Well-Architected Framework (WAF) saw minimal impact on their business. Within the first minute of the outage, Microsoft detected unhealthy Azure SQL DB nodes, triggering further investigation and mitigation actions. Service monitoring detected drops in Azure VM availability after 26 minutes. In less than two hours, all Azure SQL DB instances with geo-failover configured were restored in another US region.
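For illustration, the sketch below shows what a customer-initiated failover of an Azure SQL DB failover group could look like with the Python Azure SDK (the azure-identity and azure-mgmt-sql packages). The subscription, resource group, server, and failover group names are hypothetical placeholders, and model fields can differ between SDK versions, so treat this as a sketch rather than a drop-in script.

```python
# Sketch: check the replication role of an Azure SQL failover group and,
# if the secondary is not yet primary, fail over to the secondary server.
# Assumes azure-identity and azure-mgmt-sql are installed; all resource
# names below are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-dr-demo"            # hypothetical resource group
SECONDARY_SERVER = "sql-dr-eastus2"      # hypothetical secondary logical server
FAILOVER_GROUP = "fg-orders"             # hypothetical failover group name

client = SqlManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Inspect the failover group as seen from the secondary server.
group = client.failover_groups.get(RESOURCE_GROUP, SECONDARY_SERVER, FAILOVER_GROUP)
print(f"Current replication role on {SECONDARY_SERVER}: {group.replication_role}")

if group.replication_role != "Primary":
    # Planned failover: promotes the secondary to primary without data loss
    # (force-failover variants that allow data loss are a last resort).
    poller = client.failover_groups.begin_failover(
        RESOURCE_GROUP, SECONDARY_SERVER, FAILOVER_GROUP
    )
    poller.result()  # block until the failover completes
    print("Failover group is now primary in the secondary region.")
```

Because applications connect through the failover group's listener endpoint, a successful failover does not require changing connection strings.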
The customers with minimal impact on their business were the ones that:
- Implemented a strong DR strategy across their whole ecosystem (e.g., Azure SQL DB, Azure Cosmos DB).
- Reviewed and implemented the reliability strategies of the Azure WAF according to their business needs.
- Kept DR playbooks up-to-date and ran drills every few weeks or months.
- Implemented an active-active or active-passive approach for the computation layer.
- Configured dashboards and alerts for Azure services and workloads (see the probe sketch after this list).
- Made sure alerts notified the right people in the support team so they could take further action.
- Used IaC, automation, and up-to-date deployment playbooks.
- Kept backups and copies of data in a separate region.
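To make the dashboards-and-alerts item above concrete, here is a minimal probe sketch: it checks a workload health endpoint and pages the on-call channel when it stops responding. Both URLs are hypothetical, and in practice this role is usually covered by Azure Monitor availability tests, dashboards, and action groups rather than a custom script.

```python
# Sketch: a minimal availability probe that alerts the on-call channel when a
# workload endpoint stops responding. Endpoint and webhook URLs are
# hypothetical placeholders.
import json
import urllib.error
import urllib.request

HEALTH_ENDPOINT = "https://orders.example.com/health"   # hypothetical workload endpoint
ONCALL_WEBHOOK = "https://hooks.example.com/oncall"     # hypothetical alert webhook

def endpoint_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

def notify_oncall(message: str) -> None:
    """Post an alert payload so the right people are paged."""
    body = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        ONCALL_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=5.0)

if not endpoint_is_healthy(HEALTH_ENDPOINT):
    notify_oncall(f"Availability drop detected for {HEALTH_ENDPOINT}; start the DR playbook.")
```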
These two outages showed how fragile the digital ecosystem is. The butterfly effect generated by a slight disturbance in Azure Storage's security layer put many businesses on hold. It was a real test for companies without a strong reliability strategy. The most affected were those who relied only on the Microsoft Azure Central US region without a working DR strategy. The lack of infrastructure as code (IaC), automation, and a working disaster recovery playbook made the situation even harder. This blackout underscores the importance of robust disaster recovery strategies and the need for regular testing and maintenance.
Are multi-cloud and hybrid cloud the future because of this?
Running the same system across multiple cloud vendors is challenging. It comes with extra cost, longer timelines, and additional complexity, making it harder for a company to use each cloud vendor's key features and capabilities. This drastically increases running and maintenance costs, making such setups less flexible and attractive. Even from the reliability point of view, things become more complicated and harder to maintain.
Hybrid and multi-cloud ecosystems are only at the beginning. The lack of database, storage, and compute products that are platform-agnostic yet optimised to run on multiple cloud vendors makes it difficult to find the right balance between value and complexity. This market is expected to grow, becoming more platform-agnostic on the customer side and more aware of vendor-specific capabilities at the platform layer.
Some specific industries and services require a hybrid or multi-cloud approach from the beginning. Payment, transportation, and healthcare services should rely on a strong reliability strategy that involves at least two vendors (multi-cloud or hybrid cloud) for specific business streams.
Multi-cloud is too expensive, so what should I do?
The Microsoft Azure outage on July 19th affected organisations with a poor reliability strategy. In most situations, the business applications ran only in the Central US region, and no disaster recovery strategy existed. Best practices related to reliability were not implemented, disaster recovery playbooks were outdated, and the support team was not ready to manage such situations. The lack of IaC and automation added further complexity to the restoration procedure.
In most cases, the situation could have been avoided if customers had followed Microsoft's DR recommendations for each cloud service. Following the DR strategies of the Microsoft Azure Well-Architected Framework and aligning them with business-specific needs ensures that data and workloads are ready for use in a different Azure region. High availability can be accomplished using active-active or active-passive strategies.
A DR strategy that relies on data backups in a different region, automated restoration playbooks, and IaC can keep potential downtime to 1-2 hours. If your business can afford to be offline for a few hours, this approach provides a good balance between reliability and cloud infrastructure costs.
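As a sketch of what such an automated restoration playbook can look like, the following outline strings the main steps together. Every helper is a hypothetical placeholder (delegated here to make targets) for a step that would normally call your IaC pipeline, the Azure SDK or CLI, and your test suite.

```python
# Sketch: the shape of an automated restoration playbook for a backup-based DR
# strategy. All commands below are hypothetical placeholders.
import subprocess

DR_REGION = "eastus2"  # hypothetical recovery region

def deploy_infrastructure(region: str) -> None:
    """Recreate the core infrastructure in the recovery region from IaC."""
    subprocess.run(["make", "deploy", f"REGION={region}"], check=True)

def restore_data_from_backup(region: str) -> None:
    """Restore databases and storage from geo-redundant backups."""
    subprocess.run(["make", "restore-backups", f"REGION={region}"], check=True)

def run_smoke_tests(region: str) -> None:
    """Verify the restored environment before sending traffic to it."""
    subprocess.run(["make", "smoke-tests", f"REGION={region}"], check=True)

def switch_traffic(region: str) -> None:
    """Point DNS / traffic management at the recovered environment."""
    subprocess.run(["make", "switch-traffic", f"REGION={region}"], check=True)

if __name__ == "__main__":
    # The order matters: infrastructure first, then data, then validation,
    # and only then user traffic. Each step should be idempotent so the
    # playbook can be re-run safely during a drill or a real incident.
    deploy_infrastructure(DR_REGION)
    restore_data_from_backup(DR_REGION)
    run_smoke_tests(DR_REGION)
    switch_traffic(DR_REGION)
```

Running exactly this playbook during regular drills is what keeps the 1-2 hour downtime estimate realistic.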
How can I improve my reliability score?
An assessment is required to improve the reliability of your
solution. In addition to a readiness assessment of the infrastructure,
applications, data, security, and compliance policies, the assessment needs to
identify the dependency graph, the scaling strategy, and the automation
maturity level. Understanding the business context and the impact of
availability on the business is crucial.
The main drivers should be the business needs, aligned with
compliance and industry-specific characteristics. The recipe for success is a
balanced reliability strategy, where acceptable downtime is considered and
mitigated accordingly.
The future of hybrid & multi-cloud
The initial momentum after July 19th was towards multi-cloud strategies. However, a multi-cloud strategy only partially protects you and comes with additional risks and complexity. Indirect dependencies such as IAM, virtualisation, and storage platforms can be your platform's Achilles' heel.
External dependencies that can affect business continuity are another essential factor to consider.
The first step in the high-reliability journey is to follow each cloud provider's recommendations and align them with the specific business needs. The second layer of protection is an active-active or active-passive strategy across the same cloud vendor. Expanding it across a multi-cloud or hybrid approach is the third layer of protection, which provides high reliability but is more complex and expensive.
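As a small illustration of the second layer, here is a sketch of the client-side view of an active-passive setup within a single cloud vendor: the application prefers the primary regional endpoint and falls back to the passive one only when the primary fails. Both endpoints are hypothetical, and in Azure this routing would usually be handled by Front Door or Traffic Manager rather than application code.

```python
# Sketch: client-side failover for an active-passive deployment across two
# regions of the same cloud vendor. Both endpoints are hypothetical.
import urllib.error
import urllib.request

# Ordered by priority: primary (active) first, secondary (passive) second.
REGIONAL_ENDPOINTS = [
    "https://api-centralus.example.com",   # hypothetical active region
    "https://api-eastus2.example.com",     # hypothetical passive region
]

def fetch_with_failover(path: str, timeout: float = 3.0) -> bytes:
    """Try each regional endpoint in priority order and return the first answer."""
    last_error = None
    for base_url in REGIONAL_ENDPOINTS:
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as error:
            last_error = error  # remember the failure and try the next region
    raise RuntimeError(f"All regional endpoints failed: {last_error}")

if __name__ == "__main__":
    print(fetch_with_failover("/health"))
```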
Remember that even a multi-cloud approach that relies on the same platform everywhere is not bulletproof. An issue like the one reported by CrowdStrike could affect that shared platform, creating a digital blackout across your systems.
A mitigation plan and recovery playbook are mandatory and make the difference between success stories and 404 businesses.