Looking back, the 19th of July 2024 was a challenging day for customers using Microsoft Azure or Windows machines. Two significant outages affected customers using CrowdStrike Falcon or Microsoft Azure compute resources in the Central US region. In today's article, we'll focus on the Microsoft Azure outage.
These two outages affected many
people and put many businesses on pause for a few hours or even days.
A faulty CrowdStrike update affected Windows computers
globally, from airports and healthcare to small businesses, hitting over 8.5 million
machines. Even though the Falcon Sensor software defect was identified and
a fix deployed shortly after, the recovery took much longer. In parallel with CrowdStrike,
Microsoft provided a tool that helped customers fix the issue.
Around the same time, customers running their businesses in
the Central US region started to notice availability issues, connectivity issues, or
service management failures. The outage affected services such as Azure VMs,
Azure Cosmos DB, Azure SQL, and even Microsoft 365, Dynamics 365, and Microsoft
Entra ID (formerly Azure AD) because of the cross-dependencies.
The cause of the outage was a security layer in front of the storage
scale units that accepts disk operations only from specific ranges of network
addresses. The update and propagation mechanism for this allowlist was faulty: an incomplete
list of allowed network address ranges was broadcast, making some
storage scale units inaccessible from compute services in Azure
Central US. The propagation mechanism did not check the health of Azure VMs; it
continued to deploy across multiple availability zones, causing a regional outage
even for customers that followed availability zone best practices. This chain of
events underscores the intricate nature of cloud infrastructure and
the potential for widespread impact when a single component fails.
The impact was much lower for customers who used failover and DR strategies across their Microsoft Azure ecosystem. Azure Cosmos DB customers running with multi-region writes were not affected by the outage, in contrast with customers running single-region accounts in Central US, who were wholly or partially affected. Azure SQL DB customers with automatic failover policies configured had a similar experience: the geo-secondary was promoted to primary shortly after the incident started.
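To illustrate why failover groups made such a difference, here is a minimal sketch, assuming a hypothetical Azure SQL auto-failover group with the listener endpoint my-fog.database.windows.net (the group, database and credentials are placeholders, not values from the incident). An application that connects through the failover group listener rather than a region-specific server keeps working after the geo-secondary is promoted; it only needs to retry while the failover completes.

```python
import time
import pyodbc

# Hypothetical failover-group listener endpoint; the database name and
# credentials below are illustrative only.
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=tcp:my-fog.database.windows.net,1433;"  # listener, not a regional server
    "DATABASE=orders;UID=app_user;PWD=<secret>;"
    "Encrypt=yes;Connection Timeout=30;"
)

def query_with_retry(sql: str, retries: int = 5, backoff_s: float = 5.0):
    """Retry transient failures so the app rides out a geo-failover."""
    for attempt in range(1, retries + 1):
        try:
            with pyodbc.connect(CONN_STR) as conn:
                return conn.cursor().execute(sql).fetchall()
        except pyodbc.Error:
            if attempt == retries:
                raise
            # During failover the listener briefly refuses connections;
            # back off and try again instead of failing the request.
            time.sleep(backoff_s * attempt)

rows = query_with_retry("SELECT TOP 1 order_id FROM dbo.orders")
```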
Within the first minute of the outage, Microsoft detected unhealthy
Azure SQL DB nodes, triggering further investigation and mitigation actions. Service
monitoring detected drops in Azure VM availability after 26 minutes, and in
less than two hours all Azure SQL DB instances with geo-failover configured were
available again.
The Microsoft team's reaction was fast, and customers who
followed the reliability best practices recommended by the Azure Well-Architected
Framework (WAF) saw minimal impact on their business. The Azure WAF is a set
of design principles and best practices for building and operating reliable,
secure, and efficient systems in the cloud. It provides guidance on how to
design and implement your cloud solutions, and following its recommendations can
significantly improve the resilience of your systems. Most customers affected
by the outage lacked a DR strategy and relied only on services running in
the Central US region.
The customers with minimal impact on their business were the ones that:
- Implemented a strong DR strategy across their entire ecosystem (e.g. Azure SQL DB, Azure Cosmos DB)
- Reviewed and implemented the reliability recommendations of the Azure WAF according to their business needs
- Kept DR playbooks up to date and ran drills every few weeks or months
- Implemented an active-active or active-passive approach for the computation layer (see the sketch after this list)
- Configured dashboards and alerts for their Azure services and workloads
- Made sure the right people from the support team were notified by alerts and could take further action
- Used IaC, automation and up-to-date deployment playbooks
- Kept backups and copies of their data in a separate region
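As a hedged illustration of the active-passive idea for the computation layer, the sketch below probes a primary region endpoint and routes to a secondary one when the probe fails. The endpoints are hypothetical, and in a real setup this probing would typically be handled by Azure Front Door or Traffic Manager rather than hand-rolled code.

```python
import urllib.request
import urllib.error

# Hypothetical regional endpoints of the same workload; in practice these would
# sit behind Azure Front Door / Traffic Manager, which do this probing for you.
PRIMARY = "https://app-centralus.example.com/health"
SECONDARY = "https://app-eastus2.example.com/health"

def is_healthy(url: str, timeout_s: float = 3.0) -> bool:
    """Return True if the health endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def pick_endpoint() -> str:
    """Active-passive routing: use the primary region unless its probe fails."""
    if is_healthy(PRIMARY):
        return PRIMARY
    # Primary region is down (as Central US was); send traffic to the passive copy.
    return SECONDARY

print("Routing traffic to:", pick_endpoint())
```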
Digital blackout
A large number of customers suffered a digital blackout,
a term used to describe a complete loss of digital services. The most affected
were those who relied only on the Central US region without a working DR
strategy. Another group of customers had a DR strategy implemented, but it was never
tested: the playbook was not kept up to date, and there was no mechanism to
notify the right people in the organisation about the reliability issues.
This blackout underscores the importance of robust disaster recovery strategies
and the need for regular testing and maintenance.
Customers who lacked IaC and deployment playbooks were
among the most affected. They could not restore their cloud environment even with
access to data backups, storage and workload images, and their systems remained
unavailable for well over a few hours.
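As a hedged sketch of why IaC matters here: with templates and an automated deployment script, the environment can be recreated in a secondary region and then pointed at the restored data. The resource group, Bicep template and target region below are hypothetical; only the standard Azure CLI commands (az group create, az deployment group create) are assumed.

```python
import subprocess

# Hypothetical names: the resource group, template file and target region are
# placeholders, not values from the incident.
RESOURCE_GROUP = "app-dr-rg"
DR_REGION = "eastus2"
TEMPLATE = "main.bicep"  # the same template used to deploy the primary region

def run(cmd: list[str]) -> None:
    """Run an Azure CLI command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Recreate the environment in the DR region from the same IaC template.
run(["az", "group", "create", "--name", RESOURCE_GROUP, "--location", DR_REGION])
run([
    "az", "deployment", "group", "create",
    "--resource-group", RESOURCE_GROUP,
    "--template-file", TEMPLATE,
    "--parameters", f"location={DR_REGION}",
])

# 2. The remaining steps (restoring data from geo-redundant backups, re-pointing
#    DNS) would follow the DR playbook and are not shown here.
```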
These organisations had the perfect recipe for disaster: rushing
to go live as soon as possible, keeping development costs minimal and
treating the support team as a secondary concern.
As the cloud expands and becomes part of our day-to-day lives, the
odds of things going wrong increase.
The same DR and failover strategies we use for on-premises
systems must be used for cloud solutions. Microsoft Azure and other cloud
vendors provide the tools and mechanisms to configure them much more easily,
but they are not free: they come with additional cloud infrastructure costs,
just as they would for an on-premises approach.
To keep the lights on even on blue days like this, you need
to ensure that IaC, automation, alerts, and the WAF best practices are part of your company's DNA.