Looking back, the 19th of July 2024 was a challenging day for customers using Microsoft Azure or Windows machines. Two significant outages affected customers using CrowdStrike Falcon or Microsoft Azure compute resources in the Central US region. In today's article, we'll focus on the Microsoft Azure outage.
These two outages affected many
people and put many businesses on pause for a few hours or even days.
A faulty CrowdStrike update affected Windows computers
globally, from airports and healthcare to small businesses, hitting over 8.5 million
machines. Even though the Falcon Sensor software defect was identified and
a fix deployed shortly after, the recovery took much longer. In parallel with CrowdStrike,
Microsoft provided a tool that helped customers fix the issue.
Around the same time, customers running their businesses in
the Central US region started to notice availability issues, connectivity issues, or
service management failures. The outage affected services such as Azure VMs,
Azure Cosmos DB, Azure SQL, and even Microsoft 365, Dynamics 365, and Microsoft
Entra ID (formerly Azure AD) because of the cross-dependencies.
The cause of the outage was a security layer in front of the storage
scale units that accepts disk operations only from specific ranges of network
addresses. The update and propagation mechanism for this allowlist was faulty: an incomplete
list of allowed network address ranges was broadcast, making some
storage scale units inaccessible from compute services in Azure
Central US. The propagation mechanism did not check the health of Azure VMs; it
continued to deploy across multiple availability zones, causing a regional outage
even for customers that followed availability zone best practices. This chain of
events underscores the intricate nature of cloud infrastructure and
the potential for widespread impact when a single component fails.
The impact was much lower for customers who used failover and DR strategies across their Microsoft Azure ecosystem. Azure Cosmos DB customers running with multi-region writes were not affected by the outage, in contrast with customers running single-region accounts in Central US, who were wholly or partially affected. Azure SQL DB customers with automatic failover policies configured had a similar experience: the geo-secondary was promoted to primary shortly after the incident started.
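To illustrate why failover groups made such a difference, here is a minimal sketch, assuming a hypothetical Azure SQL auto-failover group with the listener endpoint my-fog.database.windows.net (the group, database and credentials are placeholders, not values from the incident). An application that connects through the failover group listener rather than a region-specific server keeps working after the geo-secondary is promoted; it only needs to retry while the failover completes.

```python
import time
import pyodbc

# Hypothetical failover-group listener endpoint; the database name and
# credentials below are illustrative only.
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=tcp:my-fog.database.windows.net,1433;"  # listener, not a regional server
    "DATABASE=orders;UID=app_user;PWD=<secret>;"
    "Encrypt=yes;Connection Timeout=30;"
)

def query_with_retry(sql: str, retries: int = 5, backoff_s: float = 5.0):
    """Retry transient failures so the app rides out a geo-failover."""
    for attempt in range(1, retries + 1):
        try:
            with pyodbc.connect(CONN_STR) as conn:
                return conn.cursor().execute(sql).fetchall()
        except pyodbc.Error:
            if attempt == retries:
                raise
            # During failover the listener briefly refuses connections;
            # back off and try again instead of failing the request.
            time.sleep(backoff_s * attempt)

rows = query_with_retry("SELECT TOP 1 order_id FROM dbo.orders")
```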
Within the first minute of the outage, Microsoft detected unhealthy
Azure SQL DB nodes, triggering further investigation and mitigation actions. Service
monitoring detected drops in Azure VM availability after 26 minutes, and in
less than two hours all Azure SQL DB instances with geo-failover configured were
available again.
The Microsoft team's reaction was fast, and customers who
followed the reliability best practices recommended by the Azure Well-Architected
Framework (WAF) saw minimal impact on their business. The Azure WAF is a set
of design principles and best practices for building and operating reliable,
secure, and efficient systems in the cloud. It provides guidance on how to
design and implement your cloud solutions, and following its recommendations can
significantly improve the resilience of your systems. Most customers affected
by the outage lacked a DR strategy and relied only on services running in
the Central US region.
The customers with minimal impact on their business were the ones that:
- Implemented a strong DR strategy across their entire ecosystem (e.g. Azure SQL DB, Azure Cosmos DB)
- Reviewed and implemented the reliability recommendations of the Azure WAF according to their business needs
- Kept DR playbooks up to date and ran drills every few weeks or months
- Implemented an active-active or active-passive approach for the computation layer (see the sketch after this list)
- Configured dashboards and alerts for their Azure services and workloads
- Made sure the right people from the support team were notified by alerts and could take further action
- Used IaC, automation and up-to-date deployment playbooks
- Kept backups and copies of their data in a separate region
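As a hedged illustration of the active-passive idea for the computation layer, the sketch below probes a primary region endpoint and routes to a secondary one when the probe fails. The endpoints are hypothetical, and in a real setup this probing would typically be handled by Azure Front Door or Traffic Manager rather than hand-rolled code.

```python
import urllib.request
import urllib.error

# Hypothetical regional endpoints of the same workload; in practice these would
# sit behind Azure Front Door / Traffic Manager, which do this probing for you.
PRIMARY = "https://app-centralus.example.com/health"
SECONDARY = "https://app-eastus2.example.com/health"

def is_healthy(url: str, timeout_s: float = 3.0) -> bool:
    """Return True if the health endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def pick_endpoint() -> str:
    """Active-passive routing: use the primary region unless its probe fails."""
    if is_healthy(PRIMARY):
        return PRIMARY
    # Primary region is down (as Central US was); send traffic to the passive copy.
    return SECONDARY

print("Routing traffic to:", pick_endpoint())
```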
Digital blackout
A large number of customers suffered a digital blackout,
a term used to describe a complete loss of digital services. The most affected
were those who relied only on the Central US region without a working DR
strategy. Another group of customers had a DR strategy implemented, but it was never
tested: the playbook was not kept up to date, and there was no mechanism to
notify the right people in the organisation about the reliability issues.
This blackout underscores the importance of robust disaster recovery strategies
and the need for regular testing and maintenance.
Customers who lacked IaC and deployment playbooks were
among the most affected. They could not restore their cloud environment even with
access to data backups, storage and workload images, and their systems remained
unavailable for well over a few hours.
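As a hedged sketch of why IaC matters here: with templates and an automated deployment script, the environment can be recreated in a secondary region and then pointed at the restored data. The resource group, Bicep template and target region below are hypothetical; only the standard Azure CLI commands (az group create, az deployment group create) are assumed.

```python
import subprocess

# Hypothetical names: the resource group, template file and target region are
# placeholders, not values from the incident.
RESOURCE_GROUP = "app-dr-rg"
DR_REGION = "eastus2"
TEMPLATE = "main.bicep"  # the same template used to deploy the primary region

def run(cmd: list[str]) -> None:
    """Run an Azure CLI command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Recreate the environment in the DR region from the same IaC template.
run(["az", "group", "create", "--name", RESOURCE_GROUP, "--location", DR_REGION])
run([
    "az", "deployment", "group", "create",
    "--resource-group", RESOURCE_GROUP,
    "--template-file", TEMPLATE,
    "--parameters", f"location={DR_REGION}",
])

# 2. The remaining steps (restoring data from geo-redundant backups, re-pointing
#    DNS) would follow the DR playbook and are not shown here.
```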
These organisations had the perfect recipe for disaster: rushing
to go live as soon as possible, keeping development costs minimal and
treating the support team as a secondary concern.
As the cloud expands and becomes part of our day-to-day lives, the
odds of things going wrong increase.
The same DR and failover strategies we use for on-premises
systems must be used for cloud solutions. Microsoft Azure and other cloud
vendors provide the tools and mechanisms to configure them much more easily,
but they are not free: they come with additional cloud infrastructure costs,
just as they would for an on-premises approach.
To keep the lights on even on blue days like this, you need
to ensure that IaC, automation, alerts, and the WAF best practices are part of your company's DNA.