
19th of July, the digital blackout

Looking back, the 19th of July 2024 was a challenging day for customers using Microsoft Azure or Windows machines. Two significant outages affected customers using CrowdStrike Falcon or Microsoft Azure compute resources in the Central US region. In today's article, we’ll focus on the Microsoft Azure outage.



These two outages affected many people and put many businesses on pause for a few hours or even days.

 CrowdStrike Outage

A faulty CrowdStrike update affected Windows computers globally, from airports and healthcare to small businesses, hitting over 8.5 million machines. Even though the Falcon Sensor software defect was identified and a fix was deployed shortly after, recovery took longer. In parallel with CrowdStrike, Microsoft provided a tool that helped customers remediate affected machines.

 Azure Storage Outage in Central US Region

Around the same time, customers running their businesses in the Central US region started to notice availability issues, connectivity problems, or service management failures. The outage affected services such as Azure VMs, Azure Cosmos DB, Azure SQL, and even Microsoft 365, Dynamics 365, and Microsoft Entra ID because of cross-dependencies.

The cause of the outage was a security layer on the storage scale units that accepts disk operations only from specific ranges of network addresses. The update and propagation mechanism for this layer was faulty: an incomplete list of allowed network address ranges was broadcast, leaving some storage scale units unreachable from compute services in Azure Central US. The propagation mechanism did not check the health of Azure VMs and continued to deploy across multiple availability zones, causing a regional outage even for customers that followed availability zone best practices. This chain of events underscores the intricate nature of cloud infrastructure and the potential for widespread impact when a single component fails.
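
Conceptually, that security layer behaves like a simple allow-list check: disk operations are accepted only if the caller's address falls inside one of the published ranges, so an incomplete list silently rejects traffic from perfectly healthy compute nodes. The sketch below is purely illustrative; the ranges and addresses are made up:

    import ipaddress

    # Hypothetical allow-list of network ranges from which a storage scale
    # unit accepts disk operations. In the incident, the propagated list
    # was incomplete.
    allowed_ranges = [ipaddress.ip_network(r) for r in ("10.0.0.0/16", "10.2.0.0/16")]

    def is_allowed(source_ip: str) -> bool:
        """Return True if the source address falls inside any allowed range."""
        ip = ipaddress.ip_address(source_ip)
        return any(ip in net for net in allowed_ranges)

    # A VM whose range was dropped from the propagated list is rejected,
    # even though the VM itself is healthy.
    print(is_allowed("10.0.4.7"))   # True  - range is on the list
    print(is_allowed("10.1.4.7"))   # False - range missing, disk I/O fails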

The impact was much lower for customers who used failover and DR strategies across their Microsoft Azure ecosystem. Azure Cosmos DB customers running with multi-region writes were not affected by the outage, in contrast with customers running a single-region configuration in Central US, who were wholly or partially affected. Azure SQL DB customers with automatic failover policies configured had a similar experience: the geo-secondary was promoted to become the new primary after the incident.
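
As a rough illustration of the multi-region write approach, a Cosmos DB client can be given a list of preferred regions so that requests fall back to a secondary region when Central US is unavailable. This is only a sketch: the account URL, key, database, container, and region names are placeholders, and the preferred_locations and multiple_write_locations keyword arguments assume a recent 4.x version of the azure-cosmos Python SDK:

    from azure.cosmos import CosmosClient

    # Placeholder endpoint and key - replace with your own account values.
    ACCOUNT_URL = "https://<your-account>.documents.azure.com:443/"
    ACCOUNT_KEY = "<your-account-key>"

    # Prefer Central US, but let the client fall back to East US 2 when the
    # primary region is unavailable. With multi-region writes enabled on the
    # account, writes can also be served from the secondary region.
    client = CosmosClient(
        ACCOUNT_URL,
        credential=ACCOUNT_KEY,
        preferred_locations=["Central US", "East US 2"],
        multiple_write_locations=True,
    )

    container = client.get_database_client("orders-db").get_container_client("orders")

    # During a regional outage, the SDK retries the request against the next
    # preferred location instead of failing outright.
    item = container.read_item(item="order-1", partition_key="customer-42")

Azure SQL DB achieves the same effect with auto-failover groups: applications connect through the failover group listener, which always points at whichever replica is currently the primary.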

 Limited or zero outage

Within the first minute of the outage, Microsoft detected unhealthy Azure SQL DB nodes, triggering further investigation and mitigation actions. Service monitoring detected drops in Azure VM availability after 26 minutes. In less than two hours, all Azure SQL DB instances with geo-failover configured were restored.

The Microsoft team's reaction was fast, and customers who followed the reliability best practices recommended by the Azure Well-Architected Framework (WAF) saw minimal impact on their business. The Azure WAF is a set of design principles and best practices for building and operating reliable, secure, and efficient systems in the cloud. It provides guidance on how to design and implement cloud solutions, and following its recommendations can significantly improve the resilience of your systems. Most customers affected by the outage lacked a DR strategy and relied only on the services provided by the Central US region.

Customers with minimal impact on their business were the ones that:

  • Implemented a strong DR strategy across their whole ecosystem (e.g. Azure SQL DB, Azure Cosmos DB)
  • Reviewed and implemented the reliability strategies of the Azure WAF according to their business needs
  • Kept DR playbooks up to date and ran drills every few weeks or months
  • Implemented an active-active or active-passive approach for the compute layer (see the sketch after this list)
  • Configured dashboards and alerts for their Azure services and workloads
  • Made sure alerts notified the right people on the support team, who could take further action
  • Used IaC, automation, and up-to-date deployment playbooks
  • Kept backups and copies of data in a separate region
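
To make the compute-layer point concrete, the sketch below shows the decision logic behind an active-passive setup: probe the primary region and route traffic to the secondary when the probe fails. In practice this is usually delegated to services such as Azure Front Door or Traffic Manager; the endpoints here are hypothetical and the script only illustrates the idea:

    import urllib.error
    import urllib.request

    # Hypothetical regional endpoints for an active-passive deployment.
    PRIMARY = "https://app-centralus.example.com/health"
    SECONDARY = "https://app-eastus2.example.com/health"

    def is_healthy(url: str, timeout: float = 3.0) -> bool:
        """Probe a regional health endpoint; treat any error as unhealthy."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def pick_region() -> str:
        """Route traffic to the primary region, falling back to the secondary."""
        if is_healthy(PRIMARY):
            return PRIMARY
        # This is also the place to raise an alert so the right people are
        # notified that a failover happened.
        return SECONDARY

    if __name__ == "__main__":
        print("Serving traffic from:", pick_region())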

Digital blackout

A large number of customers suffered a digital blackout, a term used to describe a complete loss of digital services. Most affected were those who relied only on the Central US region without a working DR strategy. Another group of customers had a DR strategy in place, but it had never been tested: the playbook was not kept up to date, and there were no mechanisms to notify the right people in the organisation that there were reliability issues. This blackout underscores the importance of robust disaster recovery strategies and the need for regular testing and maintenance.

Customers who lacked IaC and deployment playbooks were the most affected. They could not restore their cloud environment even with access to data backups, storage, and workload images. Their systems remained unavailable for well over a few hours.

 My recommendation

In the rush to go live as soon as possible, to keep development costs minimal, and to treat the support team as a second-class concern, these organisations had the perfect recipe for disaster.

As the cloud expands and becomes part of our day-to-day lives, the odds of things going wrong increase.

The same DR and failover strategies we use for on-premises systems must be applied to cloud solutions. Microsoft Azure and other cloud vendors give us tools and mechanisms that make them much easier to configure, but they are not free: they come with additional cloud infrastructure costs, just as they would in an on-premises approach.

To keep the lights on even on blue days like this, you need to ensure that IaC, automation, alerts, and the WAF are part of your company's DNA.
