19th of July, the digital blackout

Looking back, the 19th of July 2024 was a challenging day for customers using Microsoft Azure or Windows machines. Two significant outages affected customers using CrowdStrike Falcon or Microsoft Azure compute resources in the Central US region. In today's article, we’ll focus on the Microsoft Azure outage.



These two outages affected many people and put many businesses on pause for a few hours or even days.

 CrowdStrike Outage

A faulty CrowdStrike update affected over 8.5 million Windows computers globally, from airports and healthcare to small businesses. Although the Falcon Sensor software defect was identified and a fix deployed shortly after, the recovery took longer. In parallel with CrowdStrike, Microsoft provided a tool that helped customers fix the issue.

 Azure Storage Outage in Central US Region

Around the same time, customers running their businesses in the Central US Region started to report availability issues, connectivity issues, or service management failures. The outage affected services such as Azure VMs, Azure Cosmos DB, Azure SQL, and even Microsoft 365, Dynamics 365, and Microsoft Entra (AD) because of the cross-dependencies.

The cause of the outage was a security layer on the storage scale units that accepts disk operations only from specific ranges of network addresses. The mechanism that updates and propagates this list was faulty, and an incomplete list of allowed address ranges was broadcast, making some storage scale units inaccessible from compute services in Azure Central US. The propagation mechanism did not check for issues on Azure VMs, so it continued to deploy across multiple availability zones, causing a regional outage even for customers who followed availability zone best practices. This chain of events underscores the intricate nature of cloud infrastructure and the potential for widespread impact when a single component fails.
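
To make the failure mode concrete, here is a minimal sketch of an address-range allow list and a zone-by-zone rollout that stops as soon as a VM host loses access. It is a toy model, not Azure's implementation: the function names, zone names, and address ranges are all hypothetical and chosen only for illustration.

```python
import ipaddress


def is_allowed(source_ip, allowed_ranges):
    """Return True if source_ip falls inside any of the allowed CIDR ranges."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_ranges)


def staged_rollout(zones, new_allow_list, vm_hosts_by_zone, current_allow_list):
    """Apply new_allow_list one zone at a time, halting on the first unhealthy zone.

    Hypothetical health check: a zone is unhealthy if any of its VM hosts
    would be rejected by the list that was just applied.
    """
    active = {zone: current_allow_list for zone in zones}
    for zone in zones:
        active[zone] = new_allow_list  # apply the change to this zone only
        blocked = [
            host for host in vm_hosts_by_zone[zone]
            if not is_allowed(host, new_allow_list)
        ]
        if blocked:
            active[zone] = current_allow_list  # roll back and stop the rollout
            return {"halted_at": zone, "blocked_hosts": blocked, "active": active}
    return {"halted_at": None, "blocked_hosts": [], "active": active}


# Illustrative data (assumptions, not real Azure ranges).
COMPLETE = ["10.0.0.0/16", "10.1.0.0/16"]
INCOMPLETE = ["10.0.0.0/16"]  # one range is missing
HOSTS = {
    "zone-1": ["10.0.1.5"],
    "zone-2": ["10.1.2.7"],
    "zone-3": ["10.0.3.9"],
}

result = staged_rollout(["zone-1", "zone-2", "zone-3"], INCOMPLETE, HOSTS, COMPLETE)
print(result["halted_at"], result["blocked_hosts"])
```

In this sketch the incomplete list is stopped at the second zone and the third zone is never touched. Without the health gate, the same loop would have pushed the incomplete list to every zone.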

The impact was much lower for customers who used failover and DR strategies across their Microsoft Azure ecosystem. Azure Cosmos DB customers running a multi-region write configuration were not affected by the outage, in contrast with customers running in a single region configured for Central US, who were wholly or partially affected. Azure SQL DB customers with automatic failover policies configured had a similar experience: the geo-secondary was promoted to be the new primary after the incident.
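
The promotion of a geo-secondary can be pictured with a small sketch. This is a conceptual model only, not the Azure SQL DB or Azure Cosmos DB API: the `Replica` class, the `fail_over` function, and the `eastus2` secondary region are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    region: str
    role: str  # "primary" or "secondary"
    healthy: bool


def fail_over(replicas):
    """Return the region that should serve traffic, promoting a secondary if needed."""
    primary = next(r for r in replicas if r.role == "primary")
    if primary.healthy:
        return primary.region
    for candidate in replicas:
        if candidate.role == "secondary" and candidate.healthy:
            # Swap roles: the old primary becomes a secondary until it recovers.
            primary.role = "secondary"
            candidate.role = "primary"
            return candidate.region
    raise RuntimeError("no healthy secondary available")


# Hypothetical topology: primary in Central US, geo-secondary in another region.
replicas = [
    Replica(region="centralus", role="primary", healthy=False),
    Replica(region="eastus2", role="secondary", healthy=True),
]
serving_region = fail_over(replicas)
print(serving_region)
```

A customer with only the `centralus` replica would hit the `RuntimeError` branch, which is the single-region situation described above.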

 Limited or zero outage

After the first minute of the outage, Microsoft detected unhealthy Azure SQL DB nodes, triggering further investigation and mitigation actions. The service monitoring detected drops in Azure VM availability after 26 minutes. After less than 2 hours, all Azure SQL DB instances with geo-failover configured were restored.

The Microsoft team's reaction was fast, and customers who followed the reliability best practices recommended by the Azure Well-Architected Framework (WAF) saw minimal impact on their business. The Azure WAF is a set of design principles and best practices for building and operating reliable, secure, and efficient systems in the cloud. It provides guidance on how to design and implement your cloud solutions, and following its recommendations can significantly improve the resilience of your systems. Most customers affected by the outage lacked a DR strategy and relied only on the services provided by the Central US Region.

Customers with minimal impact on their business were the ones that:

  • Implemented a strong DR strategy across their whole ecosystem (e.g. Azure SQL DB, Azure Cosmos DB)
  • Reviewed and implemented the reliability strategies of Azure WAF according to their business needs
  • Kept DR playbooks up to date and ran drills every few weeks or months
  • Implemented an active-active or active-passive approach for the compute layer
  • Configured dashboards and alerts for Azure services and payloads
  • Made sure alerts notified the right people in the support team, who could then take further action
  • Used IaC, automation and up-to-date deployment playbooks
  • Kept backups and copies of data in a separate region
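
The alerting points in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an Azure Monitor configuration: the 99% threshold, the three-sample window, and the team names are hypothetical.

```python
def should_alert(availability_samples, threshold=0.99, min_breaches=3):
    """Fire only when the last min_breaches samples are all below the threshold.

    Requiring consecutive breaches avoids paging people for a single blip.
    """
    recent = availability_samples[-min_breaches:]
    return len(recent) == min_breaches and all(s < threshold for s in recent)


def route_alert(service, on_call):
    """Pick the team to notify for a service, falling back to a default team."""
    return on_call.get(service, on_call["default"])


# Hypothetical on-call mapping.
ON_CALL = {"sql": "dba-oncall", "default": "platform-oncall"}

samples = [1.0, 0.98, 0.97, 0.95]
if should_alert(samples):
    print("notify", route_alert("sql", ON_CALL))
```

The routing table matters as much as the threshold: an alert that fires but reaches nobody who can act on it does not shorten the outage.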

Digital blackout

A large number of customers suffered from a digital blackout, a term used to describe a complete loss of digital services. Most affected were those who relied only on the Central US Region without a working DR strategy. Another group of customers had a DR strategy in place, but it had not been tested: the playbook was not kept up to date, and there was no mechanism to notify the right people in the organisation about reliability issues. This blackout underscores the importance of robust disaster recovery strategies and the need for regular testing and maintenance.

Customers who lacked IaC and deployment playbooks were the most affected. They could not restore the cloud environment even with access to the data backups, storage and payload images. Their systems were unavailable for more than a few hours.

 My recommendation

These organisations had the perfect recipe for disaster: a rush to go live as soon as possible, development costs kept to a minimum, and a support team treated as a secondary concern.

As clouds expand and become part of our day-to-day lives, the odds of things going wrong increase.

The same DR and failover strategies we use for on-premises systems must be used for cloud solutions. Microsoft Azure and other cloud vendors provide us with the tools and mechanisms to configure them much more easily, but they are not free. They come with an additional cloud infrastructure cost, just as they would in an on-premises approach.

To keep the lights on even on blue days like this, you need to ensure that IaC, automation, alerts, and WAF are part of your company's DNA.
