
Navigating Cloud Strategy after Azure Central US Region Outage

Looking back, July 19, 2024, was a challenging day for customers using Microsoft Azure or Windows machines. Two major outages hit customers using CrowdStrike Falcon or Microsoft Azure computation resources in the Central US region, affecting many people and putting many businesses on pause for a few hours or even days.

The overlap of these two issues was a nightmare for travellers. In addition to blue screens on the airport terminals, they could not get information from the airport website, airline personnel, or the support line, because those channels were themselves hit by the Central US region outage or the CrowdStrike outage.

But what happened in reality?

A faulty CrowdStrike update affected Windows computers globally, from airports and healthcare to small businesses, hitting over 8.5 million machines. Even though the Falcon Sensor software defect was identified and a fix was deployed shortly after, recovery took much longer because many machines needed manual intervention. In parallel with CrowdStrike, Microsoft provided a recovery tool that helped customers fix the issue.

Around the same time, customers running their businesses in the Central US region started to report availability problems, connectivity issues, and service management failures. The outage affected computation, database services, and, because of cross-dependencies, even Microsoft 365, Dynamics 365, and Microsoft Entra ID. The cause of the Microsoft Azure Central US outage was a security component of Azure Storage that created a domino effect.

The Microsoft team's reaction was fast, and customers who followed the reliability best practices recommended by the Azure Well-Architected Framework (WAF) saw minimal impact on their business. Within the first minute of the outage, Microsoft detected unhealthy Azure SQL DB nodes, triggering further investigation and mitigation actions. Service monitoring detected drops in Azure VM availability after 26 minutes. In less than 2 hours, all Azure SQL DB instances with geo-failover configured were restored in another US region.
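To make the geo-failover point concrete, here is a minimal sketch of creating an auto-failover group for Azure SQL DB with the azure-mgmt-sql Python package. The subscription, resource group, server, and database names are placeholders, and the exact request shape should be verified against the SDK version you use.

```python
# Minimal sketch: create an auto-failover group so that, during a regional
# outage, databases fail over to a partner server in another region.
# All names and IDs below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import (
    FailoverGroup, FailoverGroupReadWriteEndpoint, PartnerInfo)

subscription_id = "<subscription-id>"
client = SqlManagementClient(DefaultAzureCredential(), subscription_id)

# Assumed setup: primary server in Central US, partner server in East US 2.
partner_server_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-dr"
    "/providers/Microsoft.Sql/servers/sql-eastus2")
database_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-main"
    "/providers/Microsoft.Sql/servers/sql-centralus/databases/orders")

poller = client.failover_groups.begin_create_or_update(
    resource_group_name="rg-main",
    server_name="sql-centralus",
    failover_group_name="fg-orders",
    parameters=FailoverGroup(
        # Automatic failover after a grace period: a small amount of
        # potential data loss is traded for availability.
        read_write_endpoint=FailoverGroupReadWriteEndpoint(
            failover_policy="Automatic",
            failover_with_data_loss_grace_period_minutes=60),
        partner_servers=[PartnerInfo(id=partner_server_id)],
        databases=[database_id]))
print(poller.result().replication_state)
```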

Customers with minimal impact on their business were the ones that:

  • Implemented a strong DR strategy across their entire ecosystem (e.g., Azure SQL DB, Azure Cosmos DB).
  • Reviewed and implemented the reliability strategies of the Azure WAF according to their business needs.
  • Kept DR playbooks up to date and ran drills every few weeks or months.
  • Implemented an active-active or active-passive approach at the computation layer.
  • Configured dashboards and alerts for Azure services and workloads (see the monitoring sketch after this list).
  • Made sure alerts notified the right people on the support team, who could take further action.
  • Maintained IaC, automation, and up-to-date deployment playbooks.
  • Kept backups and copies of data in a separate region.
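
As a vendor-neutral illustration of the monitoring and alerting bullets, the sketch below probes regional health endpoints and pages an on-call channel on failure. The endpoints and webhook URL are hypothetical; in a real Azure setup this role is usually played by Azure Monitor alert rules.

```python
# Minimal, vendor-neutral sketch of a health probe feeding an alert channel.
# Endpoint and webhook URLs are hypothetical placeholders.
import requests

REGION_ENDPOINTS = {
    "centralus": "https://app-centralus.example.com/health",
    "eastus2": "https://app-eastus2.example.com/health",
}
ALERT_WEBHOOK = "https://alerts.example.com/hooks/on-call"  # hypothetical

def check_region(region: str, url: str, timeout: float = 5.0) -> bool:
    """Return True when the regional deployment answers its health check."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def notify_on_call(message: str) -> None:
    """Push an alert so the right people on the support team are paged."""
    requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=5.0)

if __name__ == "__main__":
    for region, url in REGION_ENDPOINTS.items():
        if not check_region(region, url):
            notify_on_call(f"Health check failed in {region}: {url}")
```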

These two outages showed how fragile the digital ecosystem is. The butterfly effect generated by a slight disturbance in Azure Storage's security layer put many businesses on hold. It was a real test for companies without a strong reliability strategy. Most affected were those that relied only on the Microsoft Azure Central US region without a working DR strategy. The lack of infrastructure-as-code (IaC), automation, and a working disaster recovery playbook made the situation even harder. This blackout underscores the importance of robust disaster recovery strategies and the need for regular testing and maintenance.

Are multi-cloud and hybrid cloud the future because of it?

Running the same system across multiple cloud vendors is challenging. It comes with extra cost, longer timelines, and additional complexity, making it harder for a company to use each cloud vendor's key features and capabilities. It also drastically increases running and maintenance costs, making such systems less flexible and attractive. Even from the reliability point of view, things are more complicated and harder to maintain.

The hybrid and multi-cloud ecosystem is only at its beginning. The lack of database, storage, and computation products that are platform-agnostic yet optimised to run on multiple cloud vendors makes it difficult to find the right balance between value and complexity. This market is expected to grow, becoming more platform-agnostic on the customer side and more aware of vendor-specific capabilities at the platform layer.

Some specific industries and services require a hybrid or multi-cloud approach from the beginning. Payment, transportation, and healthcare services should rely on a strong reliability strategy involving at least two vendors (multi-cloud or hybrid cloud) for specific business streams.

Multi-cloud is too expensive, so what should I do?

The Microsoft Azure outage on July 19th affected organisations with a poor reliability strategy. In most situations, the business applications ran only in the Central US region, and no disaster recovery strategy existed. Reliability best practices were not implemented, disaster recovery playbooks were outdated, and the support teams were not ready to manage such situations. The lack of IaC and automation made the restoration procedure more complex and slower.

In most cases, the situation could have been avoided if customers had followed Microsoft's DR recommendations for each cloud service. Following the DR strategies of the Microsoft Azure Well-Architected Framework and aligning them with business-specific needs ensures that data and workloads are ready to run in a different Azure region. High availability can be accomplished using active-active or active-passive strategies.
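
As a small illustration of the active-passive idea, the sketch below prefers a primary regional endpoint and falls back to a secondary one when the primary is unreachable. Both endpoints are hypothetical; in practice, services such as Azure Front Door or Traffic Manager handle this routing at the platform level.

```python
# Minimal sketch of an active-passive access pattern: try the primary
# region first, fail over to the secondary. Endpoints are hypothetical.
import requests

PRIMARY = "https://api-centralus.example.com"   # hypothetical
SECONDARY = "https://api-eastus2.example.com"   # hypothetical

def get_with_failover(path: str, timeout: float = 3.0) -> requests.Response:
    """Return the first successful response across the two regions."""
    for base in (PRIMARY, SECONDARY):
        try:
            response = requests.get(f"{base}{path}", timeout=timeout)
            if response.status_code < 500:
                return response
        except requests.RequestException:
            continue  # region unreachable: try the next one
    raise RuntimeError("All regional endpoints are unavailable")

# Usage: orders = get_with_failover("/orders/42").json()
```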

A DR strategy that relies on data backups in a different region, automated restoration playbooks, and IaC can keep potential downtime to 1-2 hours. If your business can afford to be offline for a few hours, this approach provides a good balance between reliability and cloud infrastructure costs.
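
As an illustration, the sketch below codifies such a restoration playbook as an ordered sequence of steps. Every step body is a hypothetical stand-in for real IaC deployments and restore commands; the ordering, logging, and timing are the point.

```python
# Hypothetical sketch of an automated restoration playbook. Each step is a
# placeholder for real commands (deploying IaC templates, restoring a
# geo-redundant backup, redeploying workloads, repointing traffic).
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("dr-playbook")

def provision_infrastructure() -> None:
    log.info("Deploying IaC templates to the recovery region...")

def restore_database_backup() -> None:
    log.info("Restoring the latest geo-redundant database backup...")

def redeploy_applications() -> None:
    log.info("Deploying application workloads from CI artifacts...")

def switch_traffic() -> None:
    log.info("Repointing DNS/traffic routing to the recovery region...")

PLAYBOOK = [provision_infrastructure, restore_database_backup,
            redeploy_applications, switch_traffic]

if __name__ == "__main__":
    start = time.monotonic()
    for step in PLAYBOOK:
        step()
    log.info("Recovery completed in %.1f minutes",
             (time.monotonic() - start) / 60)
```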

How can I improve my reliability score?

Improving the reliability of your solution starts with an assessment. In addition to a readiness assessment of the infrastructure, applications, data, security, and compliance policies, the assessment needs to identify the dependency graph, the scaling strategy, and the automation maturity level. Understanding the business context and the impact of availability on the business is crucial.
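
One way to start on the dependency-graph part is to model services and their dependencies explicitly and flag anything hosted in a single region that other services depend on. The service names and regions below are invented for illustration.

```python
# Illustrative sketch: model the solution as a dependency graph and flag
# single-region services that others depend on (likely single points of
# failure). Service names and regions are invented.
DEPENDS_ON = {
    "web-frontend": ["api"],
    "api": ["sql-db", "entra-id"],
    "reporting": ["sql-db"],
    "sql-db": [],
    "entra-id": [],
}
REGIONS = {
    "web-frontend": {"centralus", "eastus2"},
    "api": {"centralus", "eastus2"},
    "reporting": {"centralus"},
    "sql-db": {"centralus"},       # single region: a likely SPOF
    "entra-id": {"global"},        # globally distributed service
}

def single_points_of_failure(deps, regions):
    """Flag services hosted in exactly one region that others depend on."""
    depended_upon = {d for targets in deps.values() for d in targets}
    return [s for s in depended_upon
            if len(regions[s]) == 1 and "global" not in regions[s]]

print(single_points_of_failure(DEPENDS_ON, REGIONS))  # ['sql-db']
```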

The main drivers should be the business needs, aligned with compliance and industry-specific characteristics. The recipe for success is a balanced reliability strategy, where acceptable downtime is considered and mitigated accordingly.

The future of hybrid & multi-cloud

The initial momentum after July 19th was behind multi-cloud strategies. However, a multi-cloud strategy only partially protects you and comes with additional risks and complexity. Indirect dependencies such as IAM, virtualisation, and storage platforms can be your platform's Achilles' heel.

External dependencies that can affect business continuity are another essential factor to consider.

The first step in the high-reliability journey is to follow each cloud provider's recommendations and align them with specific business needs. The second layer of protection is an active-active or active-passive strategy within the same cloud vendor. Expanding it across a multi-cloud or hybrid approach is the third layer of protection, which provides high reliability but is more complex and expensive.

Remember that even a multi-cloud approach is not bulletproof if it relies on the same software platform everywhere. An issue like the one reported by CrowdStrike could affect that shared platform in every cloud at once, creating a digital blackout across your systems.

A mitigation plan and recovery playbook are mandatory and make the difference between success stories and 404 businesses.
