404 errors are not only for on-premises systems. The business often assumes that moving to the cloud means 99.9999% availability, which leads to false expectations and a smaller budget for developing and running a high-availability solution. Public cloud vendors like AWS, Google Cloud, and Microsoft Azure invest heavily in fault tolerance and redundancy and offer many nines for each of their services. Even so, the reliability of a system also depends on how it is architected and how the solution is managed. The IT team remains responsible for how the cloud services are used, designed, and managed.
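To make the "nines" concrete, here is a minimal sketch that converts an availability target into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration, not part of any vendor SLA.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in seconds) allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for target in (99.9, 99.99, 99.9999):
    minutes = downtime_seconds_per_year(target) / 60
    print(f"{target}% allows about {minutes:.1f} minutes of downtime per year")
```

Three nines allow roughly 8.8 hours of downtime a year, while six nines allow roughly 32 seconds, which is why the gap between what the business expects and what was budgeted matters.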
Cloud services are designed to minimise downtime through data replication, automated failover policies, and availability zones. Statistically, though, any cloud service, and any cloud vendor, will be down at some point. As the number of cloud services we use and the load on them increase, we should expect specific services to have downtime at some point in the future. For example, an update to one cloud service can create a ripple effect across multiple services, as has already happened in the past.
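The effect of relying on more services can be sketched with a simple calculation: when every dependency must be up for the system to work, the availabilities multiply. The figures below are hypothetical and the model assumes failures are independent, which real outages often are not.

```python
from math import prod


def composite_availability(availabilities) -> float:
    """Availability of a chain in which every dependency must be up.

    Assumes failures are independent of each other.
    """
    return prod(availabilities)


# Hypothetical availability figures, for illustration only.
dependencies = {
    "compute": 0.9995,
    "database": 0.9999,
    "messaging": 0.999,
    "identity": 0.9999,
}

total = composite_availability(dependencies.values())
print(f"Composite availability: {total:.4%}")
```

The composite figure is always lower than the weakest dependency in the chain, so each service added to the critical path reduces the availability the system can promise.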
This does not mean you must protect yourself from every possible failure. Depending on the business impact and the type of service, the team needs to decide whether to design and build a failover mechanism. For example, global identity and access management (IAM) solutions, like Entra ID (formerly Azure Active Directory), are key services for most Microsoft Azure systems. Is it worth designing and building a failover solution for it? Can you afford for your services to be inaccessible for 1 hour in the next 3–4 years because of it? Which Entra ID features would you be unable to use if you designed an on-premises failover for it? What would it cost? Can you manage a global or regional outage? These are just a few of the questions you need to answer before jumping into solution mode.
Internal and external factors can influence the reliability of cloud services, from human errors to third-party dependencies, resource management, and even application design. The most common causes are human errors, such as a misconfiguration during an update or deployment, and reliance on a third-party integration that has an outage and creates a domino effect. Migrations to the cloud can also introduce reliability risks when they carry over a single point of failure, or when specific system layers are under- or over-provisioned, creating bottlenecks across the system. The lack of a multi-region deployment is another factor that reduces reliability during a regional outage, although multi-region deployment will not protect the system from an outage of a global service like Entra ID.
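The limit of multi-region deployment can be sketched with a small model: redundant regions improve availability, but a global dependency that every region shares puts a ceiling on the result. The numbers are hypothetical, and the model assumes regional failures are independent.

```python
def redundant_availability(region_availability: float, regions: int) -> float:
    """Availability when at least one of N regions must be up.

    Assumes regional failures are independent.
    """
    return 1 - (1 - region_availability) ** regions


def system_availability(
    region_availability: float, regions: int, global_dependency: float
) -> float:
    """Regions are redundant, but all of them need the same global service."""
    return redundant_availability(region_availability, regions) * global_dependency


# Hypothetical figures: 99.9% per region, 99.99% for the shared global service.
for regions in (1, 2, 3):
    result = system_availability(0.999, regions, global_dependency=0.9999)
    print(f"{regions} region(s): {result:.5%}")
```

However many regions are added, the result stays below the availability of the shared global service, which is the point made above about an identity service outage.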
Cloud reliability can be achieved in multiple ways, from an architecture that covers redundancy, failover, and fault tolerance to a multi-region and multi-availability-zone deployment. This needs to be combined with a backup strategy and a strong monitoring system that detects failures and gives the support team, or an automated system, the context needed to trigger a failover action. Testing should be part of this journey: a chaos engineering approach can confirm that the strategy you defined actually works in practice.
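As a rough illustration of monitoring that triggers a failover action, the sketch below switches traffic to another endpoint after several consecutive failed health probes. `FailoverRouter`, the probe function, and the endpoint names are all hypothetical; a real setup would normally rely on a managed service rather than hand-written logic.

```python
from typing import Callable, Sequence


class FailoverRouter:
    """Switch to the next healthy endpoint after repeated probe failures."""

    def __init__(
        self,
        endpoints: Sequence[str],
        probe: Callable[[str], bool],
        failure_threshold: int = 3,
    ) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self.endpoints = list(endpoints)
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.active = 0  # index of the endpoint currently receiving traffic
        self.consecutive_failures = 0

    def check(self) -> str:
        """Run one probe cycle and return the endpoint that should get traffic."""
        if self.probe(self.endpoints[self.active]):
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self._fail_over()
        return self.endpoints[self.active]

    def _fail_over(self) -> None:
        """Move to the first other endpoint that answers its probe."""
        for offset in range(1, len(self.endpoints)):
            candidate = (self.active + offset) % len(self.endpoints)
            if self.probe(self.endpoints[candidate]):
                self.active = candidate
                self.consecutive_failures = 0
                return
        # No healthy candidate found: stay on the current endpoint.


# Simulated outage of the primary endpoint (a tiny stand-in for fault injection).
down = {"primary.example.test"}
router = FailoverRouter(
    ["primary.example.test", "secondary.example.test"],
    probe=lambda endpoint: endpoint not in down,
)
for _ in range(3):
    current = router.check()
print(current)
```

Requiring several consecutive failures before switching is a design choice that avoids failing over on a single transient error. Injecting the outage through the `down` set is a small-scale version of the chaos approach: the failover path is exercised on purpose instead of being trusted on paper.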
Cloud vendors provide a platform for delivering systems that are much better protected from failure. During the design, build, deployment, and management phases, we should keep in mind that cloud services will fail at some point, and we need to ensure that we mitigate this risk. The cloud is a robust platform that, used in the right way, can help our business, but we are responsible for protecting ourselves from cloud and system failures.
All Microsoft Azure services are well documented in terms of reliability, including the features and architectural approaches that support it. I would like to mention the Microsoft Azure services that help us increase reliability and fault tolerance:
- Azure Availability Zones
- Azure Traffic Manager
- Azure Load Balancer
- ASR (Azure Site Recovery)
- Azure Backup
- Azure Monitor
- Azure Cosmos DB
- AKS (Azure Kubernetes Service)
- Azure Application Gateway
- Azure Service Bus
- Azure Event Hubs