
Cloud Myths: Cloud is Always ON (Pill 5 of 5 / Cloud Pills)


404 errors are not only for on-premises systems. The business often assumes that moving to the cloud means 99.9999% availability, which creates false expectations and leads to a smaller budget for developing and running a high-availability solution. Public cloud vendors such as AWS, Google Cloud, and Microsoft Azure invest heavily in fault tolerance and redundancy and offer many nines for each of their services. However, the reliability of a system also depends on how it is architected and how the solution is managed. The IT team remains responsible for how the cloud services are designed, used, and managed.

Cloud services are designed to minimise downtime through data replication, automated failover policies, and availability zones. Statistically, however, any cloud service, and any cloud vendor, will be down at some point. As the number of cloud services and their usage increase, we should expect specific services to have downtime at some point in the future. For example, an update to one cloud service can create a ripple effect across multiple services, as has already happened in the past.
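To make the statistical point concrete, here is a minimal sketch of how availability compounds when a request depends on several services at once. The four availability figures are hypothetical, not published SLAs, and the calculation assumes that the services fail independently, which real outages do not always respect.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def composite_availability(availabilities):
    """Availability of a request path that needs every listed service to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


def expected_downtime_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Expected unavailable minutes over the given period."""
    return (1 - availability) * period_minutes


# Hypothetical figures for four services in one request path.
chain = [0.9999, 0.9995, 0.999, 0.9999]

total = composite_availability(chain)
print(f"Composite availability: {total:.4%}")
print(f"Expected downtime: {expected_downtime_minutes(total):.0f} minutes per year")
```

With these assumed numbers the composite lands at roughly 99.83%, which is lower than any single service in the chain. Every additional dependency pushes the number down a little further.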



It does not mean you must protect yourself from all possible failures. Depending on the business impact and the type of service, the team needs to decide whether to design and build a failover mechanism. For example, global identity and access management (IAM) solutions, like Microsoft Entra ID (formerly Azure Active Directory), are key services for most Microsoft Azure systems. Is it worth designing and building a failover solution for it? Can you afford your services being inaccessible for 1 hour in the next 3–4 years because of it? Which features of Entra ID would you be unable to use because you designed an on-premises failover for it? What is the cost of it? Can you manage a global or regional outage? These are just a few of the questions you need to answer before jumping into solution mode.
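The "1 hour in the next 3–4 years" question can be turned into an availability figure, which makes it easier to compare with what a failover design would have to beat. The sketch below is only arithmetic; the one-hour outage is the illustrative figure from the question above, not a measured or published value.

```python
HOURS_PER_YEAR = 365 * 24


def implied_availability(downtime_hours, years):
    """Availability implied by a total downtime over a number of years."""
    return 1 - downtime_hours / (years * HOURS_PER_YEAR)


def allowed_downtime_hours(availability, years=1):
    """Downtime budget, in hours, that a given availability target allows."""
    return (1 - availability) * years * HOURS_PER_YEAR


# One hour of outage spread over three or four years.
for years in (3, 4):
    print(f"{years} years: {implied_availability(1, years):.4%}")

# For comparison: what a "three nines" target allows in a single year.
print(f"99.9% allows {allowed_downtime_hours(0.999):.2f} hours per year")
```

One hour in 3–4 years works out to roughly 99.996% to 99.997% availability. A self-built failover only pays off if it can realistically do better than that, including its own maintenance and failure modes.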

Internal and external factors can influence the reliability of cloud services, from human errors to third-party dependencies, resource management, and even application design. The most common causes are human errors, such as a misconfiguration during an update or deployment, and reliance on a third-party integration whose outage creates a domino effect. Migrations to the cloud can also bring reliability risks when a single point of failure is carried over or when specific system layers are under- or over-provisioned, creating bottlenecks across the system. The lack of a multi-region deployment is another factor that reduces reliability during a regional outage, although a multi-region deployment will not protect the system against an outage of a global service like Entra ID.
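The limit of multi-region deployments can be shown with a small model. This is a sketch under simplifying assumptions: regions fail independently, any single healthy region can serve the traffic, and the availability numbers are hypothetical.

```python
def redundant_availability(regional, regions):
    """Availability when any one of `regions` independent regions is enough."""
    return 1 - (1 - regional) ** regions


def system_availability(regional, regions, global_dependency):
    """Redundant regions combined with one shared global dependency.

    The global dependency sits in series with the regions, so it caps
    the overall availability regardless of the number of regions.
    """
    return redundant_availability(regional, regions) * global_dependency


REGIONAL = 0.999          # hypothetical availability of one region
GLOBAL_SERVICE = 0.9999   # hypothetical availability of a global service

for regions in (1, 2, 3):
    total = system_availability(REGIONAL, regions, GLOBAL_SERVICE)
    print(f"{regions} region(s): {total:.5%}")
```

Going from one to two regions helps a lot in this model, but the result never exceeds the availability of the shared global dependency, which is why extra regions do not help during an outage of a global service.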

Cloud reliability can be achieved in multiple ways, from an architecture that covers redundancy, failover, and fault tolerance to a multi-region and multi-availability-zone deployment. This needs to be combined with a backup strategy and a strong monitoring system that detects failures and provides the proper context to the support team, or to an automated system, to trigger a failover action. Testing should be part of this journey: a chaos engineering approach can confirm that the strategy you defined actually works in practice.
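As a rough illustration of detection, failover, and chaos-style testing working together, the sketch below selects the first healthy endpoint from a priority list and then simulates outages to check the behaviour. The endpoint names and the `probe_with_outage` helper are hypothetical; a real setup would rely on the health probes of a managed service rather than on hand-written code.

```python
from typing import Callable, Iterable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
    attempts: int = 3,
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes a health probe.

    An endpoint is skipped only after `attempts` consecutive failed probes,
    so a single transient failure does not trigger a failover.
    """
    for endpoint in endpoints:
        if any(is_healthy(endpoint) for _ in range(attempts)):
            return endpoint
    return None  # nothing healthy: escalate to the support team


def probe_with_outage(down: Iterable[str]) -> Callable[[str], bool]:
    """Chaos-style stand-in for a real probe: endpoints in `down` always fail."""
    failed = frozenset(down)
    return lambda endpoint: endpoint not in failed


# Hypothetical endpoints, listed in failover priority order.
ENDPOINTS = ["primary.example.internal", "secondary.example.internal"]

normal = pick_endpoint(ENDPOINTS, probe_with_outage([]))
primary_down = pick_endpoint(ENDPOINTS, probe_with_outage(ENDPOINTS[:1]))
all_down = pick_endpoint(ENDPOINTS, probe_with_outage(ENDPOINTS))

print(normal, primary_down, all_down)
```

Injecting the outage through the probe function keeps the failover logic testable without touching real infrastructure. The same idea, applied to real environments, is what a chaos experiment verifies.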

Cloud vendors provide a platform for delivering systems that are much better protected from failure. During the design, build, deployment, and management phases, we should keep in mind that cloud services will fail at some point, and we need to mitigate this risk. Cloud is a robust platform that, used in the right way, can help our business, but we are responsible for protecting our systems from cloud and system failures.

All Microsoft Azure services are well documented with regard to reliability, including the related features and architecture approaches. I would like to mention the Microsoft Azure services that help us increase reliability and fault tolerance:

  • Azure Availability Zones
  • Azure Traffic Manager
  • Azure Load Balancer
  • ASR (Azure Site Recovery)
  • Azure Backup
  • Azure Monitor
  • Azure Cosmos DB
  • AKS (Azure Kubernetes Service)
  • Azure Application Gateway
  • Azure Service Bus
  • Azure Event Hubs

 
