
Designing HADR on Azure and how AI can help

 High availability and disaster recovery (HADR) is not a simple, one-time configuration. It requires a disciplined approach: identify possible failures, clarify business expectations, and select solutions that fulfill those requirements.

The process starts with two key objectives:
  • Recovery Time Objective (RTO): how long you can afford to be down after an outage.
  • Recovery Point Objective (RPO): how much data loss (in time) the business can tolerate.
These targets are set by the business, but they must be realistic. For example, if a restore from backup takes several hours, a two-hour RTO is not feasible. Define RTO and RPO for the application and its critical components, document them, and review them regularly.

IaaS or PaaS: adapt your HADR strategy

On Azure, the available HADR options differ depending on whether you run SQL Server on virtual machines (IaaS) or use managed services such as Azure SQL Database and Azure SQL Managed Instance (PaaS). With IaaS, you choose and configure SQL Server features such as Always On failover cluster instances (FCIs), availability groups, and log shipping. With PaaS, HADR is largely built in, and you enable the options the platform provides. For virtual machines, Azure offers three primary infrastructure options: Availability Sets, Availability Zones, and Azure Site Recovery.
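As a rough illustration of the feasibility check above, a recovery time budget can simply be summed against the RTO. The step names and durations below are hypothetical; measure your own restore and failover times:

```python
def rto_feasible(rto_minutes: float, recovery_steps: dict[str, float]) -> bool:
    """Return True if the summed recovery steps fit within the RTO.

    recovery_steps maps a step name to its expected duration in minutes.
    """
    return sum(recovery_steps.values()) <= rto_minutes


# Illustrative durations only -- not measurements from any real system.
steps = {
    "detect_and_declare_outage": 15,
    "restore_latest_backup": 180,
    "redirect_application_traffic": 15,
}
print(rto_feasible(120, steps))  # a 2-hour RTO is not feasible here -> False
print(rto_feasible(240, steps))  # a 4-hour RTO would fit -> True
```

The same arithmetic works for RPO: compare the backup or replication interval against the tolerated data loss.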
Availability Sets distribute virtual machines across fault and update domains to minimize the impact of hardware failures or planned maintenance. Availability Zones place workloads in physically separate datacenters within the same region (zones 1, 2, and 3). The two cannot be combined, so select one approach per deployment. If you replicate synchronously between nodes in different zones, test the network latency between them, because every synchronous commit waits on the network.
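As a back-of-the-envelope sketch of why that latency test matters, the figures below are assumptions, not Azure measurements:

```python
def sync_commit_latency_ms(local_commit_ms: float, replica_rtt_ms: float) -> float:
    """Estimate per-commit latency under synchronous replication.

    A synchronous commit waits for the secondary to harden the log record,
    so each commit pays at least one network round trip on top of the
    local commit cost. All figures here are illustrative assumptions.
    """
    return local_commit_ms + replica_rtt_ms


# Example: 1 ms local commit plus a 1.5 ms round trip between zones.
print(sync_commit_latency_ms(1.0, 1.5))  # -> 2.5 (ms per commit)
```

Even a small per-commit overhead compounds quickly for chatty, write-heavy workloads, which is why measuring the real inter-zone round trip is worth the effort.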
For cross-region disaster recovery, Azure Site Recovery replicates virtual machines between regions. However, it operates at the VM level and is not aware of application or database transactions. As a result, it may meet an RTO target yet miss the RPO, depending on the workload and its data-consistency requirements.

Simplified configuration with PaaS

For Azure SQL Database and Azure SQL Managed Instance, Microsoft offers platform-driven options such as active geo-replication and auto-failover groups. Although the platform is built for high availability, applications still require robust engineering: connection retry logic, effective handling of transient faults, and well-defined operational procedures. Azure SQL also provides features such as Accelerated Database Recovery (ADR), which shortens database recovery times and is enabled by default in many configurations.

Networking is critical in hybrid environments

Hybrid designs that combine on-premises and Azure environments are common for migration and disaster recovery. In these cases, networking often determines whether RTO and RPO targets are met, and direct exposure of the database to the public internet should be avoided.
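The connection retry logic mentioned earlier can be sketched as exponential backoff with jitter. The error codes below are commonly cited transient Azure SQL errors (for example, 40613 "database unavailable"), but treat the list as illustrative rather than authoritative:

```python
import random
import time

# Illustrative transient error codes; consult the platform documentation
# for the authoritative list before relying on these values.
TRANSIENT_ERROR_CODES = {40197, 40501, 40613, 49918}


class TransientDbError(Exception):
    """Stand-in for a driver exception carrying a server error code."""

    def __init__(self, code: int):
        super().__init__(f"transient database error {code}")
        self.code = code


def with_retries(operation, max_attempts: int = 5, base_delay_s: float = 0.5):
    """Run `operation`, retrying transient failures with backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDbError as exc:
            if exc.code not in TRANSIENT_ERROR_CODES or attempt == max_attempts:
                raise  # non-transient error, or out of attempts
            delay = base_delay_s * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)
```

Jitter matters here: without it, many clients that failed together retry together, hammering the recovering database in synchronized waves.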
Hybrid recovery should be approached as a system design challenge rather than solely as a backup issue. Identity, DNS, routing, firewall rules, and dependency mapping must all be validated under failover conditions.
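One way to make that validation concrete is a scripted pre-failover check. The hostnames and ports below are placeholders for your own dependency map, not real endpoints:

```python
import socket


def check_dependency(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if `host` resolves and accepts a TCP connection on `port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers DNS failure, connection refusal, and timeout
        return False


# Placeholder dependency map -- replace with real hosts from your runbook.
dependencies = [
    ("sql-listener.dr.example.internal", 1433),
    ("dns-forwarder.dr.example.internal", 53),
]
report = {f"{host}:{port}": check_dependency(host, port)
          for host, port in dependencies}
```

Run the same script from the disaster recovery side of the network, not just from production: the whole point is to catch routes, firewall rules, and DNS entries that only exist on the primary side.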


Where AI helps

AI does not replace architectural fundamentals, but it can improve decision quality and help prevent errors:
  • Clarify requirements: translate general statements such as “we can’t be down” into measurable RTO and RPO targets with realistic cost considerations.
  • Find weak spots early: detect single points of failure, missing retry mechanisms, and hidden dependencies that may not fail over properly.
  • Improve testing: plan failover simulations, evaluate test outcomes, and prioritise the fixes that reduce risk fastest.
  • Enhance operations: improve incident response through anomaly detection, recommended runbooks, and more effective alerting to reduce unnecessary notifications.
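As a toy example of the alerting point above, an anomaly filter can page only on statistical outliers instead of every spike. The threshold and data are arbitrary, and real systems use far more sophisticated models:

```python
import statistics


def should_alert(history: list[float], current: float,
                 z_threshold: float = 3.0) -> bool:
    """Alert only when `current` is a clear outlier against `history`."""
    if len(history) < 5:
        return False  # not enough baseline data to judge
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean  # flat baseline: any deviation is notable
    return abs(current - mean) / stdev > z_threshold


baseline = [100, 102, 98, 101, 99, 100]  # e.g. commit latency in ms
print(should_alert(baseline, 103))  # ordinary variation -> False
print(should_alert(baseline, 250))  # clear outlier -> True
```

The design choice is deliberate: a threshold expressed in standard deviations adapts to each metric's normal variability, whereas a fixed cutoff either misses quiet metrics or spams noisy ones.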
Effective HADR is simple in principle: define RTO and RPO, select the appropriate IaaS or PaaS approach, design for failure, and perform regular testing. AI accelerates this process, increases consistency, and reduces blind spots, making resilience an ongoing practice rather than a static document.
