
Designing HADR on Azure and how AI can help

 High availability and disaster recovery (HADR) is not a simple, one-time configuration. It requires a disciplined approach: identify possible failures, clarify business expectations, and select solutions that fulfill those requirements.

The process starts with two key objectives:
  • Recovery Time Objective (RTO): how long you can afford to be down after an outage.
  • Recovery Point Objective (RPO): how much data loss (in time) the business can tolerate.
These targets are set by the business, but they must be realistic. For example, if a restore from backup takes several hours, a two-hour RTO is not feasible. Define RTO and RPO for the application and its critical components, document them, and review them regularly.

IaaS or PaaS: adapt your HADR strategy

On Azure, the available HADR options differ depending on whether you run SQL Server on virtual machines (IaaS) or use managed services such as Azure SQL Database and Azure SQL Managed Instance (PaaS). With IaaS, you choose and configure SQL Server features such as Always On failover cluster instances (FCIs), availability groups, and log shipping. With PaaS, HADR is largely built in, and you enable the options the platform provides. For virtual machines, Azure offers three primary infrastructure options: Availability Sets, Availability Zones, and Azure Site Recovery.
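As a rough illustration of the feasibility check above, a recovery time budget can simply be summed against the RTO. The step names and durations below are hypothetical; measure your own restore and failover times:

```python
def rto_feasible(rto_minutes: float, recovery_steps: dict[str, float]) -> bool:
    """Return True if the summed recovery steps fit within the RTO.

    recovery_steps maps a step name to its expected duration in minutes.
    """
    return sum(recovery_steps.values()) <= rto_minutes


# Illustrative durations only -- not measurements from any real system.
steps = {
    "detect_and_declare_outage": 15,
    "restore_latest_backup": 180,
    "redirect_application_traffic": 15,
}
print(rto_feasible(120, steps))  # a 2-hour RTO is not feasible here -> False
print(rto_feasible(240, steps))  # a 4-hour RTO would fit -> True
```

The same arithmetic works for RPO: compare the backup or replication interval against the tolerated data loss.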
Availability Sets distribute virtual machines across fault and update domains to minimize the impact of hardware failures or planned maintenance. Availability Zones place workloads in physically separate datacenters within the same region (zones 1, 2, and 3). The two cannot be combined, so select one approach per deployment. If you replicate synchronously between nodes in different zones, test the network latency between them, because every synchronous commit waits on the network.
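As a back-of-the-envelope sketch of why that latency test matters, the figures below are assumptions, not Azure measurements:

```python
def sync_commit_latency_ms(local_commit_ms: float, replica_rtt_ms: float) -> float:
    """Estimate per-commit latency under synchronous replication.

    A synchronous commit waits for the secondary to harden the log record,
    so each commit pays at least one network round trip on top of the
    local commit cost. All figures here are illustrative assumptions.
    """
    return local_commit_ms + replica_rtt_ms


# Example: 1 ms local commit plus a 1.5 ms round trip between zones.
print(sync_commit_latency_ms(1.0, 1.5))  # -> 2.5 (ms per commit)
```

Even a small per-commit overhead compounds quickly for chatty, write-heavy workloads, which is why measuring the real inter-zone round trip is worth the effort.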
For cross-region disaster recovery, Azure Site Recovery replicates virtual machines between regions. However, it operates at the VM level and is not aware of application or database transactions. As a result, it may meet an RTO target yet miss the RPO, depending on the workload and its data-consistency requirements.

Simplified configuration with PaaS

For Azure SQL Database and Azure SQL Managed Instance, Microsoft offers platform-driven options such as active geo-replication and auto-failover groups. Although the platform is built for high availability, applications still require robust engineering: connection retry logic, effective handling of transient faults, and well-defined operational procedures. Azure SQL also provides features such as Accelerated Database Recovery (ADR), which shortens database recovery times and is enabled by default in many configurations.

Networking is critical in hybrid environments

Hybrid designs that combine on-premises and Azure environments are common for migration and disaster recovery. In these cases, networking often determines whether RTO and RPO targets are met, and direct exposure of the database to the public internet should be avoided.
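The connection retry logic mentioned earlier can be sketched as exponential backoff with jitter. The error codes below are commonly cited transient Azure SQL errors (for example, 40613 "database unavailable"), but treat the list as illustrative rather than authoritative:

```python
import random
import time

# Illustrative transient error codes; consult the platform documentation
# for the authoritative list before relying on these values.
TRANSIENT_ERROR_CODES = {40197, 40501, 40613, 49918}


class TransientDbError(Exception):
    """Stand-in for a driver exception carrying a server error code."""

    def __init__(self, code: int):
        super().__init__(f"transient database error {code}")
        self.code = code


def with_retries(operation, max_attempts: int = 5, base_delay_s: float = 0.5):
    """Run `operation`, retrying transient failures with backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDbError as exc:
            if exc.code not in TRANSIENT_ERROR_CODES or attempt == max_attempts:
                raise  # non-transient error, or out of attempts
            delay = base_delay_s * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)
```

Jitter matters here: without it, many clients that failed together retry together, hammering the recovering database in synchronized waves.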
Hybrid recovery should be approached as a system design challenge rather than solely as a backup issue. Identity, DNS, routing, firewall rules, and dependency mapping must all be validated under failover conditions.
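One way to make that validation concrete is a scripted pre-failover check. The hostnames and ports below are placeholders for your own dependency map, not real endpoints:

```python
import socket


def check_dependency(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if `host` resolves and accepts a TCP connection on `port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers DNS failure, connection refusal, and timeout
        return False


# Placeholder dependency map -- replace with real hosts from your runbook.
dependencies = [
    ("sql-listener.dr.example.internal", 1433),
    ("dns-forwarder.dr.example.internal", 53),
]
report = {f"{host}:{port}": check_dependency(host, port)
          for host, port in dependencies}
```

Run the same script from the disaster recovery side of the network, not just from production: the whole point is to catch routes, firewall rules, and DNS entries that only exist on the primary side.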


Where AI helps

AI does not replace architectural fundamentals, but it can improve decision quality and help prevent errors:
  • Clarify requirements: translate general statements such as “we can’t be down” into measurable RTO and RPO targets with realistic cost considerations.
  • Find weak spots early: detect single points of failure, missing retry mechanisms, and hidden dependencies that may not fail over properly.
  • Improve testing: plan failover simulations, evaluate test outcomes, and prioritise the fixes that reduce risk fastest.
  • Enhance operations: improve incident response through anomaly detection, recommended runbooks, and more effective alerting to reduce unnecessary notifications.
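As a toy example of the alerting point above, an anomaly filter can page only on statistical outliers instead of every spike. The threshold and data are arbitrary, and real systems use far more sophisticated models:

```python
import statistics


def should_alert(history: list[float], current: float,
                 z_threshold: float = 3.0) -> bool:
    """Alert only when `current` is a clear outlier against `history`."""
    if len(history) < 5:
        return False  # not enough baseline data to judge
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean  # flat baseline: any deviation is notable
    return abs(current - mean) / stdev > z_threshold


baseline = [100, 102, 98, 101, 99, 100]  # e.g. commit latency in ms
print(should_alert(baseline, 103))  # ordinary variation -> False
print(should_alert(baseline, 250))  # clear outlier -> True
```

The design choice is deliberate: a threshold expressed in standard deviations adapts to each metric's normal variability, whereas a fixed cutoff either misses quiet metrics or spams noisy ones.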
Effective HADR is simple in principle: define RTO and RPO, select the appropriate IaaS or PaaS approach, design for failure, and perform regular testing. AI accelerates this process, increases consistency, and reduces blind spots, making resilience an ongoing practice rather than a static document.
