
Posts

Recent posts

Architectures and Lessons | When the cloud must run hundreds of thousands of vCPUs

 Running hundreds of thousands of virtual CPUs in the cloud requires a new approach. It involves not just more machines, but also different thinking, operations, and cost management. In this article, I present practical architecture patterns, real-world trade-offs, and operational lessons for teams evolving from small experiments to resilient, multi-cloud platforms for highly parallel workloads. Control plane and executors — a simple mental model A helpful model for these platforms is to separate them into two roles: the control plane and the executors. The control plane manages APIs, scheduling, authentication, metadata, and billing. Executors are the compute resources, such as VM pools, containers, or bare-metal servers. This separation is important for portability. If the control plane defines workloads and abstracts cloud-specific details behind adapters, you can connect multiple execution environments, including various clouds or on-premises clusters. The control plane should ...
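The control plane and executor split described in this excerpt can be illustrated with a minimal Python sketch. All names here (`Workload`, `Executor`, `InMemoryExecutor`, `ControlPlane`) are hypothetical and not taken from the article; the in-memory adapter stands in for a real cloud or on-premises integration, and the first-fit scheduling rule is only an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Workload:
    """Cloud-neutral job description owned by the control plane."""
    name: str
    vcpus: int


class Executor(Protocol):
    """Adapter interface that every execution environment implements."""
    name: str

    def free_vcpus(self) -> int: ...

    def submit(self, workload: Workload) -> str: ...


class InMemoryExecutor:
    """Stand-in adapter; a real one would call a cloud or cluster API."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self._free = capacity

    def free_vcpus(self) -> int:
        return self._free

    def submit(self, workload: Workload) -> str:
        # Reserve capacity and return a hypothetical job identifier.
        self._free -= workload.vcpus
        return f"{self.name}/{workload.name}"


class ControlPlane:
    """Schedules workloads without knowing any cloud-specific details."""

    def __init__(self, executors: list[Executor]) -> None:
        self._executors = executors

    def schedule(self, workload: Workload) -> str:
        # First-fit placement: use the first executor with enough capacity.
        for executor in self._executors:
            if executor.free_vcpus() >= workload.vcpus:
                return executor.submit(workload)
        raise RuntimeError(f"no executor can fit {workload.name}")


plane = ControlPlane([
    InMemoryExecutor("cloud-a", capacity=64),
    InMemoryExecutor("on-prem", capacity=256),
])
big_job = plane.schedule(Workload("render", vcpus=128))
small_job = plane.schedule(Workload("thumbnail", vcpus=32))
print(big_job)    # on-prem/render
print(small_job)  # cloud-a/thumbnail
```

Because the control plane only depends on the `Executor` interface, adding another cloud means writing one more adapter rather than changing the scheduler.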

Azure DevOps Server to Services: what really moves (and what doesn’t)

 In the last few months, I looked more closely at a topic many teams put off: moving from Azure DevOps Server (on‑prem) to Azure DevOps Services (cloud). On paper, it sounds simple – “lift and shift” – but in practice, it is a mix of automated migration and human coordination. What can be migrated? The good news is that the core platform data usually moves well when you use Microsoft’s Azure DevOps Data Migration Tool (DMT). You can bring across projects and collection configuration, Git repositories with full commit history, and Azure Boards data, such as work items, links, and attachments. Pipeline definitions (YAML and Classic) are migrated as definitions so that teams can see their pipelines in the cloud on Day 1. What cannot be migrated is important for defining expectations. Pipeline execution history (old runs, logs, artefacts) does not transfer to the cloud. Secrets are another big one: secret variables, tokens and passwords from variable groups or service connections are n...

Designing HADR on Azure and how AI can help

 High availability and disaster recovery (HADR) is not a simple, one-time configuration. It requires a disciplined approach: identify possible failures, clarify business expectations, and select solutions that fulfill those requirements. The process starts with two key objectives: Recovery Time Objective (RTO): how long you can afford to be down after an outage. Recovery Point Objective (RPO): how much data loss (in time) the business can tolerate. These targets are set by the business, but they must be realistic. For example, if backups require several hours, a two-hour RTO is not feasible. Define RTO and RPO for the application and its critical components, document them, and review them regularly. IaaS or PaaS: adapt your HADR strategy. On Azure, availability is different depending on whether you run SQL Server on virtual machines (IaaS) or use managed services like Azure SQL Database / Azure SQL Managed Instance (PaaS). With IaaS, you can choose SQL Server features such as Always...
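The feasibility point about RTO and RPO in this excerpt can be turned into a quick sanity check. The sketch below is a simplification under stated assumptions: recovery time is modelled as detection time plus restore time, and worst-case data loss is taken to be the interval between backups. The function names and sample numbers are hypothetical.

```python
def rto_is_feasible(rto_hours: float, detect_hours: float, restore_hours: float) -> bool:
    """Assumption: time to recover = time to detect + time to restore."""
    return detect_hours + restore_hours <= rto_hours


def rpo_is_feasible(rpo_minutes: float, backup_interval_minutes: float) -> bool:
    """Assumption: worst-case data loss equals the gap between backups."""
    return backup_interval_minutes <= rpo_minutes


# A restore that takes four hours cannot meet a two-hour RTO.
rto_ok = rto_is_feasible(rto_hours=2, detect_hours=0.5, restore_hours=4)
# Log backups every 15 minutes fit within a 30-minute RPO.
rpo_ok = rpo_is_feasible(rpo_minutes=30, backup_interval_minutes=15)
print(rto_ok, rpo_ok)  # False True
```

Running a check like this during each periodic review keeps the documented targets tied to measured restore times rather than estimates.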

Choosing the right “SQL flavour” on Azure

 When you're moving SQL to Azure, you'll face a lot of choices. Azure has several different ways you can run SQL, each with its own tradeoffs around control, compatibility, cost, and how much day-to-day work is involved. If you focus only on moving quickly, you might end up with higher costs down the road, run into missing features, or give yourself more maintenance headaches than expected. Before you start comparing the different products, step back and think about what you really need to accomplish. As you weigh your options, keep these three big requirements in mind: 1. Scalability and cost shape Decide early on if you'll need to scale up (make a server bigger) or scale out (spread across more servers). Scaling up is usually easier, while scaling out gives you more flexibility, but might mean changing your application and being more disciplined operationally. Think about your workload: if it's pretty steady, provisioned compute is a good fit; if it varies a lot, you'...

Why AI ROI is more volatile than classical IT projects

 Traditional IT solutions typically remain stable for years, requiring only occasional patches, scaling, or feature additions. In contrast, AI systems operate in a changing environment, so ROI must be monitored and recalculated regularly. Here are the three main reasons for the volatility of AI in Azure and other cloud projects: Models are frequently replaced or retired. In the Microsoft and Azure ecosystem, model families evolve rapidly. A model used last year may become unavailable or inferior. Even without application changes, costs, speed, and quality can fluctuate. Quality can change over time. New documents in SharePoint, evolving policies, and new user questions can impact performance. Without regular updates to your RAG pipeline, prompts, and evaluation sets, accuracy and trust decline, reducing system adoption. Costs can increase with usage. Token cost is only one factor; vector search, storage, observability, and human review become crucial as adoption grows. What appears...

AI ROI without hype: a practical way to measure value using risk adjustment + Azure Copilot example

Most people know what ROI means, but it’s harder to calculate for AI projects. The numbers are less predictable than with traditional platforms because many AI projects never reach stable production. IDC says only about 44% of custom AI apps and 53% of third-party AI apps make it from proof of concept to production. That’s why it’s important to look at ROI through a risk lens, not just cost versus benefit. One useful approach is to use a risk-adjusted formula: AI ROI = (AI Business Value Income / (Initial Investment + Annual Costs)) × Success Probability, where: AI Business Value Income (over N years): Consider a 2 to 3 year period and include both direct and indirect value: Direct: time saved, fewer tickets, higher conversion, lower fraud. Indirect: improved customer or employee experience and quicker decisions. For these, use measurable stand-ins like CSAT, churn, time to resolution, or hours saved, and estimate conservatively. Initial Investment: This covers more than just buil...
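The risk-adjusted formula in this excerpt can be sketched in a few lines of Python. The function name and the sample figures are hypothetical; the only number taken from the text is the roughly 44% success rate IDC reports for custom AI apps. The sketch also assumes that "Annual Costs" means the sum of yearly running costs over the same period used for the business value, which the excerpt does not state explicitly.

```python
def risk_adjusted_roi(
    business_value: float,
    initial_investment: float,
    annual_costs: list[float],
    success_probability: float,
) -> float:
    """AI ROI = value / (initial investment + annual costs) * success probability."""
    if not 0.0 <= success_probability <= 1.0:
        raise ValueError("success_probability must be between 0 and 1")
    # Assumption: annual costs are summed over the evaluation period.
    total_cost = initial_investment + sum(annual_costs)
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return business_value / total_cost * success_probability


# Hypothetical two-year case for a custom AI app.
roi = risk_adjusted_roi(
    business_value=600_000,
    initial_investment=200_000,
    annual_costs=[50_000, 50_000],
    success_probability=0.44,
)
print(f"{roi:.2f}")  # 0.88
```

In this made-up case the unadjusted ratio is 2.0, but once the success probability is applied the result drops below 1, which is the kind of shift the risk lens is meant to expose.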