Posts

Recent posts

How I Prepare for Cloud Vendor Audits (Azure & AWS) - A Practical View from the Field

In the last few years, I was directly involved in five cloud vendor audits: three for Azure competencies (migration + modernisation) and two for AWS, including the migration-related one. After you do a few of them, you understand that an audit is not only about "having the right architecture". It is mainly about process, traceability, and evidence, and about being able to explain why you did something one way rather than exactly as in the vendor reference.

Below, I share the approach I use, grouped in phases: before preparation, during preparation, audit day, and after the audit. This is written in a very practical way, because in real life you don't win with theory; you win with organisation.

1) Before you start: set the foundation (and remove surprises)

Understand the requirements as a technical checklist, not a brochure. The first step is simple and hard at the same time: a deep understanding of the technical requirements for every section. Not only must the audit lead understand them, but the proj...
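The "checklist plus evidence" habit described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual audit format: the control IDs, field names, and helper function are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Requirement:
    """One audit checklist item: the control, its owner, and evidence links."""
    control_id: str              # hypothetical IDs, not a real vendor checklist
    description: str
    owner: str
    evidence: list = field(default_factory=list)  # links to docs, tickets, screenshots

    @property
    def ready(self) -> bool:
        # A requirement is audit-ready only once at least one piece of evidence exists.
        return len(self.evidence) > 0

def gaps(requirements):
    """Return the controls that still lack evidence, so the team can chase them early."""
    return [r.control_id for r in requirements if not r.ready]

reqs = [
    Requirement("MIG-01", "Documented migration runbook", "alice", ["wiki/runbook"]),
    Requirement("MIG-02", "Landing zone network diagram", "bob"),
]
print(gaps(reqs))  # -> ['MIG-02']
```

The point is not the code itself but the discipline: every requirement has an owner and a traceable piece of evidence, and the gap list is what you work through before audit day.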

Large-scale parallel computing - Checklist

Managing large-scale parallel workloads presents unique challenges beyond simply adding more machines. Success requires clear decision-making, effective automation, and early testing of complex components. The following flow outlines each step and why it matters.

Start with the numbers

Before anything else, agree on the measurable goals: how many vCPUs, what budget, how fast jobs must start and finish, and how tolerant you are of failures. These numbers keep conversations practical. If you can't measure it, it's hard to improve it.

What to do, and why

1. Define goals & NFRs. Document concurrency targets, scheduling latency, SLOs, and team budgets. Clear, specific goals ensure alignment across all stakeholders.
2. Split the control plane and executors. Treat the control plane as the system's core for APIs, policy, and billing, while executors handle compute and data tasks. Isolate cloud-specific logic with adapters to simplify future cloud integrations.
3. Design the data plan. Store ...
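"Start with the numbers" can be made concrete by writing the NFRs down as a typed record and checking observed runs against it. A minimal sketch, with purely illustrative targets (the field names and thresholds are assumptions, not recommendations):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterNFRs:
    # Illustrative targets only; real numbers come from your own goals and budget.
    max_vcpus: int                  # concurrency ceiling
    monthly_budget_usd: float
    p95_schedule_latency_s: float   # jobs must start within this time at p95
    max_failure_rate: float         # tolerated fraction of failed jobs

def breaches(nfrs: ClusterNFRs, observed_latency_s: float, observed_failure_rate: float):
    """Compare an observed run against the agreed NFRs; return the targets it missed."""
    missed = []
    if observed_latency_s > nfrs.p95_schedule_latency_s:
        missed.append("schedule latency")
    if observed_failure_rate > nfrs.max_failure_rate:
        missed.append("failure rate")
    return missed

targets = ClusterNFRs(max_vcpus=200_000, monthly_budget_usd=500_000,
                      p95_schedule_latency_s=30.0, max_failure_rate=0.01)
print(breaches(targets, observed_latency_s=45.0, observed_failure_rate=0.005))
# -> ['schedule latency']
```

Once the targets live in code, every load test can report against the same numbers, which keeps the conversation practical.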

Architectures and Lessons | When the cloud must run hundreds of thousands of vCPUs

Running hundreds of thousands of virtual CPUs in the cloud requires a new approach. It involves not just more machines, but also different thinking, operations, and cost management. In this article, I present practical architecture patterns, real-world trade-offs, and operational lessons for teams evolving from small experiments to resilient, multi-cloud platforms for highly parallel workloads.

Control plane and executors — a simple mental model

A helpful model for these platforms is to separate them into two roles: the control plane and the executors. The control plane manages APIs, scheduling, authentication, metadata, and billing. Executors are the compute resources, such as VM pools, containers, or bare-metal servers. This separation is important for portability. If the control plane defines workloads and abstracts cloud-specific details behind adapters, you can connect multiple execution environments, including various clouds or on-premises clusters. The control plane should ...
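The adapter idea in the mental model above can be sketched as an abstract interface that the control plane talks to, with cloud-specific SDK calls hidden behind it. This is an illustrative skeleton only; the class and method names are hypothetical, and the fake adapter stands in for a real cloud SDK.

```python
from abc import ABC, abstractmethod

class ExecutorAdapter(ABC):
    """Cloud-specific details live behind this interface; the control
    plane only speaks in terms of abstract pools and jobs."""
    @abstractmethod
    def provision(self, vcpus: int) -> str: ...
    @abstractmethod
    def submit(self, pool_id: str, job: dict) -> str: ...

class FakeCloudAdapter(ExecutorAdapter):
    """In-memory stand-in for a real cloud SDK, useful for tests."""
    def __init__(self):
        self.pools = {}
        self.jobs = []
    def provision(self, vcpus):
        pool_id = f"pool-{len(self.pools)}"
        self.pools[pool_id] = vcpus
        return pool_id
    def submit(self, pool_id, job):
        self.jobs.append((pool_id, job))
        return f"job-{len(self.jobs)}"

class ControlPlane:
    """Owns scheduling and metadata; delegates compute to whichever adapter is plugged in."""
    def __init__(self, adapter: ExecutorAdapter):
        self.adapter = adapter
    def run(self, vcpus: int, job: dict) -> str:
        pool = self.adapter.provision(vcpus)
        return self.adapter.submit(pool, job)

cp = ControlPlane(FakeCloudAdapter())
print(cp.run(256, {"cmd": "simulate"}))  # -> job-1
```

Swapping `FakeCloudAdapter` for an Azure-, AWS-, or on-prem-backed implementation leaves the control plane untouched, which is exactly the portability argument made above.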

Azure DevOps Server to Services: what really moves (and what doesn’t)

In the last few months, I looked more closely at a topic many teams put off: moving from Azure DevOps Server (on-prem) to Azure DevOps Services (cloud). On paper, it sounds simple – "lift and shift" – but in practice, it is a mix of automated migration and human coordination.

What can be migrated? The good news is that the core platform data usually moves well when you use Microsoft's Azure DevOps Data Migration Tool (DMT). You can bring across projects and collection configuration, Git repositories with full commit history, and Azure Boards data, such as work items, links, and attachments. Pipeline definitions (YAML and Classic) are migrated as definitions, so teams can see their pipelines in the cloud on Day 1.

What cannot be migrated is just as important for setting expectations. Pipeline execution history (old runs, logs, artefacts) does not transfer to the cloud. Secrets are another big one: secret variables, tokens and passwords from variable groups or service connections are n...
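Because secret values do not move, a practical pre-migration step is to inventory which variables are marked as secret so someone can re-enter them after cutover. The sketch below runs over a hypothetical exported inventory (the group names, variable names, and `isSecret` field are illustrative, not the DMT's actual export format):

```python
# Hypothetical pre-migration inventory of variable groups; secret values
# will not survive the move, so we list what must be re-entered by hand.
variable_groups = [
    {"name": "prod-db", "variables": {
        "DB_HOST": {"value": "db.example.local", "isSecret": False},
        "DB_PASSWORD": {"value": None, "isSecret": True},
    }},
    {"name": "ci", "variables": {
        "API_TOKEN": {"value": None, "isSecret": True},
    }},
]

def secrets_to_recreate(groups):
    """Return (group, variable) pairs whose values will NOT survive the migration."""
    return [(g["name"], var)
            for g in groups
            for var, meta in g["variables"].items()
            if meta.get("isSecret")]

print(secrets_to_recreate(variable_groups))
# -> [('prod-db', 'DB_PASSWORD'), ('ci', 'API_TOKEN')]
```

Having this list ready before audit-style cutover weekends turns "where did the tokens go?" from a Day-1 incident into a scheduled task.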

Designing HADR on Azure and how AI can help

High availability and disaster recovery (HADR) is not a simple, one-time configuration. It requires a disciplined approach: identify possible failures, clarify business expectations, and select solutions that fulfil those requirements. The process starts with two key objectives:

Recovery Time Objective (RTO): how long you can afford to be down after an outage.
Recovery Point Objective (RPO): how much data loss (in time) the business can tolerate.

These targets are set by the business, but they must be realistic. For example, if backups require several hours, a two-hour RTO is not feasible. Define RTO and RPO for the application and its critical components, document them, and review them regularly.

IaaS or PaaS: adapt your HADR strategy

On Azure, availability is different depending on whether you run SQL Server on virtual machines (IaaS) or use managed services like Azure SQL Database / Azure SQL Managed Instance (PaaS). With IaaS, you can choose SQL Server features such as Always...
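The realism check on RTO/RPO targets is simple arithmetic, and writing it down avoids wishful thinking. A minimal sketch (function names and example numbers are illustrative):

```python
def rto_feasible(restore_hours: float, failover_hours: float, rto_hours: float) -> bool:
    """An RTO is only realistic if the slowest recovery path fits inside it."""
    return max(restore_hours, failover_hours) <= rto_hours

def rpo_feasible(backup_interval_minutes: float, rpo_minutes: float) -> bool:
    """Worst-case data loss equals the time since the last backup."""
    return backup_interval_minutes <= rpo_minutes

# Illustrative: a 2-hour RTO is not feasible if a full restore takes 4 hours.
print(rto_feasible(restore_hours=4.0, failover_hours=0.5, rto_hours=2.0))   # -> False
# 15-minute log backups can support a 30-minute RPO.
print(rpo_feasible(backup_interval_minutes=15, rpo_minutes=30))             # -> True
```

Run this check for each critical component; whenever it fails, either the target moves or the architecture does (faster restores, replicas, more frequent backups).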

Choosing the right “SQL flavour” on Azure

When you're moving SQL to Azure, you'll face a lot of choices. Azure offers several ways to run SQL, each with its own trade-offs around control, compatibility, cost, and how much day-to-day work is involved. If you focus only on moving quickly, you might end up with higher costs down the road, run into missing features, or give yourself more maintenance headaches than expected. Before you start comparing the different products, step back and think about what you really need to accomplish. As you weigh your options, keep these three big requirements in mind:

1. Scalability and cost shape

Decide early on whether you'll need to scale up (make a server bigger) or scale out (spread across more servers). Scaling up is usually easier, while scaling out gives you more flexibility but might mean changing your application and being more disciplined operationally. Think about your workload: if it's pretty steady, provisioned compute is a good fit; if it varies a lot, you'...
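The "steady vs. variable workload" judgement above can be framed as a rough heuristic. This is an illustrative sketch only: the thresholds are assumptions for the example, not official Azure guidance, and the real decision also depends on features, compatibility, and pricing tiers.

```python
def suggest_compute_model(avg_utilisation: float, peak_to_avg_ratio: float) -> str:
    """Rough heuristic (illustrative thresholds): steady, well-utilised workloads
    suit provisioned compute; spiky or mostly-idle ones suit serverless."""
    if peak_to_avg_ratio > 3 or avg_utilisation < 0.2:
        return "serverless"
    return "provisioned"

# A steady workload at 60% average utilisation with mild peaks:
print(suggest_compute_model(avg_utilisation=0.6, peak_to_avg_ratio=1.5))  # -> provisioned
# A mostly-idle dev database with occasional bursts:
print(suggest_compute_model(avg_utilisation=0.1, peak_to_avg_ratio=5.0))  # -> serverless
```

Treat the output as a starting point for the cost conversation, not a verdict; measured utilisation over a few representative weeks is what actually settles it.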