
Posts

Showing posts from 2026

From “Deployed” to “Compliant”: Automating Azure IaC Checks

Infrastructure as Code should make cloud delivery faster, but compliance checks usually appear too late: after deployment or at the end of the pipeline. Then teams discover the same issues again: naming does not follow conventions, mandatory tags are missing, TLS is not enforced, public access is enabled, diagnostics are not configured. None of this is new. Microsoft already provides strong guidance in the Well-Architected Framework (WAF) and Cloud Adoption Framework (CAF). The hard part is applying these rules consistently across many repos and many teams. This is why I built Azure IaC Compliance Checker: a small open-source CLI that checks Azure IaC before deployment, directly from code. Where the tool is available: it is on GitHub at vunvulear/azure-iac-checker (https://github.com/vunvulear/azure-iac-checker). Right now it’s early stage: there are no releases and no published packages, so the normal way to use it is to clone the reposit...
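To make the kind of checks listed above concrete, here is a minimal sketch of what a pre-deployment compliance rule could look like. This is illustrative only and not the tool's actual code or API; the tag set, the `st` naming prefix, and the resource shape are assumptions:

```python
# Illustrative sketch of a pre-deployment IaC compliance check.
# The rules and names below are hypothetical, not the checker's real API.

MANDATORY_TAGS = {"environment", "owner", "costCenter"}  # assumed tag policy

def check_resource(resource: dict) -> list[str]:
    """Return human-readable findings for one IaC resource definition."""
    findings = []
    name = resource.get("name", "")
    # CAF-style naming: assume storage accounts use an "st" prefix
    if (resource.get("type") == "Microsoft.Storage/storageAccounts"
            and not name.startswith("st")):
        findings.append(f"{name}: name does not follow the 'st' prefix convention")
    missing = MANDATORY_TAGS - set(resource.get("tags", {}))
    if missing:
        findings.append(f"{name}: missing mandatory tags {sorted(missing)}")
    props = resource.get("properties", {})
    if props.get("minimumTlsVersion") != "TLS1_2":
        findings.append(f"{name}: TLS 1.2 is not enforced")
    if props.get("allowBlobPublicAccess", True):
        findings.append(f"{name}: public blob access is enabled")
    return findings

resource = {
    "type": "Microsoft.Storage/storageAccounts",
    "name": "mystorage",
    "tags": {"environment": "dev"},
    "properties": {"minimumTlsVersion": "TLS1_0", "allowBlobPublicAccess": True},
}
for finding in check_resource(resource):
    print("FAIL:", finding)
```

Running rules like these in CI, against the code rather than the deployed resources, is what moves the feedback from "after deployment" to "before merge".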

How I Prepare for Cloud Vendor Audits (Azure & AWS) - A Practical View from the Field

In the last few years, I was directly involved in five cloud vendor audits: three for Azure competencies (migration + modernisation) and two for AWS, including the migration-related one. After you do a few of them, you understand that an audit is not only about “having the right architecture”. It is mainly about process, traceability, and evidence, and about being able to explain why you did something one way rather than exactly as in the vendor reference. Below, I share the approach I use, grouped into phases: before preparation, during preparation, audit day, and after the audit. This is written in a very practical way, because in real life you don’t win with theory, you win with organisation. 1) Before you start: set the foundation (and remove surprises). Understand the requirements as a technical checklist, not as a brochure. The first step is simple but hard: a deep understanding of the technical requirements for every section. Not only must the audit lead understand it, but the proj...

Large-scale parallel computing - Checklist

Managing large-scale parallel workloads presents unique challenges beyond simply adding more machines. Success requires clear decision-making, effective automation, and early testing of complex components. The following flow outlines each step and its importance.

Start with the numbers. Before anything else, agree on the measurable goals: how many vCPUs, what budget, how fast jobs must start and finish, and how tolerant you are to failures. These numbers keep conversations practical. If you can’t measure it, it’s hard to improve it.

What to do, and why:
1. Define goals & NFRs. Document concurrency targets, scheduling latency, SLOs, and team budgets. Clear, specific goals ensure alignment across all stakeholders.
2. Split control plane and executors. Treat the control plane as the system’s core for APIs, policy, and billing, while executors handle compute and data tasks. Isolate cloud-specific logic with adapters to simplify future cloud integrations.
3. Design the data plane. Store ...
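The "start with the numbers" step can be captured as structured, reviewable data rather than a slide. A minimal sketch, with hypothetical field names and made-up illustrative figures:

```python
# Sketch: record the agreed NFR targets as data so they can be reviewed
# and checked, not just discussed. Field names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class ParallelWorkloadNFRs:
    peak_vcpus: int                 # target concurrency, e.g. 200_000
    monthly_budget_usd: float       # agreed cost ceiling
    p95_schedule_latency_s: float   # how fast a job must start
    max_job_runtime_h: float        # how fast a job must finish
    tolerated_failure_rate: float   # fraction of tasks allowed to fail and retry

    def budget_per_vcpu_hour(self, hours_per_month: float = 730.0) -> float:
        """Back-of-envelope ceiling: the budget spread over peak capacity."""
        return self.monthly_budget_usd / (self.peak_vcpus * hours_per_month)

nfrs = ParallelWorkloadNFRs(
    peak_vcpus=200_000,
    monthly_budget_usd=1_500_000,
    p95_schedule_latency_s=30.0,
    max_job_runtime_h=4.0,
    tolerated_failure_rate=0.01,
)
print(f"max affordable $/vCPU-hour: {nfrs.budget_per_vcpu_hour():.4f}")
```

Even a crude derived number like budget-per-vCPU-hour immediately tells you which instance families and pricing models are in scope, which keeps the conversation practical.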

Architectures and Lessons | When the cloud must run hundreds of thousands of vCPUs

 Running hundreds of thousands of virtual CPUs in the cloud requires a new approach. It involves not just more machines, but also different thinking, operations, and cost management. In this article, I present practical architecture patterns, real-world trade-offs, and operational lessons for teams evolving from small experiments to resilient, multi-cloud platforms for highly parallel workloads. Control plane and executors — a simple mental model A helpful model for these platforms is to separate them into two roles: the control plane and the executors. The control plane manages APIs, scheduling, authentication, metadata, and billing. Executors are the compute resources, such as VM pools, containers, or bare-metal servers. This separation is important for portability. If the control plane defines workloads and abstracts cloud-specific details behind adapters, you can connect multiple execution environments, including various clouds or on-premises clusters. The control plane should ...
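The adapter idea described above can be sketched as an abstract executor interface that the control plane talks to, with one adapter per execution environment. The class and method names here are illustrative assumptions, not a prescribed API:

```python
# Sketch of the control-plane/executor split: the control plane depends only
# on an abstract adapter, so clouds and on-prem clusters are interchangeable.
from abc import ABC, abstractmethod

class ExecutorAdapter(ABC):
    @abstractmethod
    def provision(self, vcpus: int) -> str:
        """Create or resize a compute pool; return a pool identifier."""

    @abstractmethod
    def submit(self, pool_id: str, job_spec: dict) -> str:
        """Submit a workload to a pool; return a job identifier."""

class AzureVmssAdapter(ExecutorAdapter):
    def provision(self, vcpus: int) -> str:
        # a real adapter would call VM Scale Set APIs here
        return f"azure-pool-{vcpus}"

    def submit(self, pool_id: str, job_spec: dict) -> str:
        return f"azure-job-on-{pool_id}"

class OnPremAdapter(ExecutorAdapter):
    def provision(self, vcpus: int) -> str:
        return f"onprem-pool-{vcpus}"

    def submit(self, pool_id: str, job_spec: dict) -> str:
        return f"onprem-job-on-{pool_id}"

def run_everywhere(adapters: list[ExecutorAdapter], job: dict) -> list[str]:
    """Control-plane logic stays cloud-agnostic: same code, any executor."""
    return [a.submit(a.provision(1024), job) for a in adapters]

print(run_everywhere([AzureVmssAdapter(), OnPremAdapter()], {"cmd": "simulate"}))
```

The design choice that matters is that `run_everywhere` never imports anything cloud-specific; portability lives entirely in the adapters.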

Azure DevOps Server to Services: what really moves (and what doesn’t)

 In the last few months, I looked more closely at a topic many teams put off: moving from Azure DevOps Server (on‑prem) to Azure DevOps Services (cloud). On paper, it sounds simple – “lift and shift” – but in practice, it is a mix of automated migration and human coordination. What can be migrated? The good news is that the core platform data usually moves well when you use Microsoft’s Azure DevOps Data Migration Tool (DMT). You can bring across projects and collection configuration, Git repositories with full commit history, and Azure Boards data, such as work items, links, and attachments. Pipeline definitions (YAML and Classic) are migrated as definitions so that teams can see their pipelines in the cloud on Day 1. What cannot be migrated is important for defining expectations. Pipeline execution history (old runs, logs, artefacts) does not transfer to the cloud. Secrets are another big one: secret variables, tokens and passwords from variable groups or service connections are n...

Designing HADR on Azure and how AI can help

High availability and disaster recovery (HADR) is not a simple, one-time configuration. It requires a disciplined approach: identify possible failures, clarify business expectations, and select solutions that fulfill those requirements. The process starts with two key objectives: Recovery Time Objective (RTO): how long you can afford to be down after an outage. Recovery Point Objective (RPO): how much data loss (in time) the business can tolerate. These targets are set by the business, but they must be realistic. For example, if backups require several hours to restore, a two-hour RTO is not feasible. Define RTO and RPO for the application and its critical components, document them, and review them regularly. IaaS or PaaS: adapt your HADR strategy. On Azure, availability options differ depending on whether you run SQL Server on virtual machines (IaaS) or use managed services like Azure SQL Database / Azure SQL Managed Instance (PaaS). With IaaS, you can choose SQL Server features such as Always...
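The feasibility check described above (a two-hour RTO is impossible if restores take hours) can be expressed as two tiny functions. This is a sketch with illustrative numbers, not a substitute for a real DR test:

```python
# Sanity-check whether business RTO/RPO targets are feasible given the
# measured recovery characteristics of the chosen solution (illustrative).

def rto_feasible(target_rto_h: float, restore_time_h: float,
                 failover_time_h: float) -> bool:
    """The RTO must cover the slower of the available recovery paths."""
    return target_rto_h >= max(restore_time_h, failover_time_h)

def rpo_feasible(target_rpo_min: float, backup_interval_min: float,
                 replication_lag_min: float) -> bool:
    """Worst-case data loss is bounded by the better of backup cadence
    or replication lag."""
    return target_rpo_min >= min(backup_interval_min, replication_lag_min)

# The example from the text: multi-hour restores break a two-hour RTO.
print(rto_feasible(target_rto_h=2.0, restore_time_h=5.0, failover_time_h=0.1))
# With hourly backups but near-real-time replication, a 15-minute RPO holds.
print(rpo_feasible(target_rpo_min=15.0, backup_interval_min=60.0,
                   replication_lag_min=1.0))
```

The inputs (restore time, failover time, replication lag) should come from actual measurements, which is exactly why RTO/RPO targets need to be reviewed regularly.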

Choosing the right “SQL flavour” on Azure

When you're moving SQL to Azure, you'll face a lot of choices. Azure offers several different ways to run SQL, each with its own tradeoffs around control, compatibility, cost, and how much day-to-day work is involved. If you focus only on moving quickly, you might end up with higher costs down the road, run into missing features, or give yourself more maintenance headaches than expected. Before you start comparing the different products, step back and think about what you really need to accomplish. As you weigh your options, keep these three big requirements in mind: 1. Scalability and cost shape. Decide early on whether you'll need to scale up (make a server bigger) or scale out (spread across more servers). Scaling up is usually easier, while scaling out gives you more flexibility but might mean changing your application and being more disciplined operationally. Think about your workload: if it's pretty steady, provisioned compute is a good fit; if it varies a lot, you'...

Why AI ROI is more volatile than classical IT projects

 Traditional IT solutions typically remain stable for years, requiring only occasional patches, scaling, or feature additions. In contrast, AI systems operate in a changing environment, so ROI must be monitored and recalculated regularly. Here are the three main reasons for the volatility of AI in Azure and other cloud projects: Models are frequently replaced or retired. In the Microsoft and Azure ecosystem, model families evolve rapidly. A model used last year may become unavailable or inferior. Even without application changes, costs, speed, and quality can fluctuate. Quality can change over time. New documents in SharePoint, evolving policies, and new user questions can impact performance. Without regular updates to your RAG pipeline, prompts, and evaluation sets, accuracy and trust decline, reducing system adoption. Costs can increase with usage. Token cost is only one factor; vector search, storage, observability, and human review become crucial as adoption grows. What appears...

AI ROI without hype: a practical way to measure value using risk adjustment + Azure Copilot example

Most people know what ROI means, but it’s harder to calculate for AI projects. The numbers are less predictable than with traditional platforms because many AI projects never reach stable production. IDC says only about 44% of custom AI apps and 53% of third-party AI apps make it from proof of concept to production. That’s why it’s important to look at ROI through a risk lens, not just cost versus benefit. One useful approach is to use a risk-adjusted formula: AI ROI = (AI Business Value Income / (Initial Investment + Annual Costs)) × Success Probability, where: AI Business Value Income (over N years): consider a 2 to 3 year period and include both direct and indirect value. Direct: time saved, fewer tickets, higher conversion, lower fraud. Indirect: improved customer or employee experience and quicker decisions. For these, use measurable stand-ins like CSAT, churn, time to resolution, or hours saved, and estimate conservatively. Initial Investment: this covers more than just buil...
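The risk-adjusted formula above is straightforward to put into code. A minimal sketch, assuming annual costs accrue over the whole N-year horizon; all figures are made-up illustrations, not benchmarks:

```python
# Direct implementation of the risk-adjusted formula from the post:
# AI ROI = (AI Business Value Income / (Initial Investment + Annual Costs))
#          × Success Probability
# Assumption: annual costs are multiplied by the number of years considered.

def risk_adjusted_ai_roi(business_value: float, initial_investment: float,
                         annual_costs: float, years: int,
                         success_probability: float) -> float:
    total_cost = initial_investment + annual_costs * years
    return (business_value / total_cost) * success_probability

# Hypothetical 3-year case: $900k estimated value, $300k to build,
# $100k/year to run, and a ~50% chance of reaching stable production
# (roughly in line with the IDC figures quoted above).
roi = risk_adjusted_ai_roi(
    business_value=900_000,
    initial_investment=300_000,
    annual_costs=100_000,
    years=3,
    success_probability=0.5,
)
print(f"risk-adjusted ROI: {roi:.2f}")  # 900000 / 600000 * 0.5 = 0.75
```

Note how the success probability halves an otherwise healthy 1.5x return: this is exactly the effect that a naive cost-versus-benefit calculation hides.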

Private doesn't mean invisible - What enterprise AI chats really mean

 Many companies use AI tools such as ChatGPT Enterprise and Microsoft Copilot to raise efficiency and reduce repetitive tasks. However, it is essential to clarify the meaning of the “private” label. In an enterprise setting, “private” typically refers to daily sharing restrictions rather than absolute confidentiality. Organizations may still access these chats for governance, security, or legal reasons. ChatGPT Enterprise OpenAI states that, by default, ChatGPT Enterprise does not use business data (inputs and outputs) to train its models. Customers retain ownership and control over their data, including retention settings. OpenAI also maintains compliance with requirements such as GDPR through contractual agreements, such as a Data Processing Addendum (DPA). Within an enterprise workspace, “private chat” generally means chats are not shared with colleagues, but it does not guarantee that administrators cannot access them. Enterprise plans may use compliance tools such as the Compl...

AI Native cloud reference architecture on Microsoft Azure

After 17 years working with cloud technology, I’ve seen a clear pattern. AI projects rarely fail because the model is weak. More often, the problem is that the platform was built for traditional applications, not for AI. GenAI and agents add extra demands on the architecture. AI also brings unpredictable traffic and new security and governance challenges. Here’s a reference architecture I use when designing AI-native platforms on Microsoft Azure. It’s not a strict blueprint, but a practical structure to keep teams aligned and prevent surprises as the solution grows. User and API entry layer Start with a clear entry point. Focus on predictable performance, strong security, and access control. On Azure, many teams use Azure Front Door or Application Gateway for incoming traffic, then add Azure API Management to manage API exposure, throttling, authentication, and versioning. A common mistake is exposing AI endpoints directly to the internet. It might seem quick for a proof of concept, bu...

Azure Governance that scales: guardrails for fast and safe delivery

 For large organizations, Azure success depends on solid governance, clear requirements, planned initiatives, and business priorities. Start with a clear hierarchy to apply rules consistently across the organization, not just to individual projects. First, I set up core elements: management groups, subscriptions, resource groups, and then resources. This structure is practical and important for scaling access and compliance controls. Management groups matter if you have multiple subscriptions and want a uniform baseline. I keep them shallow, three to four levels, since more are hard to manage. Azure allows up to six (excluding the tenant root and subscription level). Assignments at higher levels cascade down, so hierarchy matters. I use subscriptions as boundaries for billing and scaling. Splitting development, testing, and production into separate subscriptions isolates costs and risks. A dedicated subscription for shared network services, such as ExpressRoute or Virtual WAN, simp...

What a company needs to be able to deliver Cloud AI Native solutions

Cloud AI-Native delivery means turning AI from a basic demonstration into a scalable platform. This requires modern cloud infrastructure, up-to-date and well-organized data, engineering practices suitable for operating AI at scale, and processes to ensure AI is used safely and responsibly. So, what does a company actually need to do to make this work? Build platforms, not just projects. A company must design and build reusable foundations: reliable frameworks that support multiple products and teams, rather than isolated one-off projects. This means the company must be able to create reference architectures, standard templates, clear approaches, and clear processes for how teams work. Security, cost control, and operational monitoring must be built into the platform design from the start, not added later. Modernise applications, not just move them. A company must migrate from lift-and-shift systems to cloud-native ones. This calls for skills in refactoring, containerisation, breaking mon...