Architectures and Lessons | When the Cloud Must Run Hundreds of Thousands of vCPUs

Running hundreds of thousands of virtual CPUs in the cloud requires a new approach: not just more machines, but a different way of thinking about architecture, operations, and cost management. In this article, I present practical architecture patterns, real-world trade-offs, and operational lessons for teams evolving from small experiments to resilient, multi-cloud platforms for highly parallel workloads.

Control plane and executors — a simple mental model

A helpful model for these platforms is to separate them into two roles: the control plane and the executors. The control plane manages APIs, scheduling, authentication, metadata, and billing. Executors are the compute resources, such as VM pools, containers, or bare-metal servers.
This separation is important for portability. If the control plane defines workloads and abstracts cloud-specific details behind adapters, you can connect multiple execution environments, including various clouds or on-premises clusters. The control plane should remain small, reliable, and fast, while executors should be cost-effective, highly parallel, and efficient at transferring data to and from object storage.
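As a rough illustration of this split, here is a minimal Python sketch of a cloud-agnostic job description and the adapter interface each execution environment would implement. JobSpec, ExecutorAdapter, and their fields are illustrative names, not tied to any particular platform.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class JobSpec:
    """Cloud-agnostic description of what should run; owned by the control plane."""
    job_id: str
    image: str               # container image or binary reference
    vcpus: int
    input_uris: list[str]    # object-storage locations, never local paths


class ExecutorAdapter(Protocol):
    """One adapter per execution environment (cloud batch service, on-prem cluster, ...)."""

    def submit(self, spec: JobSpec) -> str: ...   # returns an executor-side handle
    def status(self, handle: str) -> str: ...     # e.g. "pending", "running", "done", "failed"
    def cancel(self, handle: str) -> None: ...
```

The control plane only ever talks to ExecutorAdapter; adding a new cloud or an on-premises cluster means writing another adapter, not changing the scheduler or the API.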

What breaks first at a very large scale

At large scales, several challenges emerge. Job submission throughput becomes critical, as thousands of submissions per second can overwhelm a basic API or single database. Relying on polling can cause event storms, flooding the control plane with millions of status messages. Data I/O often becomes a bottleneck; shared POSIX filesystems may fail under load, while object storage with parallel streams performs better. Scheduling latency is also significant, as slow placement increases overall runtime. Finally, without effective chargeback and tagging, costs can escalate and erode trust in the platform.
Operational noise also increases, as rare and complex failures become more frequent with additional components. Without strong automation and diagnostics, operating costs will rise.

Patterns that work in practice

Several patterns have proven practical and repeatable when designing for extreme scale. While not comprehensive solutions, they provide a solid foundation for further iteration.
First, maintain a lightweight and agnostic control plane. Keep it mostly stateless, using small, durable stores for metadata. The control plane should determine what runs, not how the cloud executes it. Abstract cloud differences behind adapters to reduce the cost of adding new environments.
Second, delegate data movement to the edges. Use signed URLs or pre-signed tokens so workers interact directly with object storage. This prevents the control plane from becoming a proxy and allows storage systems to scale. Break large inputs into multiple objects to enable parallel transfers.
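A minimal sketch of handing out short-lived URLs, assuming S3-compatible object storage accessed through boto3; the bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")


def presign_input(bucket: str, key: str, expires_s: int = 900) -> str:
    """Short-lived download URL the control plane hands to a worker."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_s,
    )


def presign_output(bucket: str, key: str, expires_s: int = 900) -> str:
    """Short-lived upload URL so results go straight to object storage."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_s,
    )

# Workers download and upload against these URLs directly, so no payload
# bytes ever pass through the control plane.
```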
Third, implement a hybrid scheduling approach. A single global scheduler can become a bottleneck. Instead, use a local allocator in each region or cluster for immediate placement, with a global coordinator managing policy and fairness. This ensures low-latency task placement and maintains multi-tenant correctness.
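A toy Python sketch of the two levels, with illustrative names: the global coordinator only enforces per-tenant quotas, while the local allocator makes the fast placement decision. A real system would replicate quota state and make the admit call asynchronous, but the division of responsibility is the point.

```python
from collections import defaultdict


class GlobalCoordinator:
    """Owns policy only: per-tenant vCPU quotas, not individual placements."""

    def __init__(self, quotas: dict[str, int]):
        self.quotas = quotas
        self.in_use = defaultdict(int)

    def admit(self, tenant: str, vcpus: int) -> bool:
        if self.in_use[tenant] + vcpus > self.quotas.get(tenant, 0):
            return False
        self.in_use[tenant] += vcpus
        return True

    def release(self, tenant: str, vcpus: int) -> None:
        self.in_use[tenant] -= vcpus


class LocalAllocator:
    """Runs inside each region or cluster and makes the low-latency placement."""

    def __init__(self, coordinator: GlobalCoordinator, free_slots: dict[str, int]):
        self.coordinator = coordinator
        self.free_slots = free_slots          # node -> free vCPUs

    def place(self, tenant: str, vcpus: int):
        if not self.coordinator.admit(tenant, vcpus):
            return None                       # over quota: queue or reject
        for node, free in self.free_slots.items():
            if free >= vcpus:
                self.free_slots[node] = free - vcpus
                return node
        self.coordinator.release(tenant, vcpus)  # no local capacity: give quota back
        return None
```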
Fourth, design for spot or interruptible instances to achieve cost efficiency at scale. Ensure tasks are restartable or support checkpointing and resuming. Combine spot and on-demand capacity to guarantee completion of critical jobs.
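One hedged sketch of a restartable task, assuming checkpoints live in S3-compatible object storage accessed through boto3; the `process` callable and item list stand in for the real unit of work.

```python
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def load_checkpoint(bucket: str, key: str) -> dict:
    """Resume from the last saved state, or start fresh if no checkpoint exists."""
    try:
        obj = s3.get_object(Bucket=bucket, Key=key)
        return json.loads(obj["Body"].read())
    except ClientError:
        return {"next_item": 0}


def run_restartable(items: list, process, bucket: str, key: str, every: int = 100) -> None:
    """Process items in order, checkpointing so a spot interruption loses little work."""
    state = load_checkpoint(bucket, key)
    for i in range(state["next_item"], len(items)):
        process(items[i])
        if (i + 1) % every == 0:   # periodic checkpoint: at most `every` items redone
            state["next_item"] = i + 1
            s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(state).encode())
    s3.put_object(Bucket=bucket, Key=key,
                  Body=json.dumps({"next_item": len(items)}).encode())
```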
Lastly, measure costs at the source. Tag resources and record job-level metrics within the platform, rather than relying only on cloud billing. This supports proactive limits and reliable chargeback.
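A minimal illustration of recording cost at the source: a job-level record the platform writes at completion time and aggregates per tenant. The per-vCPU price and field names are purely illustrative.

```python
from dataclasses import dataclass

# Illustrative on-demand price; real pricing varies by cloud, region and instance type.
VCPU_HOUR_USD = 0.04


@dataclass
class JobCostRecord:
    job_id: str
    tenant: str            # chargeback target
    vcpus: int
    runtime_hours: float

    @property
    def estimated_cost_usd(self) -> float:
        return self.vcpus * self.runtime_hours * VCPU_HOUR_USD


# Recorded by the platform itself, these numbers support proactive limits and
# chargeback days before the cloud bill arrives.
record = JobCostRecord(job_id="job-123", tenant="team-genomics", vcpus=64, runtime_hours=2.5)
print(f"{record.tenant}: ${record.estimated_cost_usd:.2f}")
```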

Choosing a scheduler — managed versus autonomous

Choosing between cloud-managed batch systems and self-managed schedulers involves trade-offs. Managed services offer rapid deployment and handle many operational tasks, but may introduce latency and functional constraints. Autonomous schedulers, whether open source or custom, provide full control and optimization for fast launches and specialized placement, but require greater operational effort, including server management, high availability, upgrades, and additional staffing.
In practice, a hybrid approach is often most effective. For many workloads, managed batch services are the simplest and most cost-effective choice. For large-scale or specialized requirements, such as unique CPU types, co-location, or specific network topologies, an autonomous pool offers greater control. Begin with a simple solution and add complexity only when necessary.

Data and I/O: the real bottleneck

At large scale, data movement is often more expensive than computing. For high-throughput workloads, object storage with many concurrent readers and writers is most effective. Avoid single large shared filesystems unless you have robust caching and parallel access layers.
Cache common inputs close to compute resources. If many jobs require the same large files, use a regional cache or CDN to reduce egress and accelerate startup times. Employ multipart uploads and parallel downloads on workers, and plan for data retention with short-term hot storage for new outputs and cold archival for older results.
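As an example of parallelizing transfers on the workers themselves, the sketch below uses boto3's TransferConfig to enable multipart transfers with several concurrent streams; the thresholds, bucket names, and paths are placeholders.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Split large objects into parts and move them over several parallel streams.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MiB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=16,                     # parallel streams per transfer
)

s3.download_file("inputs-bucket", "reference/dataset.bin", "/tmp/dataset.bin", Config=config)
s3.upload_file("/tmp/result.bin", "outputs-bucket", "job-123/result.bin", Config=config)
```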
When possible, design for small units of work. Smaller tasks restart more quickly after spot interruptions and limit the impact of failures.

Observability — you must automate operations

At extreme scale, manual operations are not feasible. The platform must be observable, and common responses should be automated.
Use event-driven status updates from workers, pushing state to an event bus or webhook instead of relying on polling. Aggregate metrics and implement basic anomaly detection for early warnings. Attach trace IDs to job lifecycles to track progress across components. Deploy control-plane changes with canary releases and enable automated rollbacks when issues occur.
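A small sketch of a worker pushing its own state transitions to an event endpoint, with a trace ID attached to every event; the endpoint URL and event fields are hypothetical.

```python
import json
import time
import urllib.request

EVENT_ENDPOINT = "https://events.example.internal/job-status"  # hypothetical event-bus ingress


def push_status(job_id: str, trace_id: str, state: str) -> None:
    """The worker pushes its own state transitions; nothing polls it."""
    event = {
        "job_id": job_id,
        "trace_id": trace_id,   # same ID flows through scheduler, worker and storage logs
        "state": state,         # e.g. "started", "checkpointed", "succeeded", "failed"
        "ts": time.time(),
    }
    req = urllib.request.Request(
        EVENT_ENDPOINT,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)


push_status("job-123", "trace-9f2c", "started")
```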
Codify runbooks and automate routine recoveries. The objective is not to remove human operators, but to allow them to focus on complex decisions rather than repetitive tasks.

Generative AI as a helper, not a hero

Generative models are valuable support tools for daily operations. Use them for tasks where pattern recognition or language processing is a bottleneck, such as recommending job placements based on historical telemetry, drafting runbooks for recurring incidents, or generating initial cost estimates. These tools save time and reduce errors when configured with guardrails, audit logs, and human approval steps. However, always maintain explainability; engineers must understand the rationale behind recommendations, especially when they affect costs or compliance.
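One possible shape for such guardrails, sketched in Python: every suggestion is logged together with its rationale, and an operator callback must approve it before anything executes. The types and fields are illustrative, not a specific product's API.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("genai-audit")


@dataclass
class Suggestion:
    source_model: str
    action: str                  # e.g. "move queue X to the spot pool in region Y"
    rationale: str               # the model's stated reasoning, kept for explainability
    estimated_saving_usd: float


def apply_with_guardrails(s: Suggestion, approve) -> bool:
    """Log every suggestion and require a human decision before executing it."""
    log.info("suggestion from %s: %s (%s)", s.source_model, s.action, s.rationale)
    if not approve(s):           # `approve` is a human-in-the-loop callback
        log.info("rejected by operator")
        return False
    log.info("approved; executing")
    # ... call the actual platform API here ...
    return True
```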

Trade-offs — accept iteration

Every decision involves trade-offs. Flexibility in design often increases operational costs. Portability may limit cloud-specific optimizations. Spot instances reduce expenses but add complexity. Strong consistency is straightforward but difficult to scale, while eventual consistency scales more easily but demands disciplined operations.
Recognize that iteration is necessary. Validate the riskiest assumptions early with a focused proof of concept, then expand. For many teams, testing data throughput and scheduling latency provides the most value; if these are satisfactory, other challenges can be addressed incrementally.

Practical checklist before you scale to 100k+ vCPUs

This checklist provides concrete items to verify before scaling to very large workloads. Use it as a guide for proof-of-concept validation and for identifying areas to strengthen later.
  • Separation: control plane and executors are clearly separated.
  • Event model: workers push events; avoid large-scale polling.
  • Data plan: signed URLs, multipart transfers, and regional caches for common inputs.
  • Scheduler design: two-level model (global policy, local fast allocator) or a proven autonomous scheduler.
  • Spot strategy: tasks are restartable or checkpointed; have some guaranteed capacity.
  • Cost tagging: job-level cost tracking and proactive chargeback.
  • Observability: tracing, aggregated metrics, and anomaly detection in place.
  • CI/CD: safe deployment pipeline with canaries and automated rollbacks.
  • Automation & runbooks: automated recovery for common failures.
  • GenAI guardrails: audit logs and human approval for critical suggestions.

Final words

Scaling to hundreds of thousands of vCPUs is both a technical and organizational challenge. It requires clear building blocks: a small, portable control plane, fast local allocation for low latency, event-driven worker updates, object storage for I/O, and disciplined cost management. By combining these with automation, observability, and thoughtful use of AI for decision support, you create a platform that is not only larger, but also more predictable and robust.
Begin by addressing the riskiest areas, iterate quickly, and ensure teams are supported with effective tools and runbooks. This approach is challenging but practical, and when successful, delivers transformative results.
