The cloud is often sold as an environment without limits, where you can have as many resources as you want. The truth is that every service offered by a cloud provider has limitations, like any normal service. For each service, these boundaries can be mitigated in different ways.
But what happens when we reach the limits of a service? For some services, there are best practices for scaling beyond a service limitation. For others, we need to put our minds to work and find our own ways to scale beyond cloud limits.
In today’s post we will talk about Azure Storage and what we should do if we want to scale beyond Azure Storage limits.
Before jumping to the problem, let’s look at some limitations of Azure Storage.
The biggest limitation is not the storage capacity, which can reach 500 TB of data per storage account; the biggest problem is the bandwidth. Depending on the following two characteristics, the available outbound bandwidth can vary from 10 to 30 gigabits per second:
- Region (location where the account was created)
- Type of storage redundancy
10 gigabits per second of outbound traffic already sounds very good, but we should take into account that this value is per Azure Storage account and not per file. On top of the total bandwidth of a storage account, there are also limitations at partition level.
Before going deeper into this topic, let’s see what a partition means in this case. A partition is an object of Azure Storage that determines how load balancing and traffic management work.
For each component of Azure Storage, a partition is represented by a different scalability unit:
- Azure Blobs – A partition is represented by Container + Blob
- Azure Tables – A partition is represented by Table + Partition
- Azure Queues – A partition is represented by a Queue
What does this mean? It means that each partition has a maximum throughput that it can reach, and we need to be aware of it.
At this moment, the maximum throughput limits for each partition are:
- Azure Blobs – 60 MB/s or 500 requests/s
- Azure Tables – 2,000 entities/s
- Azure Queues – 2,000 messages/s
We have now identified some limitations at partition level that can affect our system. Let’s see what we can do about them.
Imagine a system that stores large amounts of data on Azure Storage. This data contains software updates that will reach millions of devices worldwide. Because we need a secure mechanism to access this content (Shared Access Signature, i.e. token-based access), we need to store it in blobs.
Azure CDN is not used because it has no support for Shared Access Signature or any other security mechanism. A possible solution would be to encrypt the content and distribute it over the Azure CDN network, but the cost of decryption at device level is too high (from a CPU perspective).
Because of this, we are currently limited to 60 MB/s or up to 500 requests/s per blob. If the average download bandwidth of each device is 512 KB/s, at most 120 devices can download the content at full speed (60 MB/s × 1024 / 512 KB/s = 120). Alternatively, 500 devices could download the content at about 122 KB/s each.
- 120 devices at 512 KB/s
- 500 devices at 122 KB/s
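The two figures above can be reproduced with a few lines of Python. This is only a sketch: the constants are the per-blob limits quoted in this post, the function names are made up for illustration, and treating 500 requests/s as a cap on the number of simultaneous devices is the same simplification used in the text.

```python
# Per-blob limits as quoted in this post (they may differ from current Azure limits).
BLOB_MAX_KB_PER_S = 60 * 1024      # 60 MB/s expressed in KB/s
BLOB_MAX_REQUESTS_PER_S = 500      # used here as a cap on simultaneous devices


def max_devices_at_full_speed(device_kb_per_s: float) -> int:
    """Number of devices that can download from one blob at their full speed."""
    by_bandwidth = int(BLOB_MAX_KB_PER_S // device_kb_per_s)
    return min(by_bandwidth, BLOB_MAX_REQUESTS_PER_S)


def per_device_kb_per_s(devices: int) -> float:
    """Bandwidth each device gets when `devices` share one blob evenly."""
    return BLOB_MAX_KB_PER_S / devices


print(max_devices_at_full_speed(512))      # 120
print(round(per_device_kb_per_s(500), 2))  # 122.88
```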
Going further, a 100 MB package will reach at most 500 devices in 14 minutes (downloading 100 MB at 122 KB/s takes roughly 14 minutes). If we take into account the network connectivity issues that can appear at device level, we end up with at most 300 devices that successfully download the package every 14 minutes.
But in the world of IoT, where we can have 50,000 devices, this means that it would take 39 hours to distribute the package to all our devices.
Oops… as a client I expect more than that. I need to be able to push a package to these devices in a shorter period of time.
Before going further, let’s summarize.
The download limit for a blob is 60 MB/s or up to 500 requests per second. A 100 MB package will be delivered to 50,000 devices in 39 hours.
To reduce the package distribution time we need to be clever. Having only one location where the content is stored is simple to maintain, but it doesn’t allow us to scale beyond these limitations.
Nothing stops us from copying a package to multiple regions and datacenters. In this way we can have a package distributed to 3 different datacenters using 3 different storage accounts. From a performance perspective, this means that we can distribute a package to 900 devices in 14 minutes, so we would be able to reach all 50,000 devices in only 13 hours.
- 1 replica … 300 devices reached every 14 minutes
- 3 replicas … 900 devices reached every 14 minutes
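The relation between the number of replicas and the distribution time can be written as a small helper. It is a rough sketch that assumes, as the text does, that every replica serves 300 successful downloads per 14-minute batch and that the load is spread evenly; the names are illustrative only.

```python
DEVICES_PER_BATCH = 300   # successful downloads per replica per batch (figure from this post)
BATCH_MINUTES = 14        # time to download 100 MB at about 122 KB/s


def distribution_hours(devices: int, replicas: int) -> float:
    """Estimated hours needed to reach `devices` devices using `replicas` copies."""
    devices_per_minute = replicas * DEVICES_PER_BATCH / BATCH_MINUTES
    return devices / devices_per_minute / 60


for replicas in (1, 3):
    print(replicas, round(distribution_hours(50_000, replicas), 1))
# 1 38.9
# 3 13.0
```

The same function can be called with any number of replicas and devices to estimate other deployment configurations.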
But what if we need to distribute the same package to 500,000 devices? We can apply the same technique, only more cleverly. Let’s take the following deployment configuration:
- Replicate content in 5 datacenters
- In each datacenter replicate the content on 3 different Azure Storage accounts
Remark: We could replicate the content in 3 different blobs within the same Azure Storage account, but personally I prefer to use different Azure Storage accounts (this way the risk of reaching the storage account limits is lower).
Having the content replicated in 5 datacenters, on 3 Azure Storage accounts in each, means that we have the content replicated in 15 different locations.
The above configuration would allow us to reach:
- 4,500 devices every 14 minutes
- 50,000 devices in 2.6 hours
- 100,000 devices in 5.2 hours
- 500,000 devices in 26 hours
At this moment there are 16 Azure datacenters publicly available worldwide. Imagine deploying a package in all the datacenters, on 3 different accounts per datacenter. This would allow us to reach 14,400 devices every 14 minutes and 500,000 devices in 8.2 hours.
From a technical perspective, it is pretty clear that this adds some complexity to our system. No pain, no gain (smile).
Two components will become more complex:
The first one is the component that manages the package upload. This component needs to be able to replicate the package payload to all the datacenters and storage accounts where we want to store the package.
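A minimal sketch of such an upload component is shown below. It is not tied to any SDK: the `upload` callable, the account names and the container name are all hypothetical placeholders, and in a real system `upload` would wrap whatever storage client you use. The sketch only shows the fan-out and the per-target error reporting.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

# A target is (storage account name, container name); both are placeholders.
Target = Tuple[str, str]


def replicate_package(
    blob_name: str,
    payload: bytes,
    targets: List[Target],
    upload: Callable[[Target, str, bytes], None],
    max_workers: int = 8,
) -> Dict[Target, Optional[Exception]]:
    """Upload the same payload to every target in parallel.

    Returns a mapping target -> error, where None means the upload succeeded.
    """
    def upload_one(target: Target) -> Tuple[Target, Optional[Exception]]:
        try:
            upload(target, blob_name, payload)
            return target, None
        except Exception as exc:  # keep going so one failing target does not stop the rest
            return target, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(upload_one, targets))


# Usage with an in-memory stand-in for the real storage client.
store: Dict[Tuple[Target, str], bytes] = {}


def fake_upload(target: Target, blob_name: str, payload: bytes) -> None:
    store[(target, blob_name)] = payload


targets = [("acctwesteu1", "packages"), ("acctwesteu2", "packages"), ("acctnortheu1", "packages")]
results = replicate_package("update-1.0.bin", b"payload", targets, fake_upload)
failed = [target for target, error in results.items() if error is not None]
print(len(store), failed)  # 3 []
```

Returning the failures instead of raising on the first one lets the caller retry only the targets that did not receive the package.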
The second component that will become more complex is the one that resolves the blob download URL for each device. This component needs to distribute the load uniformly across all available nodes. If you can partition your devices by region, location or anything else, then you will be able to do this pretty simply.
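One simple way to spread devices across the replicas, when no natural partitioning such as region is available, is to hash the device identifier and use it to pick a replica. The sketch below assumes string device IDs and uses made-up URLs; generating the Shared Access Signature token for the chosen URL is left out.

```python
import hashlib
from collections import Counter
from typing import List


def pick_replica(device_id: str, replica_urls: List[str]) -> str:
    """Map a device to one replica URL, deterministically and roughly uniformly."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replica_urls)
    return replica_urls[index]


# Hypothetical URLs for 15 replicas (5 datacenters x 3 storage accounts).
replica_urls = [
    f"https://pkg{i}.example.invalid/packages/update-1.0.bin" for i in range(15)
]

# Count how many of 15,000 simulated devices land on each replica.
counts = Counter(pick_replica(f"device-{n}", replica_urls) for n in range(15_000))
print(len(counts), min(counts.values()), max(counts.values()))
```

Because the choice depends only on the device ID, a device always gets the same replica as long as the list of replicas does not change. If replicas are added or removed often, a consistent hashing scheme would move fewer devices around.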
In the near future I will write some posts that go into more detail about these two components.
Let’s see how the cost is affected by a solution that replicates the content in multiple datacenters. Let’s see how much it would cost for 500,000 devices to download 100 MB of data each from Azure Blobs.
Remark: Because the download cost differs by Zone (datacenter location), we will assume that all datacenters are in Zone 1, with 60 TB of traffic per month; the outbound traffic cost per GB would be €0.0522. We also assume a 30 TB storage account with only Locally Redundant Storage; the storage cost per month for 1 GB of data would be €0.022.
Download cost:
500,000 devices × 100 MB / 1024 = 48,828.125 GB = 47.68 TB
48,829 GB × €0.0522 = €2,548.87
Storage cost per month for 100 MB:
With only one copy: 100 MB = 0.098 GB × €0.022 = €0.002156
With 3 replicas: €0.006468
With 15 replicas: €0.03234
With 48 replicas: €0.103488 (16 datacenters × 3 storage accounts per datacenter)
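The same arithmetic as a short script, in case you want to plug in your own prices. The two rates are the assumptions stated in the remark above, not current Azure prices, and the function names are illustrative. The results differ slightly from the hand-calculated values because nothing is rounded along the way.

```python
EGRESS_EUR_PER_GB = 0.0522         # assumed Zone 1 outbound price from this post
STORAGE_EUR_PER_GB_MONTH = 0.022   # assumed locally redundant storage price from this post


def egress_cost_eur(devices: int, package_mb: float) -> float:
    """Total download cost when every device downloads the package once."""
    return devices * package_mb / 1024 * EGRESS_EUR_PER_GB


def storage_cost_eur(package_mb: float, replicas: int) -> float:
    """Monthly cost of keeping `replicas` copies of the package."""
    return package_mb / 1024 * STORAGE_EUR_PER_GB_MONTH * replicas


print(f"download: {egress_cost_eur(500_000, 100):.2f} EUR")
for replicas in (1, 3, 15, 48):
    print(f"{replicas:>2} replicas: {storage_cost_eur(100, replicas):.4f} EUR per month")
```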
As you can see, the cost of storage is only a very small part of the total price. The download cost is the one that dominates the total cost.
If we consider that we add an extra cost of less than €0.11 per month to reduce the distribution time for 500,000 devices from 16 days to 8.2 hours, it is pretty amazing what we can do with a clever solution.
We need to keep the package content replicated in 48 storage accounts only at the moment when we release the package. Once it is released and deployed, we can keep the content replicated in only 3 nodes, for example.
In conclusion, even if Azure Storage accounts have some limitations, the number of datacenters worldwide has now reached 16, so we can do pretty amazing things if we scale horizontally.
For only 11 cents per month, we can reduce the distribution time of a package from 16 days to 8.2 hours. This is the power of a system that is distributed worldwide.