Blueprint of a cloud data store for objects with state updated often (Azure and AWS)

This post looks at how to design a cloud system, on Azure and on AWS, that has to handle objects whose state changes very often.

Proposition 
I want a system that stores objects with a flexible schema. Some of the objects are updated every day, and queries are always per object. Aggregation reports run inside a data warehouse solution, not directly against this storage.

Requirements  
Let’s imagine a client that needs a cloud solution with the following requirements: 
  • 500M objects stored  
  • 1M new objects added every day 
  • 10M object state changes every day
  • An update operation shall take under 0.3s
  • A query for a single object shall take under 0.2s
  • 30M queries per day that check the object state 
  • Dynamic schema of the objects 
  • Except for object state, all other attributes are written one time (during insert operation) 
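
To get a feeling for the load, the daily volumes above can be converted into average operations per second. The sketch below assumes the traffic is spread evenly over 24 hours, which is an assumption and not part of the requirements. Real traffic usually has peaks, so the results are better read as a lower bound.

```python
# Daily volumes taken from the requirements list.
DAILY_VOLUMES = {
    "inserts": 1_000_000,
    "state_updates": 10_000_000,
    "state_queries": 30_000_000,
}

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(daily_volume: int) -> float:
    """Average operations per second, assuming an even spread over the day."""
    return daily_volume / SECONDS_PER_DAY


def average_rates(volumes: dict[str, int]) -> dict[str, float]:
    """Apply the conversion to every entry of a {name: daily_volume} mapping."""
    return {name: average_rate_per_second(v) for name, v in volumes.items()}


if __name__ == "__main__":
    for name, rate in average_rates(DAILY_VOLUMES).items():
        print(f"{name}: ~{rate:.0f} ops/s on average")
```

Under that assumption, the system handles roughly 12 inserts, 116 state updates and 347 state queries per second on average.
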
Solution overview 
We need a NoSQL solution to store the objects. Even so, the challenging part is to design a solution that allows fast updates of the object state and keeps the cost under control. With a NoSQL solution, the size of the storage is not a problem: storing 500M or 1000M objects makes little difference as long as the partitioning is done the right way from the beginning.
Because most of the updates and queries target the state attribute of the object, we can optimize the storage by adding an index on the state field if necessary.
Even with a NoSQL solution, a high number of operations can create bottlenecks similar to those of a relational database. Besides this, we need to take the cost into account and optimize consumption as much as possible.
The proposed solution is a hybrid one that combines two different types of NoSQL storage. All object attributes except the state attribute are stored inside a document DB storage. The state attribute is stored inside a key-value storage that is optimized for a high number of writes and reads.
The latency may increase a little, because loading an object completely requires querying two storages. At the same time, the key-value storage makes it easy to retrieve the object state based on the object ID.
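
A minimal sketch of the split described above, using two in-memory dictionaries as stand-ins for the document DB and the key-value storage. The class and method names (`HybridObjectStore`, `insert`, `update_state`, `get_state`, `load_full`) and the sample attributes are hypothetical. They only illustrate which storage each operation touches.

```python
from typing import Any


class HybridObjectStore:
    """Illustrative only: dicts stand in for the two real storages."""

    def __init__(self) -> None:
        self._documents: dict[str, dict[str, Any]] = {}  # document DB stand-in
        self._states: dict[str, str] = {}                # key-value stand-in

    def insert(self, object_id: str, attributes: dict[str, Any], state: str) -> None:
        # Write-once attributes go to the document storage, state to key-value.
        if object_id in self._documents:
            raise KeyError(f"object {object_id!r} already exists")
        self._documents[object_id] = dict(attributes)
        self._states[object_id] = state

    def update_state(self, object_id: str, state: str) -> None:
        # The frequent operation: touches only the key-value storage.
        if object_id not in self._states:
            raise KeyError(f"unknown object {object_id!r}")
        self._states[object_id] = state

    def get_state(self, object_id: str) -> str:
        # State check by object ID: a single key lookup.
        return self._states[object_id]

    def load_full(self, object_id: str) -> dict[str, Any]:
        # Loading the complete object needs both storages (two lookups).
        document = dict(self._documents[object_id])
        document["state"] = self._states[object_id]
        return document


if __name__ == "__main__":
    store = HybridObjectStore()
    store.insert("obj-1", {"type": "parcel", "weightKg": 2.5}, state="created")
    store.update_state("obj-1", "in_transit")
    print(store.get_state("obj-1"))
    print(store.load_full("obj-1"))
```

In this sketch, the frequent state updates and state checks never touch the document storage. Only a full load pays for the second lookup.
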
For items that are updated very often, storing the data inside a key-value database costs much less than storing it inside a document DB storage.
In the next part of the post, we will take a look at how the solution would look inside Azure and AWS.

Azure Approach 
The data layer would use two different types of storage from Azure Cosmos DB. The first one is DocumentDB, which stores the object information inside Azure Cosmos DB. All object attributes are stored inside it except the object state attribute.
The object state attribute is stored inside Tables. This key-value store is optimized for a high number of writes. To reduce the running cost even more, we could replace the Tables that are part of Azure Cosmos DB with Azure Tables. Azure Tables are a good option for us as long as we limit our queries to lookups by object ID (key) and don’t try to run complex queries.
Inside Azure Cosmos DB we have some control at the partitioning level, but with Azure Tables we might hit some limitations. Because of this, if we go with Azure Tables, the Partition Key shall be a hash of the object ID and the Row Key the object ID itself. Also, if the number of transactions against Azure Tables is higher than 20K/second (the scalability target of a single Storage account), multiple Storage accounts might be required. If you don’t want to manage these possible issues and prefer to reduce the risk, you should go with Tables from Azure Cosmos DB.
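
One way to derive the keys described above is sketched below. It assumes the Partition Key is a fixed-size bucket computed from a hash of the object ID and the Row Key is the object ID. The bucket count, the number of Storage accounts and the helper names are hypothetical values chosen for illustration, not recommendations from Azure. The same hash can also pick a Storage account if the load has to be split across several.

```python
import hashlib

PARTITION_BUCKETS = 1024   # assumption: number of distinct partition keys
STORAGE_ACCOUNTS = 4       # assumption: accounts used to spread the load


def _stable_hash(object_id: str) -> int:
    # hashlib gives the same value on every machine and run,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def table_keys(object_id: str) -> dict[str, str]:
    """Return the PartitionKey/RowKey pair for an object's state entity."""
    bucket = _stable_hash(object_id) % PARTITION_BUCKETS
    return {"PartitionKey": f"{bucket:04d}", "RowKey": object_id}


def storage_account_index(object_id: str) -> int:
    """Pick which storage account holds the entity (0-based).

    Integer division by the bucket count uses different hash bits than the
    partition bucket, so the two choices are not correlated.
    """
    return (_stable_hash(object_id) // PARTITION_BUCKETS) % STORAGE_ACCOUNTS


if __name__ == "__main__":
    print(table_keys("obj-1"), storage_account_index("obj-1"))
```

A point lookup then needs only the object ID: recompute the keys and read the entity with that Partition Key and Row Key. Changing the bucket or account count later would move existing entities, so both would have to be fixed up front.
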
Azure Cosmos DB can scale automatically and has a DR strategy that is very powerful and easy to use. It’s one of the best NoSQL solutions on the market, and when it is well configured it is amazing. Automatic DR and data replication across regions are available, but everything comes with a cost, especially on the operational side ($$$).

AWS Approach 
The approach inside AWS is similar, but it is built on top of AWS DocumentDB to store the object attributes and AWS DynamoDB to store the object state. AWS DynamoDB is one of the best key-value data stores available on the market. When data consistency and DR are not top priorities, AWS DynamoDB is your best choice. Besides its scaling and speed capabilities, AWS DynamoDB can stream data changes towards AWS Redshift, so any update of the data is pushed automatically to the data warehouse without a custom synchronization mechanism.
AWS DocumentDB does its job very well and can store 500M objects without any issues.
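
How the state changes reach AWS Redshift depends on the chosen integration, which this post does not detail. As a sketch, assuming DynamoDB Streams is enabled with the new item image and a consumer (for example an AWS Lambda function) receives the records, the functions below flatten each record into a row that could be loaded into the data warehouse. The attribute names `objectId` and `state` and the output column names are hypothetical, and the delivery to Redshift itself is left out.

```python
from typing import Any, Optional


def state_change_row(record: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Flatten one DynamoDB Streams record into a row for the warehouse.

    Assumes the table key is a string attribute `objectId` and the state is
    a string attribute `state`; both names are hypothetical.
    """
    if record.get("eventName") not in ("INSERT", "MODIFY"):
        return None  # REMOVE events carry no new state
    change = record["dynamodb"]
    new_image = change.get("NewImage")
    if new_image is None or "state" not in new_image:
        return None  # the stream does not include the new image
    return {
        "object_id": change["Keys"]["objectId"]["S"],
        "state": new_image["state"]["S"],
        "changed_at": change.get("ApproximateCreationDateTime"),
    }


def rows_from_event(event: dict[str, Any]) -> list[dict[str, Any]]:
    """Convert a batch of stream records, skipping the ones without a state."""
    rows = (state_change_row(r) for r in event.get("Records", []))
    return [row for row in rows if row is not None]


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {
                "eventName": "MODIFY",
                "dynamodb": {
                    "ApproximateCreationDateTime": 1700000000,
                    "Keys": {"objectId": {"S": "obj-1"}},
                    "NewImage": {
                        "objectId": {"S": "obj-1"},
                        "state": {"S": "in_transit"},
                    },
                },
            },
            {
                "eventName": "REMOVE",
                "dynamodb": {"Keys": {"objectId": {"S": "obj-2"}}},
            },
        ]
    }
    print(rows_from_event(sample_event))
```
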

Final thoughts  
Splitting the data storage into two different types of storage can be the right choice when only a small subset of the fields is updated very often and the rest are written only one time, during the insert operation. Combining the power of document DB storage with key-value storage enables us to design a system that can manage a high throughput easily.
Both cloud providers offer services that match our needs, are highly scalable, and are cheap from the operational perspective. Inside Azure, this can be achieved by combining DocumentDB and Tables from Azure Cosmos DB. In the AWS ecosystem, we would use AWS DocumentDB and AWS DynamoDB.
