Managing different document versions in the same Azure Cosmos DB collection


Azure Cosmos DB, with its NoSQL approach to storing data, enables us to have flexible schemas. It gives us the power to store multiple versions of the same document (object) in the same collection (list) inside a NoSQL data store.
Depending on the business requirements and on which other systems need to be integrated with your data store, having multiple document types in the same collection can be tricky and can create issues for the consumers.
The tricky part begins when you start to have multiple versions of the same document stored in a collection. The application(s) that consume the documents need to be aware of the different versions of the same document type and be able to manage them differently, which involves transformations or version-specific handling of the documents.

Imagine a system that has been running in production for 5 years. Every 6 months, a new version of the system is released, and with each release some of the documents stored in Azure Cosmos DB gain new fields or have existing ones modified.
With a flexible schema, this is not an issue by itself. Things become more complicated when you have 2 or 3 consumers of the same data store. Each consumer needs to implement its own transformation layer, which means aligning multiple teams and taking into account the computation cost of the transformation. When one of the consumers is a reporting system, the latency and extra cost generated by the transformation layer become a pain.
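As a quick illustration (the document type and field names below are hypothetical), a customer document written by an early release might carry only a full name, while a later release splits the name into separate fields and tags every document with a schemaVersion marker:

```csharp
// Hypothetical example: two versions of the same "customer" document type
// stored side by side in one Cosmos DB collection.
public class CustomerV1
{
    public string id { get; set; }
    public int schemaVersion { get; set; } = 1;
    public string fullName { get; set; }
}

public class CustomerV2
{
    public string id { get; set; }
    public int schemaVersion { get; set; } = 2;
    public string firstName { get; set; }   // split out of fullName
    public string lastName { get; set; }
    public string loyaltyTier { get; set; } // field added in a later release
}
```

Any consumer that reads the collection has to cope with both shapes until all documents reach the latest version.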
Let’s take a look at different approaches that can be used to mitigate this situation. There is no perfect solution, because each situation is different, and at this moment there is no out-of-the-box feature that would migrate documents to a new version for us.

Option 1 – Bulk migration to a new version
It is a classical solution that works in any situation. Yes, it is expensive and might be an anti-pattern, but it solves your main problem. It involves running a query over all the documents that need to be updated and applying a transformation to each of them.
You might face challenges with this approach from two perspectives.
  • (1) The first one is related to time and how long the migration might take. During that period, your application might not work as expected. For a small database, this might be 1 or 2 hours, but if millions of documents are impacted, it can take much longer.

  • (2) The second challenge is from the reliability and orchestration point of view. In case of an error, the solution needs to be able to resume from the last document that was updated. It might not sound complicated, but when you need to update 10M documents, you want a reliable solution that does not require manual verification after each run; a minimal resumable loop is sketched below.
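A minimal sketch of such a resumable bulk pass, assuming the hypothetical customer documents above, the .NET SDK (Microsoft.Azure.Cosmos), and a collection partitioned on /id; the continuation token is checkpointed to disk so an interrupted run can resume from roughly the last processed page:

```csharp
// Minimal sketch of a resumable bulk migration pass. Assumptions: documents carry a
// numeric schemaVersion field and the collection is partitioned on /id.
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class BulkVersionMigration
{
    public static async Task RunAsync(Container container)
    {
        // Persisting the continuation token lets an interrupted run resume
        // from (roughly) the last processed page instead of starting over.
        const string checkpointFile = "migration.checkpoint";
        string continuation = File.Exists(checkpointFile) ? File.ReadAllText(checkpointFile) : null;
        if (string.IsNullOrEmpty(continuation)) continuation = null;

        QueryDefinition query = new QueryDefinition(
                "SELECT * FROM c WHERE c.schemaVersion < @target")
            .WithParameter("@target", 2);

        FeedIterator<CustomerV1> iterator = container.GetItemQueryIterator<CustomerV1>(
            query,
            continuationToken: continuation,
            requestOptions: new QueryRequestOptions { MaxItemCount = 100 });

        while (iterator.HasMoreResults)
        {
            FeedResponse<CustomerV1> page = await iterator.ReadNextAsync();

            foreach (CustomerV1 old in page)
            {
                CustomerV2 upgraded = TransformToV2(old);

                // Same id and partition key, so the old document is overwritten in place.
                await container.ReplaceItemAsync(upgraded, upgraded.id, new PartitionKey(upgraded.id));
            }

            // Checkpoint after every page so a failed run can be resumed.
            File.WriteAllText(checkpointFile, page.ContinuationToken ?? string.Empty);
        }
    }

    private static CustomerV2 TransformToV2(CustomerV1 old)
    {
        string[] parts = (old.fullName ?? string.Empty).Split(new[] { ' ' }, 2);
        return new CustomerV2
        {
            id = old.id,
            schemaVersion = 2,
            firstName = parts.Length > 0 ? parts[0] : string.Empty,
            lastName = parts.Length > 1 ? parts[1] : string.Empty
        };
    }
}
```

In a real migration the checkpoint would live in durable shared storage and the writes would be batched, but the shape of the loop stays the same.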

When you design a solution for a bulk migration, you want the ability to scale easily without consuming the resources that are allocated to your application. Azure Data Factory gives us the ability to define a pipeline that can handle the migration, and its ability to resume and track errors automatically makes it a strong candidate.
The support for SSIS inside Azure Data Factory enables us to build the solution in our on-premises environments (dev machines). The testing can be done on our own machines, and the SSIS package can be pushed to Azure Data Factory when we want to run the migration. There is no need to know SQL; SSIS has support for programming languages like C#, Java, and JavaScript that can be used to implement the transforms.

A pipeline inside Azure Data Factory runs inside its own sandbox and scales out automatically. In addition, the orchestration engine can process multiple transformations in parallel without requiring anything special from the SSIS package (your code).

Option 2 – Step-by-step version upgrade
This option takes into account that not all documents are used every day, which means that a part of them does not need to be upgraded to the new version right away. This allows a step-by-step upgrade with a low impact on computation resources when a new version is rolled out.
During the read phase, the access and persistence layer of the application needs to be able to identify a document with an older version. For these documents, a transformation to the latest version is required, which involves building and managing the transformers for all previous versions.
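A minimal sketch of such a read-time upgrade chain, reusing the hypothetical schemaVersion field from above; each transformer only knows how to move a document from one version to the next, and the persistence layer walks the chain until the document reaches the latest version:

```csharp
// Minimal sketch of a read-time upgrade chain (hypothetical types and field names).
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

public interface IVersionTransformer
{
    int FromVersion { get; }           // version this transformer upgrades from
    JObject Upgrade(JObject document); // returns the document at FromVersion + 1
}

// Example: v1 -> v2 splits fullName into firstName/lastName.
public class V1ToV2Transformer : IVersionTransformer
{
    public int FromVersion => 1;

    public JObject Upgrade(JObject doc)
    {
        string[] parts = ((string)doc["fullName"] ?? string.Empty).Split(new[] { ' ' }, 2);
        doc["firstName"] = parts.Length > 0 ? parts[0] : string.Empty;
        doc["lastName"] = parts.Length > 1 ? parts[1] : string.Empty;
        doc.Remove("fullName");
        doc["schemaVersion"] = 2;
        return doc;
    }
}

public class DocumentUpgrader
{
    private readonly Dictionary<int, IVersionTransformer> _transformers;
    private readonly int _latestVersion;

    public DocumentUpgrader(IEnumerable<IVersionTransformer> transformers, int latestVersion)
    {
        _transformers = new Dictionary<int, IVersionTransformer>();
        foreach (var t in transformers) _transformers[t.FromVersion] = t;
        _latestVersion = latestVersion;
    }

    // Applied by the persistence layer on every read: older documents are
    // upgraded in memory, already-current documents pass through untouched.
    public JObject ToLatest(JObject document)
    {
        int version = (int?)document["schemaVersion"] ?? 1;
        while (version < _latestVersion && _transformers.TryGetValue(version, out var transformer))
        {
            document = transformer.Upgrade(document);
            version = (int?)document["schemaVersion"] ?? version + 1;
        }
        return document;
    }
}
```

New releases only add one more transformer to the chain, while the read path itself stays unchanged.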

The solution works great as long as you have only one consumer for your documents. The challenge appears when you have another system, like a reporting and analytics layer, because you need to ensure that the other system can also do the data transformation. If you can share the transformation package with them, great, but even so, the latency added to the reporting system by the document transformation can impact it.
The real difficulty comes when the transformation package cannot be shared between the teams. Each team needs to implement the transforms itself, you need to ensure strong communication between them, and the chances that something goes wrong are much higher.

All the systems could access the content through a data API layer, which would ensure that the transformation is done in one location. It works great if your applications fetch only new data or execute simple queries, but if the reports are generated each time on top of the full collections, you might have performance issues for large data stores.
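A hedged sketch of what that single entry point could look like, reusing the hypothetical DocumentUpgrader from the previous option so every consumer, including reporting, receives documents already upgraded to the latest shape:

```csharp
// Hypothetical data API read path: every consumer goes through this method,
// so the version upgrade happens in exactly one place.
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json.Linq;

public class CustomerReadApi
{
    private readonly Container _container;
    private readonly DocumentUpgrader _upgrader;

    public CustomerReadApi(Container container, DocumentUpgrader upgrader)
    {
        _container = container;
        _upgrader = upgrader;
    }

    public async Task<JObject> GetCustomerAsync(string id)
    {
        // Assumes the collection is partitioned on /id.
        ItemResponse<JObject> response =
            await _container.ReadItemAsync<JObject>(id, new PartitionKey(id));
        return _upgrader.ToLatest(response.Resource);
    }
}
```

Centralizing the upgrade here removes the need for each team to maintain its own transformers, at the price of extra latency on every read.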

Conclusion
There is no perfect solution for these scenarios. Ignoring document versioning forces every consumer to be aware of multiple versions, which becomes tricky after a few releases. Upgrading the documents in bulk is feasible, but it might affect data consistency during the update.
A hybrid approach would be to do the bulk transformation and keep, at the persistence layer, the transformers from the previous version to the current one, so that during the migration the system remains available and running as expected.
