
Managing different document versions in the same Azure Cosmos DB collection


Azure Cosmos DB, with its NoSQL approach to storing data, gives us flexible schemas. It lets us store multiple versions of the same document (object) in the same collection (list) inside a NoSQL data store.
Depending on the business requirements and on which other systems need to be integrated with your data store, having multiple document types in the same collection can be tricky and can create issues for the consumers.
The tricky part starts when multiple versions of the same document are stored in a collection. The application(s) that consume the documents need to be aware of the different versions of the same document type and be able to handle each of them, which involves transformations or managing the documents in different ways.
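To make it concrete, here is a minimal sketch of two versions of the same document type living side by side in the same collection. The Customer document and the schemaVersion discriminator are just illustrations for this post, not something prescribed by Cosmos DB; the other examples below assume this hypothetical shape.

// Version 1 of the document: a single Name field.
public class CustomerV1
{
    public string id { get; set; }
    public int schemaVersion { get; set; } = 1;   // hypothetical version discriminator
    public string Name { get; set; }
}

// Version 2 splits Name into FirstName/LastName and adds Email.
// Both shapes can sit in the same collection at the same time.
public class CustomerV2
{
    public string id { get; set; }
    public int schemaVersion { get; set; } = 2;
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
}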

Imagine a system that has been running in production for 5 years. Every 6 months a new version of the system is released, in which some of the documents stored in Azure Cosmos DB gain new fields or have existing ones modified.
With a flexible schema this is not an issue, but when you have 2 or 3 consumers of the same data store, things become more complicated. Each consumer needs to implement its own transformation layer, which means aligning multiple teams and paying the computation cost of the transformation several times. When one of the consumers is a reporting system, the latency and extra cost generated by the transformation layer become a pain.
Let’s take a look at different approaches that can mitigate this situation. There is no perfect solution because each situation is different, and at this moment there is no out-of-the-box feature that would migrate documents to a new version for us.

Option 1 – Bulk migration to a new version
This is the classical solution that works in any situation. Yes, it is expensive and might be an anti-pattern, but it solves the main problem. It involves running a query over all the documents that need to be updated and applying a transformation to each of them.
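As a rough sketch of what such a migration could look like with the .NET SDK (Microsoft.Azure.Cosmos), assuming the hypothetical Customer shape above and a container partitioned on /id (adjust the partition key otherwise):

using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json.Linq;

public static class BulkMigration
{
    // One-off migration: find every document still on schema version 1,
    // transform it in memory and write it back as version 2.
    public static async Task MigrateAsync(Container container)
    {
        var query = new QueryDefinition("SELECT * FROM c WHERE c.schemaVersion = 1");
        FeedIterator<JObject> iterator = container.GetItemQueryIterator<JObject>(query);

        while (iterator.HasMoreResults)
        {
            foreach (JObject doc in await iterator.ReadNextAsync())
            {
                // The transform itself: split Name, stamp the new version.
                string[] parts = ((string)doc["Name"] ?? "").Split(new[] { ' ' }, 2);
                doc["FirstName"] = parts.ElementAtOrDefault(0) ?? "";
                doc["LastName"] = parts.ElementAtOrDefault(1) ?? "";
                doc.Remove("Name");
                doc["schemaVersion"] = 2;

                await container.ReplaceItemAsync(
                    doc, (string)doc["id"], new PartitionKey((string)doc["id"]));
            }
        }
        // Because the query filters on schemaVersion, a failed run can simply be
        // restarted: documents that were already migrated no longer match.
    }
}

The version filter in the query also gives you a cheap form of resumability, which matters for the second challenge below.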
You might have challenges with this approach from two perspectives.
  • (1) The first one is time: how long the migration might take. During that period, your application might not work as expected. For a small database this might be 1 or 2 hours, but if millions of documents are impacted, it can take a while.

  • (2) The second challenge is reliability and orchestration. In case of an error, the solution needs to be able to resume from the last document that was updated. It might not sound complicated, but when you need to update 10M documents you want a reliable solution that does not require manual verification after each run (a small progress-check sketch follows this list).
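A cheap way to track progress, or to check what is left after a failed run, is to count the documents that are still on the old version. A small sketch, again assuming the hypothetical schemaVersion field:

using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class MigrationProgress
{
    // Returns how many documents still need to be migrated.
    public static async Task<int> CountRemainingAsync(Container container)
    {
        var query = new QueryDefinition("SELECT VALUE COUNT(1) FROM c WHERE c.schemaVersion = 1");
        FeedIterator<int> iterator = container.GetItemQueryIterator<int>(query);

        int remaining = 0;
        while (iterator.HasMoreResults)
        {
            foreach (int partialCount in await iterator.ReadNextAsync())
            {
                remaining += partialCount;
            }
        }
        return remaining;
    }
}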

When you design a bulk migration solution, you want the ability to scale easily without consuming resources that are allocated to your application. Azure Data Factory lets you define a pipeline that can handle the migration, and its ability to resume and track errors automatically makes it a strong candidate.
The support for SSIS inside Azure Data Factory enables us to build the solution in our on-premises environments (dev machines). The testing can be done on our own machines, and the SSIS package can be pushed to Azure Data Factory when we want to run the migration. There is no need to know SQL; SSIS has support for programming languages like C#, Java and JavaScript that can be used to implement the transforms.
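One nice side effect of keeping the transform itself a small, pure function (like the Name split in the earlier sketch) is that it can be unit tested locally before the package is pushed to Azure Data Factory. A sketch with xUnit and Newtonsoft.Json, assuming the transform is exposed as a hypothetical UpgradeToV2 helper:

using Newtonsoft.Json.Linq;
using Xunit;

public class UpgradeToV2Tests
{
    [Fact]
    public void Name_is_split_and_version_is_bumped()
    {
        var v1 = JObject.Parse("{ \"id\": \"1\", \"schemaVersion\": 1, \"Name\": \"Jane Doe\" }");

        JObject v2 = DocumentTransforms.UpgradeToV2(v1);   // hypothetical helper holding the transform

        Assert.Equal("Jane", (string)v2["FirstName"]);
        Assert.Equal("Doe", (string)v2["LastName"]);
        Assert.Equal(2, (int)v2["schemaVersion"]);
    }
}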

A pipeline inside Azure Data Factory runs inside its own sandbox and scales out automatically. In addition, the orchestration engine can process multiple transformations in parallel without requiring anything special from the SSIS package (your code).

Option 2 – Step by Step version upgrade
This option takes into account that not all documents are used every day, which means that a part of them doesn’t need to be upgraded to the new version right away. This allows a step-by-step upgrade with a low impact on computation resources when a new version is rolled out.
During the read phase, the access and persistence layer of the application needs to be able to identify a document with an older version. For these documents, a transformation to the latest version is required, which involves building and maintaining transformers for all previous versions.
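A rough sketch of how the persistence layer could handle this: one transformer per version step, and a read walks the chain until the document reaches the current version. The interface, the version numbers and the schemaVersion field are illustrative, not a prescribed API.

using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

// One transformer per version step (1 -> 2, 2 -> 3, ...).
public interface IDocumentUpgrader
{
    int FromVersion { get; }
    JObject Upgrade(JObject document);
}

public class DocumentReader
{
    private const int CurrentVersion = 3;                       // illustrative
    private readonly Dictionary<int, IDocumentUpgrader> _upgraders;

    public DocumentReader(IEnumerable<IDocumentUpgrader> upgraders)
    {
        _upgraders = upgraders.ToDictionary(u => u.FromVersion);
    }

    // Called on every document fetched from Cosmos DB before it is handed
    // to the rest of the application.
    public JObject ToCurrentVersion(JObject document)
    {
        int version = (int?)document["schemaVersion"] ?? 1;
        while (version < CurrentVersion)
        {
            document = _upgraders[version].Upgrade(document);
            version = (int)document["schemaVersion"];
        }
        return document;
    }
}

Whether the upgraded shape is also persisted back (on read or on the next write) or stays in memory is a separate design choice; the sketch above only covers the read path.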

The solution works great as long as you have only one consumer for your documents. The challenge appears when you have another system, like a reporting and analytics layer, because you need to ensure that this system can also do the data transformation. If you can share the transformation package with that team, great, but even so, the latency added to the reporting system by the document transformation can impact it.
Things get harder when the transformation package cannot be shared between the teams. Each team needs to implement the transforms itself, and you need strong communication between them; the chances that something goes wrong are much higher.

All the systems could access the content through a data API layer, which would ensure that the transformation is done in one location. This works great if the applications fetch only new data or execute simple queries, but if the reports are generated each time directly on top of the collections, you might have performance issues for large data stores.
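A minimal sketch of such a data API, here as an ASP.NET Core minimal API that reuses the hypothetical DocumentReader from the previous sketch; the connection string, database and container names are placeholders and the container is assumed to be partitioned on /id.

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.DependencyInjection;
using Newtonsoft.Json.Linq;

var builder = WebApplication.CreateBuilder(args);

// Placeholder connection and names; the upgraders come from the previous sketch.
builder.Services.AddSingleton(_ =>
    new CosmosClient("<connection-string>").GetContainer("app-db", "documents"));
builder.Services.AddSingleton(new DocumentReader(new IDocumentUpgrader[] { /* version upgraders */ }));

var app = builder.Build();

// The single place where version upgrades happen; every consumer gets the latest shape.
app.MapGet("/customers/{id}", async (string id, Container container, DocumentReader reader) =>
{
    ItemResponse<JObject> response =
        await container.ReadItemAsync<JObject>(id, new PartitionKey(id));
    return Results.Content(reader.ToCurrentVersion(response.Resource).ToString(), "application/json");
});

app.Run();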

Conclusion
There is no perfect solution for these scenarios. Ignoring document versioning forces every consumer to be aware of multiple versions, which becomes tricky after a few releases. Doing the version upgrade in bulk is feasible, but it might affect data consistency during the update.
A hybrid approach would be to do the bulk transformation while keeping, at the persistence layer, the transformers from the previous version to the current one, so that the system stays available and running as expected during the migration.
