Azure Cosmos DB, with its NoSQL approach to storing data, enables us to have flexible schemas. It gives us the power to store multiple versions of the same document (object) in the same collection (list) inside a NoSQL data store.
Depending on the business requirements and what other systems need to be integrated with your data store, having multiple document types in the same collection can be tricky and can create issues for the consumers.
The tricky part is when you start to have multiple versions of the same document stored in a collection. The application(s) that consume the documents need to be aware of the multiple versions of the same document type and be able to handle each of them differently, which involves transformations or managing the documents in different ways.
Imagine a system that has run in production for 5 years. Every 6 months, a new version of the system is released in which some of the documents stored in Azure Cosmos DB gain new fields or have existing ones modified.
With a flexible schema, this is not an issue by itself, but when you have 2 or 3 consumers of the same data store, things become more complicated. Each consumer needs to implement the transformation layer, which means aligning multiple teams and taking into account the computational cost of the transformation. When one of the consumers is a reporting system, the latency and extra cost generated by the transformation layer become a pain.
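To make the pain concrete, here is a minimal sketch of the transformation layer each consumer ends up duplicating. The field names (`schemaVersion`, `name`, `firstName`, `status`) and the version changes are assumptions for illustration, not the schema of any real system:

```python
# Sketch of a consumer-side normalizer. Every consumer of the collection
# has to replicate logic like this for every historical version.

LATEST_VERSION = 3

def to_latest(doc: dict) -> dict:
    """Normalize a stored document of any version to the latest shape."""
    version = doc.get("schemaVersion", 1)
    if version == 1:
        # Hypothetical v1 -> v2 change: a single "name" field was split.
        first, _, last = doc.pop("name", "").partition(" ")
        doc["firstName"], doc["lastName"] = first, last
        version = 2
    if version == 2:
        # Hypothetical v2 -> v3 change: a "status" field with a default.
        doc.setdefault("status", "active")
        version = 3
    doc["schemaVersion"] = version
    return doc
```

Multiply this by the number of consumers and the number of releases, and the maintenance cost becomes obvious.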
Let’s take a look at different approaches that can be used to mitigate this situation. There is no perfect solution, because each situation is different, and at this moment there is no out-of-the-box solution that would migrate documents to a new version for us.
Option 1 – Bulk migration to a new version
It is a classical solution that works in any situation. Yes, it is expensive and might be an anti-pattern, but it solves your main problem. It involves running a query over all the documents that need to be updated and applying a transformation to each document.
You might have challenges with this approach from two perspectives.
- (1) The first one is related to time and how long the migration might take. During that period, your application might not work as expected. For a small database, this might be 1 or 2 hours, but if millions of documents are impacted, it can take a while.
- (2) The second challenge is from the reliability and orchestration point of view. In the case of an error, the solution needs to be able to resume from the last document that was updated. It might not sound complicated, but when you need to update 10M documents, you want a reliable solution that does not require verification after a run.
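The resume-after-error requirement can be sketched as a checkpointing loop. The helpers below (`query_page`, `save_checkpoint`, etc.) are hypothetical stand-ins for whatever your migration tool provides; the point is persisting the paging token so a crashed run picks up where it left off instead of re-verifying everything:

```python
# Sketch of a resumable bulk-migration loop using page tokens
# (Cosmos DB queries expose continuation tokens for exactly this).

def migrate(query_page, save_checkpoint, load_checkpoint, transform, upsert):
    """Process documents page by page, checkpointing after each page."""
    token = load_checkpoint()              # None on a fresh run
    while True:
        docs, token = query_page(token)    # fetch one page of documents
        for doc in docs:
            upsert(transform(doc))         # write back the upgraded shape
        save_checkpoint(token)             # resume point if the run dies here
        if token is None:                  # no more pages: migration done
            break
```

A page-level checkpoint means at most one page is reprocessed after a failure, which is acceptable as long as the transformation is idempotent.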
When you design a solution to do a bulk migration, you want the ability to scale easily without consuming the resources that are allocated to your application. Azure Data Factory gives you the ability to define a pipeline that can handle the migration, and its ability to resume and track errors automatically makes it a strong candidate.
The support for SSIS inside Azure Data Factory enables us to build the solution in our on-premises environments (dev machines). Testing can be done on our own machines, and the SSIS package can be pushed to Azure Data Factory when we want to run the migration. There is no need to know SQL; SSIS has support for programming languages like C# that can be used to implement the transforms.
A pipeline inside Azure Data Factory runs inside its own sandbox and scales out automatically. In addition to this, the orchestration engine can process multiple transformations in parallel without requiring anything special from the SSIS package (your code).
Option 2 – Step by Step version upgrade
This approach takes into account that not all the documents are used every day. It means that a part of the documents don’t need to be upgraded to the new version ASAP. This allows a step-by-step upgrade with a low impact on computation resources when a new version is rolled out.
During the read phase, the access and persistence layer of the application needs to be able to identify a document with an older version. For these documents, a transformation to the latest version is required, which involves building and maintaining the transformers for all previous versions.
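One way to keep those transformers manageable is a registry of single-step upgrades that the read path chains until the document reaches the latest version. The versions, field names, and per-version changes below are invented for illustration:

```python
# Sketch of a persistence-layer transformer registry. Each entry upgrades
# a document by exactly one version; reads chain them to the latest.

LATEST = 4

# Hypothetical upgrade steps, keyed by the *source* version.
TRANSFORMERS = {
    1: lambda d: {**d, "currency": "USD"},                          # v1 -> v2: default field
    2: lambda d: {**d, "total": d["net"] + d["tax"]},               # v2 -> v3: derived field
    3: lambda d: {k: v for k, v in d.items() if k != "legacyId"},   # v3 -> v4: drop field
}

def read(doc: dict) -> dict:
    """Return the document upgraded in memory to the latest version."""
    version = doc.get("version", 1)
    while version < LATEST:
        doc = TRANSFORMERS[version](doc)
        version += 1
    doc["version"] = version
    return doc
```

Adding a release then means adding one transformer, rather than touching every pairwise conversion.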
The solution works great as long as you have only one consumer of your documents. The challenge appears when you have another system, like a reporting and analytics layer. The tricky part here is that you need to ensure that the other system can do the data transformation as well. If you can share the transformation package with them, great; but even so, the latency that document transformation adds to the reporting system can impact it.
The real challenge is when the transformation package cannot be shared between the teams. Each team needs to implement the transforms itself, and you need to ensure strong communication between them. The chances that something goes wrong are much higher.
All the systems could access the content through a data API layer, which would ensure that the transformation is done in one location. It works great if you have applications that fetch only new data or execute simple queries, but if the reports are generated each time on top of the whole collections, you might have performance issues for large data stores.
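A thin facade is enough to centralize the transformation. In this sketch, `get_orders`, `upgrade`, and the field names are hypothetical; the design point is that every consumer goes through one read path, so the version logic lives in exactly one place:

```python
# Sketch of a data-API facade: all consumers call this instead of
# querying the collection directly.

def upgrade(doc: dict) -> dict:
    """Stand-in for the real version-upgrade logic (assumed fields)."""
    doc = dict(doc)                         # never mutate the stored copy
    doc.setdefault("schemaVersion", 2)      # hypothetical latest version
    return doc

def get_orders(store, customer_id: str) -> list:
    """The single read path; transformation happens once, here."""
    return [upgrade(d) for d in store if d.get("customerId") == customer_id]
```

The trade-off the paragraph above describes still applies: for heavy reporting queries, this per-read upgrade is exactly where the latency accumulates.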
There is no perfect solution for these scenarios. Ignoring document versioning requires every consumer to be aware of multiple versions, which becomes tricky after a few releases. Doing the document version upgrade in bulk is feasible, but it might affect data consistency during the update.
A hybrid approach would be to do the bulk transformation while keeping, at the persistence layer, the transformers from the previous version to the current one, ensuring that during the migration the system stays available and runs as expected.