
Managing different document versions in the same Azure Cosmos DB collection


Azure Cosmos DB, with its NoSQL approach to storing data, enables flexible schemas. It gives us the power to store multiple versions of the same document (object) in the same collection (list) inside a NoSQL data store.
Depending on the business requirements and on which other systems need to be integrated with your data store, having multiple document types in the same collection can be tricky and can create issues for the consumers.
The tricky part starts when multiple versions of the same document are stored in a collection. The application(s) that consume the documents need to be aware of the different versions of the same document type and be able to handle each of them, which involves transformations or version-specific handling logic.
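As a minimal illustration (the field names and the "schemaVersion" discriminator are a convention assumed for this post, not something Cosmos DB enforces), two versions of the same document type can sit side by side in one collection:

```python
# Two versions of the same "customer" document type stored in one collection.
# The "schemaVersion" discriminator and the field names are a convention assumed
# for this post; Cosmos DB itself does not require or interpret them.

customer_v1 = {
    "id": "customer-1001",
    "schemaVersion": 1,
    "name": "Jane Doe",
    "phone": "+40 700 000 000",
}

customer_v2 = {
    "id": "customer-2002",
    "schemaVersion": 2,
    "name": "John Doe",
    # v2 replaced the single phone field with a list of typed contact entries
    "contacts": [
        {"type": "phone", "value": "+40 700 000 001"},
        {"type": "email", "value": "john@example.com"},
    ],
}
```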

Imagine a system that has been running in production for 5 years. Every 6 months, a new version of the system is released, in which some of the documents stored in Azure Cosmos DB gain new fields or have existing ones modified.
With a flexible schema, this is not an issue by itself. Things become more complicated when you have 2 or 3 consumers of the same data store. Each consumer needs to implement its own transformation layer, which means aligning multiple teams and accepting the computation cost of the transformations. When one of the consumers is a reporting system, the latency and extra cost generated by the transformation layer become a pain.
Let's take a look at different approaches that can mitigate this situation. There is no perfect solution, because each situation is different, and at this moment there is no out-of-the-box feature that migrates documents to a new version for us.

Option 1 – Bulk migration to a new version
This is a classical solution that works in any situation. Yes, it is expensive and might be an anti-pattern, but it solves your main problem. It involves running a query over all the documents that need to be updated and applying a document transformation to each of them.
You might have challenges with this approach from two perspectives.
  • (1) The first one is related to time and how long the migration might take. During that period, your application might not work as expected. For a small database this might be 1 or 2 hours, but if millions of documents are impacted, it can take much longer.

  • (2) The second challenge is reliability and orchestration. In the case of an error, the solution needs to be able to resume from the last document that was updated. It might not sound complicated, but when you need to update 10M documents, you want a reliable solution that does not require manual verification after each run (a minimal sketch follows this list).
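Below is a minimal sketch of what such a bulk run could look like with the azure-cosmos Python SDK. The account URL, key, database and container names are placeholders, and upgrade_to_v2() plus the "schemaVersion" field are the assumed convention from the earlier example. Filtering on the old version also gives a crude resume mechanism: a failed run can simply be restarted and will skip the documents that were already upgraded.

```python
# Minimal bulk-migration sketch using the azure-cosmos Python SDK.
# Placeholders: account URL, key, database and container names.
# upgrade_to_v2() and "schemaVersion" are the assumed convention from above.
from azure.cosmos import CosmosClient

def upgrade_to_v2(doc: dict) -> dict:
    """Hypothetical transform: move the flat phone field into a contacts list."""
    doc["contacts"] = [{"type": "phone", "value": doc.pop("phone", None)}]
    doc["schemaVersion"] = 2
    return doc

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("shop").get_container_client("customers")

# Selecting only old-version documents makes the run naturally resumable:
# already-upgraded documents no longer match the filter, so a failed run
# can simply be restarted.
old_docs = container.query_items(
    query="SELECT * FROM c WHERE c.schemaVersion = 1",
    enable_cross_partition_query=True,
)

for doc in old_docs:
    container.upsert_item(upgrade_to_v2(doc))
```

For anything beyond a small collection, you would rather run this kind of loop inside a managed pipeline, which is where Azure Data Factory comes in.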

When you design a solution for a bulk migration, you want the ability to scale easily without consuming resources that are allocated to your application. Azure Data Factory lets you define a pipeline that handles the migration, and its ability to resume and track errors automatically makes it a strong candidate.
The support for SSIS inside Azure Data Factory enables us to build the solution in our on-premises environments (dev machines). Testing can be done on our own machines, and the SSIS package can be pushed to Azure Data Factory when we want to run the migration. There is no need to know SQL; SSIS has support for programming languages like C#, Java, and JavaScript that can be used to implement the transforms.

A pipeline inside Azure Data Factory runs inside its own sandbox and scales out automatically. In addition, the orchestration engine can process multiple transformations in parallel without requiring anything special from the SSIS package (your code).

Option 2 – Step-by-step version upgrade
This option takes into account that not all the documents are used every day, which means that a part of the documents does not need to be upgraded to the new version right away. It allows a step-by-step upgrade with a low impact on computation resources when a new version is rolled out.
During the read phase, the access and persistence layer of the application needs to be able to identify documents with an older version. For these documents, a transformation to the latest version is required, which involves building and managing transformers for all previous versions.
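A possible shape for such a transformer chain, as a sketch built on the same hypothetical "schemaVersion" convention (the v3 field is invented purely for illustration):

```python
# Sketch of an on-read upgrade chain at the persistence layer.
# Each transformer upgrades a document by exactly one version; reading a document
# replays the chain until it reaches the latest version. The version numbers,
# the "schemaVersion" field and the v3 field are assumptions for illustration.

LATEST_VERSION = 3

def v1_to_v2(doc: dict) -> dict:
    doc["contacts"] = [{"type": "phone", "value": doc.pop("phone", None)}]
    doc["schemaVersion"] = 2
    return doc

def v2_to_v3(doc: dict) -> dict:
    doc["loyaltyPoints"] = doc.get("loyaltyPoints", 0)  # hypothetical v3 field
    doc["schemaVersion"] = 3
    return doc

TRANSFORMERS = {1: v1_to_v2, 2: v2_to_v3}

def read_as_latest(doc: dict) -> dict:
    """Upgrade a document of any known version to LATEST_VERSION, in memory."""
    doc.setdefault("schemaVersion", 1)  # documents written before versioning
    while doc["schemaVersion"] < LATEST_VERSION:
        doc = TRANSFORMERS[doc["schemaVersion"]](doc)
    return doc
```

Optionally, the persistence layer can write the upgraded document back after the read, so the collection slowly converges to the latest version over time.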

The solution works great as long as you have only one consumer of your documents. The challenge appears when you have another system, like a reporting and analytics layer. The tricky part is that you need to ensure that the other system can also do the data transformation. If you can share the transformation package with them, great, but even so, the latency added to the reporting system for document transformation can impact it.
The real challenge is when the transformation package cannot be shared between the teams. Each team needs to implement the transforms itself, and you need to ensure strong communication between them; the chances that something goes wrong are much higher.

All the systems could access the content through a data API layer, which would ensure that the transformation is done in one location. This works great if you have applications that fetch only new data or execute simple queries. If the reports are generated each time on top of the full collections, you might have performance issues for large data stores.
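As a sketch of that idea, the API layer could expose a single read path that normalizes every document before handing it to a consumer, reusing the read_as_latest() helper from the previous sketch. Fetching only recently changed documents via the _ts system property keeps the simple "only new data" queries cheap; the function name and its parameters are illustrative, not a prescribed design.

```python
# Sketch of a shared read path in a data API layer: one place queries Cosmos DB
# and normalizes every document to the latest version before returning it.
# Reuses the container object and read_as_latest() from the sketches above;
# _ts is the Cosmos DB system property holding the last-modified epoch seconds.

def fetch_changed_since(container, epoch_seconds: int) -> list:
    docs = container.query_items(
        query="SELECT * FROM c WHERE c._ts >= @since",
        parameters=[{"name": "@since", "value": epoch_seconds}],
        enable_cross_partition_query=True,
    )
    return [read_as_latest(doc) for doc in docs]
```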

Conclusion
There is no perfect solution for these scenarios. Ignoring document versioning forces every consumer to be aware of multiple versions, which becomes tricky after a few releases. Doing the document version upgrade in bulk is feasible, but it might affect data consistency during the update.
A hybrid approach would be to do the bulk transformation while keeping, at the persistence layer, the transformers from the previous version to the current one, so that during the migration the system stays available and runs as expected.
