Skip to main content

Managing different document versions in the same collection of Azure Cosmos DB


Azure CosmosDB, with the NoSQL approach of storing data, enables us to have flexible schemas. It gives us the power to store multiple version of the same document (object) in the same collection (list) inside a NoSQL data store.
Depending on the business requirements and what other system needs to be integrated with your data store, having multiple document types in the same collection can be tricky and can create issues for the consumer. 
The tricky part is when you start to have multiple versions of the same document stored in a collection. The application(s) that consume the documents needs to start to be aware of multiple versions of the same document type and the ability to manage them differently. It involves transformations or managing the documents in different ways.

Imagine a system that runs in production for 5 years. Every 6 months, a new version of the system is released, where a part of the documents that are stored in Azure CosmosDB have new fields or the current ones are modified. 
With a flexible schema, this is not an issue. When you have 2 or 3 consumers of the same datastore things become more complicated. Each consumer needs to implement the transformation layer, aligning multiple teams and taking into account the computation cost of transformation. When one of the consumers is a reporting system, the latency and extra cost generated by the transformation layer becomes a pain. 
Let’s take a look at different approaches that can be used to mitigate this situation. There is no perfect solution because each situation is different. At this moment in time, there is no out of the box solution that would allow us to migrate documents to a new version.

Option 1 – Bulk migration to a new version
It is a classical solution that works in any situations. Yes, it is expensive, might be an anti-pattern, but it solves your main problem. It involves running a query over all the documents that need to be updated and do a document transformation. 
You might have challenges with this approach from two perspectives.
  • (1) The first one is related to time and how long the migration might take. During that period, your application might not work as expected. For a small database, this might be 1 or 2 hours, but if you have millions of documents that are impacted, this can take a while. 

  • (2) The second challenge is from the reliability and orchestration point of view. In the case of an error, the solution needs to be able to resume from the last document that was updated. It might not sound complicated, but when you need to update 10M documents, you want to have a reliable solution, that does not require verification after a run. 

When you design a solution to do a bulk migration, you want to have the ability to scale easilty and not to consume resources that are allocated for your application. Azure Data Factory gives the ability to define a pipeline that can handle the migration. The ability to resume and track errors automatically makes Azure Data Factory a strong candidate. 
The support for SSIS inside Azure Data Factory enables us to build the solution on our on-premises environments (dev machines). The testing can be done on our own machine and push the SSIS package inside Azure Data Factory when we want to run the migration. There is no need to know SQL, SSIS has support for programming languages like C#, Java, JavaScipt that can be used to implement the transforms. 

A pipeline inside Azure Data Factory runs inside his sandbox and scale-out automatically. Additional to this, the orchestration factory can process multiple transformations in parallel without having to do something special from the SSIS package (your code).

Option 2 – Step by Step version upgrade
It takes into account that not all the documents are used every day. It means that a part of the document doesn’t need to be upgraded to the new version ASAP. This allows a step by step upgrade with a low impact on computation resources when a new version is a roll-out.
During the read phase, the access and persistence layer of the application needs to be able to identify a document with an older version. For these documents, a transformation to the latest version is required. It involves building and managing the transformers for all previous version. 

The solution works great as long as you have only one consumer for your documents. The challenge appears when you have another system, like the reporting and analytics layer. The tricky part here is that you need to ensure that the other system can do the data transformation. If you can share with them package for it is great, but even so, the latency that you can add to the reporting system to do document transformation can impact the system. 
The challenge is when the transformation package cannot be shared between the teams. Each team needs to implement the transforms, and you need to ensure that you have strong communication between them. The chances that something goes wrong are much higher.

All the systems could access the content using a data API layer, that would ensure that the transformation is done in one location. It works great if you have applications that fetch only new data or execute simple queries. If the reports are generated each time on top of the collections, you
might have performance issues for large storages.  

Conclusion
There is no perfect solution for these scenarios. Ignoring document versioning requires to be aware of multiple version that can be tricky after a few releases. Doing the documents versioning upgrade in bulk is feasible, but might affect the data consistency during the update.
A hybrid approach would be to do the bulk transformation and keep at the persistence layer the transformers from the previous version to the current one to ensure that during migration the system is available and running as expected.

Comments

Popular posts from this blog

ADO.NET provider with invariant name 'System.Data.SqlClient' could not be loaded

Today blog post will be started with the following error when running DB tests on the CI machine:
threw exception: System.InvalidOperationException: The Entity Framework provider type 'System.Data.Entity.SqlServer.SqlProviderServices, EntityFramework.SqlServer' registered in the application config file for the ADO.NET provider with invariant name 'System.Data.SqlClient' could not be loaded. Make sure that the assembly-qualified name is used and that the assembly is available to the running application. See http://go.microsoft.com/fwlink/?LinkId=260882 for more information. at System.Data.Entity.Infrastructure.DependencyResolution.ProviderServicesFactory.GetInstance(String providerTypeName, String providerInvariantName) This error happened only on the Continuous Integration machine. On the devs machines, everything has fine. The classic problem – on my machine it’s working. The CI has the following configuration:

TeamCity.NET 4.51EF 6.0.2VS2013
It seems that there …

GET call of REST API that contains '/'-slash character in the value of a parameter

Let’s assume that we have the following scenario: I have a public HTTP endpoint and I need to post some content using GET command. One of the parameters contains special characters like “\” and “/”. If the endpoint is an ApiController than you may have problems if you encode the parameter using the http encoder.
using (var httpClient = new HttpClient()) { httpClient.BaseAddress = baseUrl; Task<HttpResponseMessage> response = httpClient.GetAsync(string.Format("api/foo/{0}", "qwert/qwerqwer"))); response.Wait(); response.Result.EnsureSuccessStatusCode(); } One possible solution would be to encode the query parameter using UrlTokenEncode method of HttpServerUtility class and GetBytes method ofUTF8. In this way you would get the array of bytes of the parameter and encode them as a url token.
The following code show to you how you could write the encode and decode methods.
publ…

Entity Framework (EF) TransactionScope vs Database.BeginTransaction

In today blog post we will talk a little about a new feature that is available on EF6+ related to Transactions.
Until now, when we had to use transaction we used ‘TransactionScope’. It works great and I would say that is something that is now in our blood.
using (var scope = new TransactionScope(TransactionScopeOption.Required)) { using (SqlConnection conn = new SqlConnection("...")) { conn.Open(); SqlCommand sqlCommand = new SqlCommand(); sqlCommand.Connection = conn; sqlCommand.CommandText = ... sqlCommand.ExecuteNonQuery(); ... } scope.Complete(); } Starting with EF6.0 we have a new way to work with transactions. The new approach is based on Database.BeginTransaction(), Database.Rollback(), Database.Commit(). Yes, no more TransactionScope.
In the followi…