
How do consistency levels affect the latency of Azure Cosmos DB

Azure Cosmos DB has 5 different consistency levels (Strong / Bounded Staleness / Session / Consistent Prefix / Eventual). Each consistency level can affect the latency of the operations that we run against the storage.
In this post we will try to answer the following question:

  • What is the latency impact of the different consistency levels? 

Latency in general
The current SLAs guarantee that read and write operations complete in under 10ms in 99% of cases. The average latency of a content fetch from Azure Cosmos DB is under 4ms in 50% of cases.
Write operations are a little slower, reaching a maximum of 5ms in 50% of cases. This applies only to Bounded Staleness / Session / Consistent Prefix and Eventual.
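The consistency level is configured on the Cosmos DB account, and client SDKs can request a weaker (never a stronger) level per connection. Below is a minimal sketch using the azure-cosmos Python SDK; the endpoint, key, database and container names are placeholders, not values from my test setup.

from azure.cosmos import CosmosClient

# Placeholders -- not real credentials.
ENDPOINT = "https://<your-account>.documents.azure.com:443/"
KEY = "<your-account-key>"

# Ask for Eventual consistency on this client to minimise read latency;
# the account-level default still caps the strongest level you can get.
client = CosmosClient(ENDPOINT, credential=KEY, consistency_level="Eventual")

container = client.get_database_client("sandbox-db").get_container_client("documents")

# A point read -- the kind of operation covered by the <10ms (99th percentile) read SLA above.
item = container.read_item(item="doc-1", partition_key="doc-1")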

For Strong consistency and databases replicated across multiple regions, the latency is higher, but this is expected because of the replication requirements. For example, if you have Strong consistency on a database replicated across two different regions, the latency will be roughly equal to two round-trip times between the farthest regions plus the 10ms latency, in 99% of the cases. The extra 10ms comes from the read (confirmation) operation required to ensure that the operation completed successfully.
There is one more thing that you need to take into account:

  • There is NO SLA for the latency between two different Azure Regions.
This means that it is impossible to calculate an exact SLA for Strong consistency. In most cases, the total latency will be:
  • Strong consistency across 2 regions = 10ms + 2 × round-trip time between the regions
in 99% of the cases.
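As a back-of-the-envelope example of this formula (the 60ms round-trip time is an assumed value for illustration only, since, as noted above, there is no SLA for inter-region latency):

P99_SINGLE_REGION_MS = 10   # read/write latency at the 99th percentile, from the SLA
ASSUMED_RTT_MS = 60         # hypothetical round-trip time between the two regions

# Strong consistency across 2 regions = 10ms + 2 * round-trip time
strong_p99_estimate_ms = P99_SINGLE_REGION_MS + 2 * ASSUMED_RTT_MS
print(f"Estimated 99th percentile latency: {strong_p99_estimate_ms} ms")  # 130 ms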

NOTE: Replication monitoring - Microsoft Azure monitors the replication latency for you. The information is available in the Azure Portal (Azure Portal / Metrics / Consistency Level).

The REAL TEST
Take into account that each time you run the same or a different test, the results will be different. There are multiple factors that can affect the results, including the machine used to run the test.

I ran all the tests from a Standard_D5_v2 VM with 16 vCores and 56 GB of memory. Each test ran 500,000 iterations and used the concepts and methodology from Practical Large-Scale Latency Estimation, which I have also used in the past for other types of measurements. There was a warm-up period, and the 4% of samples closest to the minimum and maximum latency were excluded. The initial collection size was around 100,000 documents, with an average document size of around 50KB.
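As a minimal sketch of that methodology (the warm-up length, iteration count for the sample run, and the measured operation are placeholders, not my actual harness):

import time
import statistics

def measured_operation():
    # Placeholder for the actual Cosmos DB read or write under test.
    time.sleep(0.002)

def run_benchmark(iterations=500_000, warmup=1_000, trim=0.04):
    for _ in range(warmup):                       # warm-up, not recorded
        measured_operation()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        measured_operation()
        samples.append((time.perf_counter() - start) * 1000)   # latency in ms

    samples.sort()
    cut = int(len(samples) * trim)                # drop 4% at each end
    trimmed = samples[cut:len(samples) - cut]

    print(f"P50: {statistics.median(trimmed):.2f} ms")
    print(f"P99: {trimmed[int(len(trimmed) * 0.99) - 1]:.2f} ms")

run_benchmark(iterations=10_000)                  # smaller run for the example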
Please take into account that these are the results I got for my sandbox; they do not represent the reality for other cases or for the general case.
The results I obtained were extremely good and gave me high confidence in the reliability of Azure Cosmos DB.

What about RPO and RTO?
Let's take the first one, the Recovery Point Objective (RPO). The current SLA is interesting, offering a maximum value of 240 minutes for any consistency level or number of replicas.
The current RPOs are:
  • Strong / Single master = 0 mins
  • Session / Multi-master < 15 mins
  • Consistent Prefix / Multi-master < 15 mins
  • Eventual / Multi-master < 15 mins
  • Maximum < 240 mins

The Recovery Time Objective (RTO) is similar, offering us an SLA with a maximum of 7 days, with:
  • Session / Multi-master = 0 mins
  • Consistent Prefix / Multi-master = 0 mins
  • Eventual / Multi-master = 0 mins
  • Strong / Single master < 15 mins
  • Session / Single master < 15 mins

Conclusion
The performance of the system can be directly impacted by the consistency level we decide to use. Each consistency level has a direct impact on performance, data consistency and cost. In most cases, the Session consistency level is a good tradeoff: strong consistency within the active user's session, eventual consistency across all other users, and very good performance.
