
Should(n't) peek and lock a message from Azure Service Bus for 5 hours

Let’s talk about anti-patterns that appear when business drives technology.
Imagine that you work for a bank where you implement message-based communication built on top of Azure Service Bus.
Everything works great for a few years, until one of the business stakeholders implements their application in such a way that a message needs to be locked for 5 hours. They used the competing-consumers pattern to support a business requirement: wait for confirmation from external users for a maximum of 5 hours. If there is no confirmation from the users, the request (message) shall be processed again.

Issues
There are a few mistakes in the way this business requirement was implemented:
(1) The maximum lock duration for a message peeked from Azure Service Bus is 5 minutes. It is impossible to peek a message from a subscription or queue and hold the lock for 5 hours.
(2) The number of retries is not clearly defined. If the user(s) never confirm the action, the message becomes available again over and over, blocking other messages from being processed and increasing the load on the system. You need to specify a maximum number of retries before pushing the message to the dead-letter queue.
For actions that take longer than 1 or 2 minutes, Peek and Lock on top of Azure Service Bus might not be the best solution. You should challenge yourself and check whether an ESB is the best option for your needs.
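To make the mismatch concrete, here is a minimal sketch (the helper name and the design-time check are my own, not part of any Azure SDK) that rejects a lock duration above the 5-minute service cap before it ever reaches the broker:

```python
from datetime import timedelta

# Azure Service Bus caps the lock duration of a peeked message at 5 minutes.
MAX_LOCK_DURATION = timedelta(minutes=5)

def validate_lock_duration(requested: timedelta) -> timedelta:
    """Hypothetical design-time check: clamp a requested lock duration
    to what the service actually allows, so the mismatch surfaces
    before deployment instead of at runtime."""
    if requested > MAX_LOCK_DURATION:
        raise ValueError(
            f"Requested lock of {requested} exceeds the Service Bus "
            f"maximum of {MAX_LOCK_DURATION}; use a different pattern."
        )
    return requested

# The 5-hour business requirement fails immediately:
try:
    validate_lock_duration(timedelta(hours=5))
except ValueError as e:
    print("rejected:", e)
```

A check like this turns a silent design flaw into an explicit error at the edge of the system.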

Alternative solution
The advantage of Peek and Lock with Azure Service Bus is reliability. When the content is not processed successfully by the consumer, the message becomes available again in the subscription for another try. The tricky part is to find a simple way to keep the same quality attribute at a low cost.
After a discussion with the business stakeholder, we identified that the maximum number of retries is 3 and, because of the multiple consumers, Azure Service Bus remains one of the preferred solutions. Azure Cosmos DB is reliable storage that can be used in combination with its TTL feature.
We could have a system that consumes messages from Azure Service Bus in Peek-and-Lock mode. The messages are pushed to Azure Cosmos DB as documents with TTL (Time To Live) set to 5 hours, which means the content is stored in Cosmos DB for 5 hours. If the user or a 3rd party confirms the action within that window, the document is removed from Azure Cosmos DB; otherwise, it expires automatically after 5 hours.
An Azure Function can be registered to react when the TTL expires and push the message back to Azure Service Bus. At that point, the retry counter is incremented by 1. When the retry counter reaches 3, we log an issue and generate a business alert.
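The document written to Cosmos DB could look like the sketch below (the field names are a hypothetical schema, not a prescribed one; only `id` and `ttl` have meaning to Cosmos DB, and per-document `ttl` is expressed in seconds and requires TTL to be enabled on the container):

```python
import json
import uuid
from datetime import datetime, timezone

FIVE_HOURS = 5 * 60 * 60  # Cosmos DB per-document TTL is expressed in seconds

def to_pending_document(message_body: dict, retry_count: int = 0) -> dict:
    """Wrap a Service Bus message body in a Cosmos DB document that
    expires automatically after 5 hours if nobody confirms (and deletes) it."""
    return {
        "id": str(uuid.uuid4()),
        "body": message_body,
        "retryCount": retry_count,
        "enqueuedAt": datetime.now(timezone.utc).isoformat(),
        "ttl": FIVE_HOURS,  # requires DefaultTimeToLive enabled on the container
    }

doc = to_pending_document({"action": "approve-transfer", "amount": 100})
print(json.dumps(doc, indent=2))
```

When the user confirms in time, the consumer simply deletes this document; deletion before expiry is what cancels the retry.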
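The retry decision inside that Azure Function can be sketched as pure logic (the function and return values are illustrative, assuming the 3-retry budget agreed with the stakeholder):

```python
MAX_RETRIES = 3

def on_ttl_expired(document: dict) -> str:
    """Decide what to do when a pending document expires unconfirmed:
    either re-queue the message on Service Bus with an incremented
    retry counter, or raise a business alert once the budget is spent."""
    retries = document["retryCount"] + 1
    if retries >= MAX_RETRIES:
        return "alert"          # log an issue and generate a business alert
    document["retryCount"] = retries
    return "requeue"            # push the message back to Service Bus

doc = {"retryCount": 0}
print(on_ttl_expired(doc))  # requeue; retryCount becomes 1
doc["retryCount"] = 2
print(on_ttl_expired(doc))  # alert; retry budget exhausted
```

Keeping this decision in one small, side-effect-free function makes the retry policy trivial to unit test.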

Things to consider:
(1) The role of Azure Service Bus in managing multiple consumers is crucial. Implementing competing consumers directly on top of a NoSQL solution is error-prone.
(2) We still use Peek and Lock at the consumer level to ensure that we have a reliable way to process the message and send it to Azure Cosmos DB.
(3) Any fatal error at the Azure Functions level needs to be managed by ourselves; it is the only place where we could lose a message because of an application error. The Service Bus consumer is protected by Peek and Lock.
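Point (2) boils down to settlement order: complete the Service Bus message only after the document is safely in Cosmos DB. A minimal sketch of that consumer flow, with the Cosmos write and the settlement calls passed in as stand-ins (hypothetical parameter names, no real SDK calls):

```python
def consume(message, cosmos_writer, completer, abandoner):
    """Consumer-side flow (sketch): the Service Bus message is completed
    only AFTER the document lands in Cosmos DB. A crash or error in
    between leaves the message locked, and it reappears in the
    subscription for another consumer once the lock expires."""
    try:
        cosmos_writer(message)   # persist as a TTL document first
    except Exception:
        abandoner(message)       # release the lock; message becomes visible again
        return "abandoned"
    completer(message)           # settle the message: removed from the subscription
    return "completed"

# Simulated happy path and failure path:
print(consume({"id": 1}, cosmos_writer=lambda m: None,
              completer=lambda m: None, abandoner=lambda m: None))
print(consume({"id": 2}, cosmos_writer=lambda m: 1 / 0,
              completer=lambda m: None, abandoner=lambda m: None))
```

This ordering is what preserves the at-least-once guarantee up to the point where Cosmos DB takes over.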

In the diagrams below, you can find a high-level overview of the solution.

Conclusion
Of course, we might find other solutions. For this case, I consider this one of the simplest, with a low impact on implementation cost and without affecting operations activities. Both Azure Service Bus and Azure Cosmos DB are reliable stores for messages and documents, and Peek and Lock combined with TTL helps us connect the end-to-end flow.
