Skip to main content

Data Model for Reporting over Windows Azure Tables

One of my colleague tried implemented a browser history mechanism for MVC. Based on this data he would like to generate two simple reports:
  1. Top 5 web addresses accessed by a given user per day
  2. Top 5 web addresses accessed by all user
When the data store is implemented using SQL Azure, this problem can be resolved very simple.  The question that appeared here is: Can we implement a data store model using Windows Azure Table Services?
I will try to propose a possible data model that is using Windows Azure Table.
In the default implementation (using SQL Azure), there were 3 kind of information that is stored in the SQL tables:
  • URL visited
  • User
  • Date
Because we don’t have an order by, count or a max function in a query over Windows Azure Table we need to think at a model that would help us with this. We will start with the first requirement:

Top 5 web addresses accessed by a given user per day
To be able to solve this problem we need a data model that permit us to refer to URLs for a given day and user.
Theoretically, we can have an unlimited number of tables in Windows Azure Tables (and we don’t need to pay for each table in part). Because of this we can have a different table for each day. In this way the cleanup mechanism will be extremely simple. Also, when we want to access historical, selecting a specific day will be very easily.
We already know that each table from Windows Azure Tables contains two fields that play the role of keys: Partition Key and Row Key. Partition key can be used with success when we have different items types saved in the same table. Entities could be grouped based on the user, because of this we can save in the Partition Key the user id. In this way we will be able to specify a specific date and user.
In the row key we can store the visited URL. Another property (column) will be needed to store how many times the URL was visited in a specific day. On the server we can define a mechanism that will add or increment a visited URL.
The downside of this solution is that we will need to make two different transactions when we want to increment the history counter. One transaction that brings the current counter for the given user and URL (if exist) and another one that update (insert) the counter value.
When we need to generate the TOP 5 for a given day per user we will have to load all URLs for specific user and order this by the counter value. We don’t need to forget that in Windows Azure Tables we don’t have support for Order By and Top N functions of a query – because of this we need to retrieve all the URLs that a user visited in a given day.

Top 5 web addresses accessed by all user

Option 1
To full fit this request we need to create another Windows Azure Table that will store the URL and the counter for each URL. To be able to support Top 5 not only for URL but also for domains we will store in the Partition Key the URL domain and the Row Key the rest of the URL path.
Another solution would be to have in the Partition Key the same values for all rows and in the Row Key the full URL. We don’t want to have in the Partition Key the URL (and in the Row Key the counter) because in this case we risk having a table that is fragmented – when a Windows Azure Table is too big, Windows Azure can split our table based on the Partition Key and move our spited table on different machines (this is not visible for consumer – is an implementation detail).
The biggest downside of this solution in the moment when we need to retrieve the top 5 most visited URLs. We will need to retrieve all the content from our table and calculate the top 5.

Option 2
Another possible solution for this problem is to have more than one table. Based on how many a URL was visited we will be stored in a specific table. For example we would have
  • VisitedUrls1to100
  • VisitedUrls101to1000
  • VisitedUrls1001to10000
  • VisitedUrls10001to100000
For example when the URL is in the first table (VisitedUrls1to100) and the counter value will reach 101, that the entity will be moved to the next table and so on. In this implementation we will have a big problem to find a URL to increment the counter. To optimize this we would need to use another table that would store the URL and the table name where our URL counter can be found (VisitedUrls1to100).
The good part of this implementation is in the moment when we need to calculate top 5 and we can retrieve only a part of the URLs.

These were the possible implementation that I see using Windows Azure Tables. I think that this problem is not suitable for Windows Azure Table and a relational database is better for this case.
I didn’t forget about a problem that needs to be solved here – concurrency. I will come with a post tomorrow.
Part 2

Comments

Post a Comment

Popular posts from this blog

Windows Docker Containers can make WIN32 API calls, use COM and ASP.NET WebForms

After the last post , I received two interesting questions related to Docker and Windows. People were interested if we do Win32 API calls from a Docker container and if there is support for COM. WIN32 Support To test calls to WIN32 API, let’s try to populate SYSTEM_INFO class. [StructLayout(LayoutKind.Sequential)] public struct SYSTEM_INFO { public uint dwOemId; public uint dwPageSize; public uint lpMinimumApplicationAddress; public uint lpMaximumApplicationAddress; public uint dwActiveProcessorMask; public uint dwNumberOfProcessors; public uint dwProcessorType; public uint dwAllocationGranularity; public uint dwProcessorLevel; public uint dwProcessorRevision; } ... [DllImport("kernel32")] static extern void GetSystemInfo(ref SYSTEM_INFO pSI); ... SYSTEM_INFO pSI = new SYSTEM_INFO(

Azure AD and AWS Cognito side-by-side

In the last few weeks, I was involved in multiple opportunities on Microsoft Azure and Amazon, where we had to analyse AWS Cognito, Azure AD and other solutions that are available on the market. I decided to consolidate in one post all features and differences that I identified for both of them that we should need to take into account. Take into account that Azure AD is an identity and access management services well integrated with Microsoft stack. In comparison, AWS Cognito is just a user sign-up, sign-in and access control and nothing more. The focus is not on the main features, is more on small things that can make a difference when you want to decide where we want to store and manage our users.  This information might be useful in the future when we need to decide where we want to keep and manage our users.  Feature Azure AD (B2C, B2C) AWS Cognito Access token lifetime Default 1h – the value is configurable 1h – cannot be modified

What to do when you hit the throughput limits of Azure Storage (Blobs)

In this post we will talk about how we can detect when we hit a throughput limit of Azure Storage and what we can do in that moment. Context If we take a look on Scalability Targets of Azure Storage ( https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/ ) we will observe that the limits are prety high. But, based on our business logic we can end up at this limits. If you create a system that is hitted by a high number of device, you can hit easily the total number of requests rate that can be done on a Storage Account. This limits on Azure is 20.000 IOPS (entities or messages per second) where (and this is very important) the size of the request is 1KB. Normally, if you make a load tests where 20.000 clients will hit different blobs storages from the same Azure Storage Account, this limits can be reached. How we can detect this problem? From client, we can detect that this limits was reached based on the HTTP error code that is returned by HTTP