One of my colleague tried implemented a browser history mechanism for MVC. Based on this data he would like to generate two simple reports:
I will try to propose a possible data model that is using Windows Azure Table.
In the default implementation (using SQL Azure), there were 3 kind of information that is stored in the SQL tables:
Top 5 web addresses accessed by a given user per day
To be able to solve this problem we need a data model that permit us to refer to URLs for a given day and user.
Theoretically, we can have an unlimited number of tables in Windows Azure Tables (and we don’t need to pay for each table in part). Because of this we can have a different table for each day. In this way the cleanup mechanism will be extremely simple. Also, when we want to access historical, selecting a specific day will be very easily.
We already know that each table from Windows Azure Tables contains two fields that play the role of keys: Partition Key and Row Key. Partition key can be used with success when we have different items types saved in the same table. Entities could be grouped based on the user, because of this we can save in the Partition Key the user id. In this way we will be able to specify a specific date and user.
In the row key we can store the visited URL. Another property (column) will be needed to store how many times the URL was visited in a specific day. On the server we can define a mechanism that will add or increment a visited URL.
The downside of this solution is that we will need to make two different transactions when we want to increment the history counter. One transaction that brings the current counter for the given user and URL (if exist) and another one that update (insert) the counter value.
When we need to generate the TOP 5 for a given day per user we will have to load all URLs for specific user and order this by the counter value. We don’t need to forget that in Windows Azure Tables we don’t have support for Order By and Top N functions of a query – because of this we need to retrieve all the URLs that a user visited in a given day.
Top 5 web addresses accessed by all user
Option 1
To full fit this request we need to create another Windows Azure Table that will store the URL and the counter for each URL. To be able to support Top 5 not only for URL but also for domains we will store in the Partition Key the URL domain and the Row Key the rest of the URL path.
Another solution would be to have in the Partition Key the same values for all rows and in the Row Key the full URL. We don’t want to have in the Partition Key the URL (and in the Row Key the counter) because in this case we risk having a table that is fragmented – when a Windows Azure Table is too big, Windows Azure can split our table based on the Partition Key and move our spited table on different machines (this is not visible for consumer – is an implementation detail).
The biggest downside of this solution in the moment when we need to retrieve the top 5 most visited URLs. We will need to retrieve all the content from our table and calculate the top 5.
Option 2
Another possible solution for this problem is to have more than one table. Based on how many a URL was visited we will be stored in a specific table. For example we would have
The good part of this implementation is in the moment when we need to calculate top 5 and we can retrieve only a part of the URLs.
These were the possible implementation that I see using Windows Azure Tables. I think that this problem is not suitable for Windows Azure Table and a relational database is better for this case.
I didn’t forget about a problem that needs to be solved here – concurrency. I will come with a post tomorrow.
Part 2
- Top 5 web addresses accessed by a given user per day
- Top 5 web addresses accessed by all user
I will try to propose a possible data model that is using Windows Azure Table.
In the default implementation (using SQL Azure), there were 3 kind of information that is stored in the SQL tables:
- URL visited
- User
- Date
Top 5 web addresses accessed by a given user per day
To be able to solve this problem we need a data model that permit us to refer to URLs for a given day and user.
Theoretically, we can have an unlimited number of tables in Windows Azure Tables (and we don’t need to pay for each table in part). Because of this we can have a different table for each day. In this way the cleanup mechanism will be extremely simple. Also, when we want to access historical, selecting a specific day will be very easily.
We already know that each table from Windows Azure Tables contains two fields that play the role of keys: Partition Key and Row Key. Partition key can be used with success when we have different items types saved in the same table. Entities could be grouped based on the user, because of this we can save in the Partition Key the user id. In this way we will be able to specify a specific date and user.
In the row key we can store the visited URL. Another property (column) will be needed to store how many times the URL was visited in a specific day. On the server we can define a mechanism that will add or increment a visited URL.
The downside of this solution is that we will need to make two different transactions when we want to increment the history counter. One transaction that brings the current counter for the given user and URL (if exist) and another one that update (insert) the counter value.
When we need to generate the TOP 5 for a given day per user we will have to load all URLs for specific user and order this by the counter value. We don’t need to forget that in Windows Azure Tables we don’t have support for Order By and Top N functions of a query – because of this we need to retrieve all the URLs that a user visited in a given day.
Top 5 web addresses accessed by all user
Option 1
To full fit this request we need to create another Windows Azure Table that will store the URL and the counter for each URL. To be able to support Top 5 not only for URL but also for domains we will store in the Partition Key the URL domain and the Row Key the rest of the URL path.
Another solution would be to have in the Partition Key the same values for all rows and in the Row Key the full URL. We don’t want to have in the Partition Key the URL (and in the Row Key the counter) because in this case we risk having a table that is fragmented – when a Windows Azure Table is too big, Windows Azure can split our table based on the Partition Key and move our spited table on different machines (this is not visible for consumer – is an implementation detail).
The biggest downside of this solution in the moment when we need to retrieve the top 5 most visited URLs. We will need to retrieve all the content from our table and calculate the top 5.
Option 2
Another possible solution for this problem is to have more than one table. Based on how many a URL was visited we will be stored in a specific table. For example we would have
- VisitedUrls1to100
- VisitedUrls101to1000
- VisitedUrls1001to10000
- VisitedUrls10001to100000
The good part of this implementation is in the moment when we need to calculate top 5 and we can retrieve only a part of the URLs.
These were the possible implementation that I see using Windows Azure Tables. I think that this problem is not suitable for Windows Azure Table and a relational database is better for this case.
I didn’t forget about a problem that needs to be solved here – concurrency. I will come with a post tomorrow.
Part 2
Data Depulication also helps in removing congestion in the modules
ReplyDelete