Azure HDInsight (Day 25 of 31)

List of all posts from this series: http://vunvulearadu.blogspot.ro/2014/11/azure-blog-post-marathon-is-ready-to.html

Short Description 
HDInsight enables us to provision Hadoop and HBase clusters on Azure infrastructure. This means that we can use Hadoop on the Azure platform without having to install, provision and configure it ourselves. We can use HDInsight as an out-of-the-box Hadoop solution.


Main Features 
Hadoop Version
The Hadoop distribution that runs on Azure is based on the Hortonworks Data Platform (HDP), which is developed and maintained by Hortonworks.
New version releases
It seems that new versions of Hadoop reach Azure faster and faster. In the past it used to take 4-6 months for a new version of Hadoop to reach Azure and HDInsight; now it takes only a few weeks.
HDFS
The Hadoop Distributed File System (HDFS) also exists in HDInsight, giving us the possibility to store data in a distributed file system.
MapReduce
The same programming model that we already know from Hadoop is available in HDInsight. In general, all the features that exist in Hadoop can be found in HDInsight, with the same properties and functionality.
Ambari
Ambari can be used to monitor, manage and provision the cluster that is used for processing. The Ambari API hides the complexity behind Hadoop and lets us focus on the final result rather than on the infrastructure.
HBase
The NoSQL database built on top of Hadoop, used to store data in a semi-structured way. We can store billions of rows in this data store. In the Azure implementation, HBase is backed by Azure Storage.
Hive
Built on top of MapReduce, it allows us to query the data stored in Hadoop using a SQL-like language (HiveQL). It works well when the queried data is more structured.
Pig
Similar to Hive, but recommended when the data is not structured and we need to execute complex MapReduce operations over large data sets.
Mahout
It allows us to build machine learning solutions that run over Hadoop. Based on past events and data sets, we can predict the future behavior of different components.
MapReduce
One of the core components of Hadoop, which allows us to execute operations over large data sets in a distributed way (on multiple nodes, in parallel). The data is organized as (key, value) pairs for later processing.
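To make the (key, value) idea more concrete, here is a minimal sketch of the map, shuffle and reduce steps as a classic word count, written in plain Python. It runs in a single process and does not use Hadoop or HDInsight at all; the function names (map_phase, shuffle, reduce_phase, word_count) are chosen only for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word found in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all the values that share one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps one after another on an in-memory data set."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["azure hadoop azure", "hadoop hive"]))
```

On a real cluster the map and reduce functions run on different nodes and the framework does the grouping, but the shape of the data stays the same.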
YARN
The next generation of MapReduce (MRv2), which splits the handling of jobs into two separate entities:

  • Scheduling and monitoring
  • Resource Management

Oozie
It is the framework used to coordinate the workflows that run on this system. It can be used for different tasks, such as running shell scripts or scheduling jobs.
Storm
A real-time computation system for processing streams of data.
Zookeeper
It holds small amounts of metadata (like location, configuration, status) and is used to coordinate large distributed systems using a simple hierarchy of data registers (znodes).
Sqoop
It is used to import and export data between Hadoop and relational databases such as SQL Server.
Out of the box clusters
Hadoop clusters can be provisioned in minutes, without having to configure and manage them.
Integration with Azure Services
HDInsight is integrated with different Azure Services like SQL Database or Websites. In this way we can implement different scenarios more easily.
Power Query
There is full support for Excel using Power Query to access and view data from Hadoop.
Cluster Size
We have the ability to specify the size of the Hadoop cluster that we want to use. Based on our needs we can set a custom cluster size.
Instance types
Users have the ability to specify what kind of machines to use in their HDInsight cluster.
Low storage cost
Because Azure Storage blobs are used to store Hadoop data, the cost of storage is low. In this way we can store large amounts of data without worrying much about the cost.
Storage scaling
Storing data in blobs removes the problems that can appear when we need to scale out HDFS.
Migrate Hadoop from on-premises to HDInsight
Thanks to the integration with HDP (Hortonworks Data Platform), we can migrate data from one storage to another automatically. HDP also allows us to query HDInsight and an on-premises instance at the same time.

Limitations 
Start/Stop Cluster
At this moment we don’t have the ability to stop a cluster. Currently, stopping a cluster is equivalent to deleting it. The content stored in blobs is not lost.
Elastic size of cluster
At this moment we cannot dynamically change the size of a cluster. If you need to change its size, you have to tear down and rebuild the cluster.

Applicable Use Cases 
Below you can find some use cases where I would use HDInsight.
Analyze information from Smart Home solution
The data produced by a Smart Home solution can be very valuable, but its volume is very high. In this scenario HDInsight can be used successfully to analyze the data and extract valuable information from it.
eCommerce application logs
eCommerce application logs are like a gold mine. Based on this information you can find out what the system bottlenecks are, which products are more attractive to clients and much more. HDInsight can be used successfully to analyze all this data.
Image analysis
Image analysis is a job that consumes not only a lot of CPU power, but also a lot of storage. Because of this, an image analysis system can be built on top of HDInsight.

Code Sample 
A sample that I really enjoyed and that is very useful can be found here: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-sample-pi-estimator/
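The linked sample is not reproduced here. As a rough idea of how a pi estimator can be split into map and reduce steps, below is an independent sketch in plain Python. It uses ordinary pseudo-random sampling, which may differ from the method used in the linked sample, and it runs locally without Hadoop. The names map_task, reduce_task and estimate_pi, as well as the default parameters, are assumptions made for this illustration.

```python
import random
from typing import List, Tuple


def map_task(seed: int, samples: int) -> Tuple[int, int]:
    """One 'mapper': throw random points into the unit square and count
    how many of them fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples


def reduce_task(partials: List[Tuple[int, int]]) -> float:
    """The 'reducer': sum the partial counts and turn the ratio into pi.
    The quarter circle covers pi/4 of the square, hence the factor 4."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


def estimate_pi(num_maps: int = 8, samples_per_map: int = 50_000) -> float:
    """Run the mappers one after another (a cluster would run them in parallel)."""
    partials = [map_task(seed, samples_per_map) for seed in range(num_maps)]
    return reduce_task(partials)


if __name__ == "__main__":
    print(estimate_pi())
```

Each mapper works only on its own points, so adding more mappers (or more nodes) increases the number of samples without changing the code of the reducer.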

Pros and Cons 
Pros

  • Scalable
  • Easy to integrate
  • Out of the box solution
  • Integration with a lot of external libraries and systems


Cons

  • The Start/Stop feature is not yet supported
  • Elastic-scale is not yet supported


Pricing 
If you are considering HDInsight, you should keep an eye on:

  • Cluster Size
  • Instance types
  • Outbound traffic
  • Storage size
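The four factors above can be combined into a back-of-the-envelope estimate. The sketch below is only an illustration of the arithmetic: the ClusterEstimate class is hypothetical, the model is deliberately simplified, and every rate is a placeholder that you have to replace with the values from the official Azure pricing page.

```python
from dataclasses import dataclass


@dataclass
class ClusterEstimate:
    """Hypothetical cost model; all rates are placeholders, not real prices."""
    nodes: int                    # cluster size
    node_rate_per_hour: float     # depends on the instance type
    hours: float                  # how long the cluster is kept alive
    storage_gb: float             # data kept in blob storage
    storage_rate_per_gb: float    # storage price for the same period
    outbound_gb: float            # data leaving the data center
    outbound_rate_per_gb: float

    def total(self) -> float:
        compute = self.nodes * self.node_rate_per_hour * self.hours
        storage = self.storage_gb * self.storage_rate_per_gb
        traffic = self.outbound_gb * self.outbound_rate_per_gb
        return compute + storage + traffic


if __name__ == "__main__":
    # Placeholder numbers, chosen only to show how the parts add up.
    estimate = ClusterEstimate(
        nodes=4, node_rate_per_hour=0.5, hours=10,
        storage_gb=100, storage_rate_per_gb=0.02,
        outbound_gb=10, outbound_rate_per_gb=0.1,
    )
    print(estimate.total())
```

Because a cluster cannot be stopped, only deleted, the hours value is usually the one that you have the most control over.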


Conclusion
Azure HDInsight is a must-have if you store your data and logs in Azure Storage. Don’t be afraid to analyze your data, because you may find very interesting things/facts.
