
Hadoop and HDFS


We have all heard about trends. There are trends in music, in fashion and, of course, in IT. For 2013 several trends were announced that are already part of our lives. Who hasn't heard about cloud, machine to machine (M2M) or NoSQL? All these trends have entered our lives and become part of our ordinary day. Big Data is a trend that existed last year and it remains one of the strongest.
In the next part of this article I would like to talk about Hadoop. Why? Because Big Data doesn't exist without Hadoop. Big Data can be an amount of bytes that the client wouldn't even know how to process. Clients started asking a long time ago for a scalable way of processing data. On ordinary systems, processing 50 TB would be a problem. Computer systems for indexing and processing this amount of data are extremely costly, not just financially but also in terms of time.
At this point Hadoop is one of the best (maybe the best) processing solutions for large amounts of data. First of all, let's see how it appeared and how it ended up being a system that can run on 40,000 distributed instances without a problem.


A little bit of history
Several years ago (2003-2004), Google published an article about the way it handles its huge amounts of data. It explained the solution Google uses to process and store large amounts of data. Because the number of sites available on the internet was growing so fast, the Apache Software Foundation started creating Apache Hadoop based on the Google article. We could say that the article became the standard for storing, processing and analyzing data.
Two important features for which Hadoop is currently adopted by many companies for processing Big Data are scalability and the unique way in which data is processed and stored. We will talk about these features a little later in the article.
During the entire development process, Hadoop has been and will remain an open source project. At the beginning it was supported by Yahoo, which needed an indexing system for its search engine. Because the system worked so well, Yahoo ended up using it for advertising as well.
A very interesting thing is that Hadoop didn't appear overnight and at the beginning it wasn't as robust as it is today. In its early days it had scalability problems when it had to scale up to 10-20 nodes, and the same was true for performance. Companies such as Yahoo, IBM, Cloudera and Hortonworks saw the value that Hadoop was bringing and invested in it; each of them had a similar in-house system that tried to solve the same problems. By now Hadoop has become a robust system that can be used successfully. Companies such as Yahoo, Facebook, IBM, eBay, Twitter and Amazon use it without a problem.
Since data can be stored very simply in Hadoop and the processed information occupies very little space, any legacy or large system can store data for a long time very easily and with minimal costs.

Data storage – HDFS
The way Hadoop is built is very interesting. Each part is designed for something big, from file storage to processing and scalability. One of the most important and interesting components of Hadoop is its file storage system - the Hadoop Distributed File System (HDFS).
Generally, when we talk about storage systems with high capacity our thoughts lead us to custom hardware, which is extremely costly (price and maintenance). HDFS is a system which doesn't need special hardware. It runs smoothly on normal configurations and can be used together with our home or office computers.

1. Sub-system management
Maybe one of the most important properties of this system is the way each hardware problem is seen. From the beginning this system was designed to run on tens or hundreds of instances, therefore any hardware problem that can occur is not seen as an error but as an exception from the normal flow. HDFS is a system that knows that not all the registered components will work. Being aware of this, it is always ready to detect any problem that might appear and to start the recovery procedure.
Each component of the system stores a part of the files and each stored piece of data can be replicated in one or more locations. HDFS is seen as a system that can be used to store files of several gigabytes, reaching up to several terabytes. The system is prepared to distribute a file over one or more instances.

2. Data access
Normal file system storages have data storage as their main purpose, and they send us the data we need for processing. HDFS is totally different. Because it works with large amounts of data, it solves this problem in a very innovative way. With any system we use, we will have problems the moment we want to transfer large amounts of data for processing. HDFS allows us to send the processing logic to the components where the data is kept. Through this mechanism the data needed for processing is not transferred; only the final result has to be passed on (and only when needed).
In such a system you would expect a highly complex versioning mechanism, one that would allow multiple writers on the same file. In fact, HDFS is a storage system that allows only one writer and multiple readers. It is designed this way because of the type of data it contains. This data doesn't change very often, so it doesn't need modifications. For example, the logs of an application will not change, and the same thing happens with data obtained from an audit. Very often, data that is stored after processing ends up being either erased or never changed.
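To make this concrete, here is a minimal sketch with the Hadoop Java client API; the NameNode address hdfs://namenode:8020 and the path /logs/app.log are only assumptions for the example. The file is written once through a single output stream and, after the stream is closed, any number of clients can open it for reading.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path log = new Path("/logs/app.log"); // hypothetical file

            // Single writer: the file is created and written exactly once.
            try (FSDataOutputStream out = fs.create(log)) {
                out.write("2013-05-01 INFO application started\n".getBytes(StandardCharsets.UTF_8));
            }

            // Multiple readers: once the file is closed, any client can open it for reading.
            try (FSDataInputStream in = fs.open(log)) {
                byte[] buffer = new byte[4096];
                int read = in.read(buffer);
                System.out.println(new String(buffer, 0, read, StandardCharsets.UTF_8));
            }
        }
    }
}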

3. Portability
Before talking about the architecture of the system and how it works, I would like to mention another property that the system has - portability. For this reason HDFS is used not only together with the Hadoop system but also as a standalone storage system. I think this property helped HDFS become widespread.
From the software point of view, it is written in Java and can run on any system.

4. NameNode and DataNode
If we talk about the architecture of such a system, it is necessary to introduce two terms into our vocabulary: NameNode and DataNode. It is a master-slave system.
The NameNode is "the master" of the storage system. It handles the file names of the storage system and knows where each file can be found - the file mapping. It doesn't store the file data; it only deals with the mapping of the files, knowing at every moment the location where they are stored. Once a name has been resolved by the NameNode, the client is redirected to the DataNodes.
The DataNode is "the slave" that stores the actual content of the files. Clients access the DataNodes to read and write the stored data.
As a system that is prepared for the failure of a component, in addition to the NameNode we have a SecondaryNameNode. This component automatically makes checkpoints of the NameNode, and if something happens to the NameNode it is ready to provide the checkpoint in order to restore the state the NameNode had before the failure. Note that the SecondaryNameNode will never take over the role of the NameNode. It will not resolve the locations where files are stored. Its only purpose is to create checkpoints for the NameNode.
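As a rough illustration of this master-slave split, the sketch below (Hadoop Java client API; the address hdfs://namenode:8020 and the /logs directory are assumptions for the example) lists a directory. Listing is a pure metadata operation, so only the NameNode is contacted and no file content travels from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFromNameNode {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            // listStatus only asks the NameNode for metadata;
            // the DataNodes are not touched at all.
            for (FileStatus status : fs.listStatus(new Path("/logs"))) {
                System.out.printf("%s  %d bytes, replication %d%n",
                        status.getPath(), status.getLen(), status.getReplication());
            }
        }
    }
}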

5. Data storage
All data is stored as files. For the client the file is not divided into several parts, even if this happens internally. Internally the file is divided into blocks that end up being stored on one or more DataNodes. A large file can be stored on 2, 3 or even 20 nodes. The NameNode controls this and may require the blocks to be replicated in several locations.
Figure 1 shows the architecture of the system.
What is interesting about this architecture is how it works with files. The data accessed by clients never passes through the NameNode. Therefore, even if we have only one NameNode in the whole system, once the location of a file is resolved no client request needs to go through the NameNode.
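A small sketch of how a client can see this block layout through the standard Java API; the file /data/clicks-2013.log and the NameNode address are hypothetical. The NameNode answers with the blocks of the file and the DataNodes that host each of them; the content itself would later be read directly from those DataNodes.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path bigFile = new Path("/data/clicks-2013.log"); // hypothetical large file
            FileStatus status = fs.getFileStatus(bigFile);

            // The NameNode returns the blocks and the DataNodes that host them;
            // the block content is read directly from those DataNodes afterwards.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset %d, length %d, hosts %s%n",
                        block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
            }
        }
    }
}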

6. File structure
The way files are stored to be accessed by clients is very simple. The client can define a structure of directories and files. All this metadata is stored by the NameNode, which is the only one that knows how folders and files are defined by the client. Options such as hard links or soft links are not supported by HDFS.
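A short sketch of how a client builds such a structure (the path is hypothetical); only the NameNode records the resulting layout.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDirectoryTree {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            // mkdirs creates the whole path in one call; the directory tree
            // lives only in the NameNode's namespace, not on the DataNodes.
            fs.mkdirs(new Path("/reports/2013/05"));
        }
    }
}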

7. Replication
Because the stored data is very important, HDFS allows us to set the number of copies that we want to have for each file. This can be set when creating the file or anytime thereafter. The NameNode is the one that knows the number of copies that must exist for each file and makes sure that they exist.
For example, when the replication factor of a file is increased, the NameNode takes care that all of its blocks are replicated again. The NameNode's job doesn't end here. It receives an "I'm alive" signal - a heartbeat - from each node at specific intervals. If the signal stops arriving from a node, the NameNode automatically starts the recovery procedure for the blocks stored there.
How the data is replicated is extremely complex and HDFS must take many factors into account. When a new copy has to be made we must be careful, because this operation consumes bandwidth. For these reasons there is a load balancer that handles the distribution of data in the cluster and the locations where the data is copied, so that the load can be balanced in the best way possible. There are many options for replica placement; one of the most common is to keep one of the copies on the same node as the writer. In terms of distribution of the replicas across racks, two thirds end up on the same rack and the remaining one on a separate rack.
Data from a DataNode may automatically be moved to another DataNode if the system detects that the data is not evenly distributed.
All the copies that exist for a file are used. Depending on the location from which the data is requested, the client will access the closest copy. Therefore, HDFS is a system that knows what the internal network looks like - including every DataNode, rack and other systems.
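The sketch below shows the two usual ways to control replication from the Java API, assuming a hypothetical existing file /audit/2013-05.log: a default replication factor for newly created files via dfs.replication, and setReplication for a file that already exists.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        // Default replication factor used for files created by this client.
        conf.setInt("dfs.replication", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            Path audit = new Path("/audit/2013-05.log"); // hypothetical existing file

            // Ask the NameNode to keep 5 copies of every block of this file;
            // the missing copies are created in the background.
            boolean accepted = fs.setReplication(audit, (short) 5);
            System.out.println("Replication change accepted: " + accepted);
        }
    }
}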

8. Namespace
The namespace that the NameNode keeps is stored in RAM so it can be accessed easily. It is copied to the hard disk at precisely defined intervals - the name of the image that is written to disk is FsImage. Because the copy on the disk is not always the same as the one in RAM, there is a file in which all changes made to the file and folder structure are logged - the EditLog. This way, if something happens to the RAM memory or to the NameNode, recovery is simple and includes the latest changes.
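For illustration, the snippet below reads the two settings that govern this behaviour, assuming the Hadoop 2.x key names (dfs.namenode.name.dir for where FsImage and the EditLog are persisted, dfs.namenode.checkpoint.period for how often the SecondaryNameNode merges them into a new checkpoint); older releases use different key names, so treat these as assumptions.

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Directory where the NameNode persists FsImage and the EditLog (Hadoop 2.x key name).
        String nameDir = conf.get("dfs.namenode.name.dir", "file://${hadoop.tmp.dir}/dfs/name");

        // How often (in seconds) the SecondaryNameNode builds a new FsImage checkpoint.
        long checkpointPeriod = conf.getLong("dfs.namenode.checkpoint.period", 3600);

        System.out.println("NameNode metadata directory: " + nameDir);
        System.out.println("Checkpoint period: " + checkpointPeriod + " seconds");
    }
}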

9. Data manipulation
A very interesting thing is how a client creates a file. Initially, the data is not stored directly on the DataNodes, but in a temporary location. Only when there is enough data to make a write operation worthwhile is the NameNode notified and the data copied to the DataNodes.
When a client wants to delete a file, it is not physically deleted from the system. The file is only marked for deletion and moved to the trash directory. Only the latest copy of the file is kept in the trash, and the client can go into that folder to retrieve it. All files in the trash folder are automatically deleted after a certain period of time.
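A minimal sketch of both operations with the Java client, assuming a hypothetical path and that the trash feature is enabled through fs.trash.interval; the Trash helper class moves the file into the user's trash directory instead of removing it immediately.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;
import java.nio.charset.StandardCharsets;

public class CreateAndTrash {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        // Keep deleted files in the trash for 60 minutes before they are purged.
        conf.setLong("fs.trash.interval", 60);

        try (FileSystem fs = FileSystem.get(conf)) {
            Path report = new Path("/reports/2013/05/summary.txt"); // hypothetical file

            // The data reaches the DataNodes block by block;
            // the file becomes visible once the stream is closed.
            try (FSDataOutputStream out = fs.create(report)) {
                out.write("monthly summary".getBytes(StandardCharsets.UTF_8));
            }

            // Move the file to the trash directory instead of deleting it right away;
            // it stays recoverable until the trash interval expires.
            Trash.moveToAppropriateTrash(fs, report, conf);
        }
    }
}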

Conclusion
In this article we saw how Hadoop appeared, what its main properties are and how the data is stored. We saw that HDFS is a system created to work with large amounts of data, and it does this extremely well and with minimal costs.
