Showing posts with label HDFS. Show all posts
Showing posts with label HDFS. Show all posts

Thursday, December 26, 2013

Big Data: What is HDFS?

HDFS stand for Hadoop Distributed File System. A file system is a technique of organizing files on physical media such as on hard disk, CDs, DVDs, flash drives etc.

HDFS is an important component of Hadoop Architecture.

Just like windows uses NTFS, FAT16 and FAT32 file system and Linux uses Ext2, Ext3, Isofs, Sysfs, Procfs file systems to store and organize files, the architecture of Hadoop uses the HDFS file system to organize files.

So how does HDFS store data?

Unlike traditional file system HDFS do not store data in one hard disk. HDFS stores data into blocks.

Unlike traditional file system a block size in Hadoop is very large. In traditional file system (NTFS or Linux file system) a block size is usually 4KB. In HDFS a block size is minimum of 64MB. A block size can be configured for 128MB or 256MB or 512MB or even 1GB.

HDFS stores three copies of every file. Each of the copy is stored in a different node. A node is a single computer in the Hadoop network.

For example a file with 20MB size is stored in three nodes.

A file which is larger than the configured block (64MB or 128MB or as configured in HDFS) is broken down as per block size. For example if there is a file of 300MB and the block size configured in HDFS is 64MB than this file will be stored into 5 blocks.

64MB 64MB 64MB 64MB 44MB

HDFS file system will distribute the data into five blocks which will be scattered into different racks and nodes. HDFS will store three copies of each block.

To understand how HDFS stores and organize files we must understand the fundamental architecture of Hadoop. Hadoop is a Master-Slave architecture. In this type of structure there is one Master and multiple slaves.

A Master contains NamedNode and JobTracker. A Slave contains DataNode and TaskTracker.

The NameNode manages the cluster metadata and DataNodes.

  • NameNode keeps track of number of blocks assigned to each file.
  • NameNode keep tracks of all blocks residing in each of the DataNodes.
  • It also contains list of active nodes in each of the rack.
  • It actively monitors the number of replicas of a block.

The JobTracker is responsible to receive client applications requests and assigning them to TaskScheduler. It try to distribute the task as close to data as possible. It means that the JobScheduler see what task is needed and where the data is residing for the concerned task. It assigns the task to the nearest node.

In the Slave architecture each of the hard disk should have ext3 or ext4 file system on it. The TaskScheduler accepts the task from JobScheduler. Each of the TaskScheduler has a fix slot which indicated number of task it can accept. Once the task is successfully processed or if it fails, it notifies to the JobTracker.

A DataNode in Slave architecture stores the actual data and interact with NameNode. DataNode can directly interact to client application once its location is provided by the NameNode.

Here is one of the video from YouTube where Sameer Farooqui has explained the HDFS in detail


Please visit the Big Data page to read all articles on Big Data

Monday, December 23, 2013

Big Data: What is Hadoop?

To understand Hadoop let us start with a simple question. Who is responsible to compute and process data? Answer is Computer.

How efficiently computer can process the data? Answer is it depends on how much data need to be processed and what is the computer configuration (Speed, space, number of processors, RAM etc.).

If data grows we can add additional hardware (hard disk, processors, RAM) to increase the efficiency of computer to process and compute data.

What if data grows to such extent that it can no longer be processed by a Single Computer? Today data is growing at a very high speed. Every day millions of tweet are generated, every hours billions of transactions are recorded by Wal-Mart, face book posts, comments, likes, news, videos, audios, images are getting recorded every minute across the world. Do you think a Single computer no matter how powerful configuration it has can handle this super massive data? The answer is No.

So what is the solution?

To process this massive data a very old and simple theory works “Divide and Conquer”. In Computer world we can say “Divide and Compute”. Yes, processing this ever growing data is not possible with one computer or one super powerful server. To process and compute this data, this data needs to be divided into small blocks and this small block of data needs to be processed by multiple computers concurrently. The processed data needs to be consolidated and return as one output. All this needs to done in real time.

This is where Hadoop comes into picture. Hadoop is a not a single software or hardware. Hadoop is a platform consists of set of tools and Technologies. The technologies which are core to the Hadoop are Google MapReduce and HDFS (Hadoop File System).

Google MapReduce is the technology develops by Google which performs the task of dividing the tasks into small sub-task. It distributes the sub-task to multiple computers called nodes. When all the nodes are done with their task; the Google MapReduce consolidate the result of all sub-task and combined them into one output. This one output is return to the calling application.

Google MapReduce consists of two programs – Map and Reduce. Map is responsible for dividing the task into small pieces (sub-task) and to distribute them to multiple computers (nodes) for processing. Reduce program is responsible to collect the output processed by individual nodes and consolidate them into one.

HDFS i.e. Hadoop Distributed File System is responsible to manage the storage of huge massive data. It does it by dividing the huge massive data into small block of data. Data is broken into small parts such as block of 128MB, 256MB, 512MB, 1GB etc. The data when distributed to multiple computers or nodes are complete data which need to be processed. A node or computer does not need to request or make additional round trip for data request. When a data is given to the computer for processing it is complete data which is required by the node/computer.

Hadoop is an open source framework to process large data sets and it is managed under Apache License. In addition to MapReduce and HDFS there are other tools which come under Hadoop umbrella. Each of these tools provide distinct feature. For example Chukwa is a data collection system for managing large distributing system. Pig is a data flow language for parallel computation.

Download Hadoop

To learn the Hadoop you can download it from Apache website. We can start with a Single Node setup and move to Cluster setup

Hadoop Videos








Please visit the Big Data page to read all articles on Big Data

Popular Posts

Real Time Web Analytics