Thursday, December 26, 2013

Big Data: What is HDFS?

HDFS stands for Hadoop Distributed File System. A file system is a way of organizing files on physical media such as hard disks, CDs, DVDs, and flash drives.

HDFS is an important component of the Hadoop architecture.

Just as Windows uses the NTFS, FAT16, and FAT32 file systems and Linux uses the Ext2, Ext3, Isofs, Sysfs, and Procfs file systems to store and organize files, Hadoop uses the HDFS file system to organize files.

So how does HDFS store data?

Unlike a traditional file system, HDFS does not store data on a single hard disk. HDFS splits data into blocks and spreads them across many machines.

Unlike in a traditional file system, a block in Hadoop is very large. In a traditional file system (NTFS or a Linux file system) the block size is usually 4 KB. In HDFS the default block size is 64 MB, and it can be configured to 128 MB, 256 MB, 512 MB, or even 1 GB.
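
As a rough sketch of how this looks in practice (assuming the Hadoop Java API; the NameNode URI and file path below are placeholders), the block size can be set cluster-wide through configuration or per file at creation time:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default block size (128 MB here). The property is
            // dfs.block.size in Hadoop 1.x, renamed dfs.blocksize later.
            conf.setLong("dfs.block.size", 128L * 1024 * 1024);

            // hdfs://namenode:9000 and /data/example.txt are placeholders.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            // The block size can also be chosen per file at creation time:
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(new Path("/data/example.txt"),
                    true, 4096, (short) 3, 256L * 1024 * 1024);
            out.writeUTF("hello HDFS");
            out.close();
        }
    }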

By default, HDFS stores three copies (replicas) of every block. Each copy is stored on a different node. A node is a single computer in the Hadoop cluster.
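
A minimal sketch, again assuming the Hadoop Java API (the path is a placeholder), of how the replication factor is read from configuration or changed for an existing file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // dfs.replication controls the number of copies; the default is 3.
            System.out.println("replicas = " + conf.getInt("dfs.replication", 3));

            // The replication factor can also be changed per file.
            FileSystem fs = FileSystem.get(conf);
            fs.setReplication(new Path("/data/example.txt"), (short) 2);
        }
    }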

For example, a 20 MB file fits in a single block, and that block is stored on three different nodes.

A file larger than the configured block size (64 MB, 128 MB, or whatever HDFS is configured with) is broken down into block-sized pieces. For example, if there is a file of 300 MB and the block size configured in HDFS is 64 MB, then this file will be stored in 5 blocks:

64 MB + 64 MB + 64 MB + 64 MB + 44 MB = 300 MB

HDFS will distribute the data into these five blocks, scattered across different racks and nodes, and will store three copies of each block.
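
The number of blocks is just the file size divided by the block size, rounded up. A quick sketch of the arithmetic for the 300 MB example:

    public class BlockCount {
        public static void main(String[] args) {
            long fileSize  = 300L * 1024 * 1024; // 300 MB file
            long blockSize = 64L * 1024 * 1024;  // 64 MB blocks

            long fullBlocks  = fileSize / blockSize;                  // 4
            long remainder   = fileSize % blockSize;                  // 44 MB
            long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);  // 5

            System.out.println(totalBlocks + " blocks, last block = "
                    + remainder / (1024 * 1024) + " MB");
        }
    }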

To understand how HDFS stores and organizes files, we must understand the fundamental architecture of Hadoop. Hadoop uses a master-slave architecture: there is one master and multiple slaves.

The master runs the NameNode and the JobTracker. Each slave runs a DataNode and a TaskTracker.

The NameNode manages the cluster metadata and DataNodes.

  • The NameNode keeps track of the number of blocks assigned to each file.
  • The NameNode keeps track of all blocks residing on each DataNode (the sketch after this list queries this metadata from a client).
  • It also maintains a list of the active nodes in each rack.
  • It actively monitors the number of replicas of each block.
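
A hedged sketch of that query from the client side, using the Hadoop Java API (the path is a placeholder): the NameNode answers getFileBlockLocations() from its metadata, returning one entry per block with the DataNodes that hold a replica.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));

            // One BlockLocation per block, listing the hosts with a replica.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println(String.join(",", block.getHosts()));
            }
        }
    }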

The JobTracker receives requests from client applications and assigns them to TaskTrackers. It tries to schedule each task as close to its data as possible: the JobTracker checks what a task needs and where the data for that task resides, and assigns the task to the nearest node.
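
A purely illustrative sketch of that preference order (not the actual JobTracker code): given the hosts holding a task's input block, prefer a node-local slot, then a rack-local one, then any free node.

    import java.util.List;
    import java.util.Set;
    import java.util.function.Function;

    public class LocalityChooser {
        // Illustrative only: pick the best node for a task whose input
        // block lives on dataHosts; rackOf maps a node to its rack name.
        static String chooseNode(Set<String> dataHosts, Set<String> dataRacks,
                                 List<String> freeNodes, Function<String, String> rackOf) {
            for (String node : freeNodes)                         // 1. node-local
                if (dataHosts.contains(node)) return node;
            for (String node : freeNodes)                         // 2. rack-local
                if (dataRacks.contains(rackOf.apply(node))) return node;
            return freeNodes.isEmpty() ? null : freeNodes.get(0); // 3. any node
        }
    }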

On the slave side, each hard disk should have an ext3 or ext4 file system on it. The TaskTracker accepts tasks from the JobTracker. Each TaskTracker has a fixed number of slots, which indicates how many tasks it can accept. Once a task is successfully processed, or if it fails, the TaskTracker notifies the JobTracker.
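
Slot counts come from the cluster configuration; a sketch assuming the classic Hadoop 1.x property names (the values here are examples):

    import org.apache.hadoop.conf.Configuration;

    public class SlotConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Classic (Hadoop 1.x) MapReduce slot settings.
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
            System.out.println("map slots = "
                    + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
        }
    }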

A DataNode on a slave stores the actual data and interacts with the NameNode. A DataNode can interact directly with a client application once its location has been provided by the NameNode.
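
A minimal read sketch, assuming the Hadoop Java API and a placeholder path: open() asks the NameNode where the blocks live, and the returned stream then reads the bytes directly from the DataNodes.

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() contacts the NameNode for block locations; the data
            // itself streams directly from the DataNodes.
            InputStream in = fs.open(new Path("/data/example.txt"));
            try {
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }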

Here is a video from YouTube in which Sameer Farooqui explains HDFS in detail:


Please visit the Big Data page to read all articles on Big Data.
