To understand Hadoop, let us start with a simple question: who is responsible for computing and processing data? The answer is a computer.
How efficiently can a computer process data? That depends on how much data needs to be processed and on the computer's configuration (processor speed, disk space, number of processors, RAM, etc.).
If the data grows, we can add more hardware (hard disks, processors, RAM) to increase the computer's capacity to process and compute it.
What if the data grows to such an extent that it can no longer be processed by a single computer? Today data is growing at a very high speed. Every day millions of tweets are generated, every hour billions of transactions are recorded by Wal-Mart, and Facebook posts, comments, likes, news, videos, audio, and images are recorded every minute across the world. Do you think a single computer, no matter how powerful its configuration, can handle this massive amount of data? The answer is no.
So what is the solution?
To process this massive data, a very old and simple principle applies: "divide and conquer." In the computing world we could say "divide and compute." Processing this ever-growing data is not possible with one computer or even one super powerful server. Instead, the data needs to be divided into small blocks, and those blocks need to be processed by multiple computers concurrently. The processed results then need to be consolidated and returned as one output, and all of this needs to happen quickly.
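To get a feel for "divide and compute," here is a rough sketch on a single machine, using threads as stand-ins for separate computers. It divides an array into blocks, has several workers sum their blocks concurrently, and consolidates the partial sums into one result. Hadoop does this across many machines, but the shape of the computation is the same. The array, worker count, and chunking here are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndCompute {
  public static void main(String[] args) throws Exception {
    // Pretend this array is a huge data set that one machine struggles to handle alone.
    long[] data = new long[10_000_000];
    for (int i = 0; i < data.length; i++) data[i] = i + 1;

    int workers = 4;                      // stand-ins for separate computers (nodes)
    int chunk = data.length / workers;
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    List<Future<Long>> partials = new ArrayList<>();

    // Divide: give each worker its own block of the data.
    for (int w = 0; w < workers; w++) {
      final int start = w * chunk;
      final int end = (w == workers - 1) ? data.length : start + chunk;
      partials.add(pool.submit((Callable<Long>) () -> {
        long sum = 0;
        for (int i = start; i < end; i++) sum += data[i];
        return sum;                       // each worker computes its partial result
      }));
    }

    // Conquer: consolidate the partial results into one output.
    long total = 0;
    for (Future<Long> p : partials) total += p.get();
    pool.shutdown();

    System.out.println("Total = " + total); // sum of 1..10,000,000
  }
}
```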
This is where Hadoop comes into the picture. Hadoop is not a single piece of software or hardware. Hadoop is a platform consisting of a set of tools and technologies. The technologies at the core of Hadoop are MapReduce (a programming model originally published by Google) and HDFS (the Hadoop Distributed File System).
MapReduce was originally developed by Google, and Hadoop provides an open-source implementation of it. MapReduce performs the task of dividing a job into small sub-tasks and distributes those sub-tasks to multiple computers, called nodes. When all the nodes are done with their sub-tasks, MapReduce consolidates the results of all the sub-tasks and combines them into one output. This output is returned to the calling application.
MapReduce consists of two phases: Map and Reduce. The Map phase is responsible for dividing the work into small pieces (sub-tasks) and distributing them to multiple computers (nodes) for processing. The Reduce phase is responsible for collecting the output produced by the individual nodes and consolidating it into one result.
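To make Map and Reduce concrete, here is the classic word-count example written against the Hadoop MapReduce Java API, in the spirit of the official Hadoop tutorial. The mapper emits a (word, 1) pair for every word it sees in its block of the input; the reducer receives all counts for a given word and sums them. The input and output paths passed on the command line are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: each node processes its own block of the input and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: collects all counts emitted for the same word and consolidates them into one sum.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```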
HDFS, the Hadoop Distributed File System, is responsible for managing the storage of this massive data. It does so by dividing large files into blocks of a fixed size, such as 128 MB (the default in Hadoop 2 and later), 256 MB, or 512 MB. These blocks are distributed across multiple computers (nodes), and each block given to a node for processing is a complete, self-contained piece of data. Because the block is already stored locally, the node does not need to request data or make additional round trips over the network; when a block is handed to a node for processing, it contains everything that node needs.
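As a small sketch of how an application talks to HDFS, the snippet below uses the Hadoop FileSystem Java API to write a file and then inspect the block size and replication factor HDFS assigned to it. The NameNode address, block size setting, and file path are assumptions chosen for illustration; adjust them for a real cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumption: a NameNode is reachable at this address; change it for your cluster.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    // Ask HDFS to split files we write into 128 MB blocks.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/demo/sample.txt"); // hypothetical path for illustration

    // Write a file; a large file would be broken into 128 MB blocks and those
    // blocks (with their replicas) spread across the cluster's DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Inspect how the file is stored: its block size and replication factor.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size (bytes): " + status.getBlockSize());
    System.out.println("Replication factor: " + status.getReplication());
  }
}
```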
Hadoop is an open-source framework for processing large data sets, released under the Apache License. In addition to MapReduce and HDFS, there are other tools that come under the Hadoop umbrella, each providing a distinct feature. For example, Chukwa is a data collection system for managing large distributed systems, and Pig is a data-flow language for parallel computation.
Download Hadoop
To learn Hadoop, you can download it from the Apache website. You can start with a single-node setup and later move to a cluster setup.
Please visit the Big Data page to read all articles on Big Data.