Apache Hadoop is an open-source framework written in Java for the distributed processing of large-scale datasets over clusters of computers. Hadoop is mostly used to achieve high throughput when accessing large data sets. Since the framework is written in Java, it is highly portable and can run on almost any platform, adding to its list of benefits. The Hadoop architecture consists of four major components:
- Hadoop Common
- YARN
- HDFS
- MapReduce
Hadoop Common
This includes the Java libraries and utilities needed for the functioning of Hadoop. As a framework, Hadoop requires several core libraries and Java classes for its own operation; these are essential not only for basic functioning but also for optimal performance. There are also additional utilities that are useful for running particular Hadoop modules. All such libraries and utilities are grouped together under Hadoop Common.
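As an illustration, the Configuration class shipped in Hadoop Common is used by virtually every Hadoop program to pick up cluster settings. A minimal sketch (the property printed here is just an example):

```java
import org.apache.hadoop.conf.Configuration;

public class CommonConfigExample {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath
        Configuration conf = new Configuration();

        // fs.defaultFS names the default file system, e.g. hdfs://namenode:8020
        String defaultFs = conf.get("fs.defaultFS");
        System.out.println("Default file system: " + defaultFs);
    }
}
```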
YARN
YARN is an acronym for Yet Another Resource Negotiator. As the name suggests, it negotiates resources, that is, it is responsible for managing and allocating resources. YARN is the framework that allocates resources to computational tasks. It consists of two components: the ResourceManager and the NodeManagers. The ResourceManager is the master and the NodeManagers are the slaves in this architecture. There is one ResourceManager per cluster, responsible for managing resources such as memory and processors across the nodes. Each NodeManager runs on a worker node and offers that node's resources to the cluster.
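As a rough sketch of what the ResourceManager tracks, YARN's client library can be used to list the NodeManagers currently offering resources to the cluster. This assumes a yarn-site.xml on the classpath pointing at a running ResourceManager:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResources {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // picks up yarn-site.xml from the classpath
        yarnClient.start();

        // Ask the ResourceManager for a report of every running NodeManager
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability: " + node.getCapability());
        }

        yarnClient.stop();
    }
}
```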
HDFS
HDFS is short for Hadoop Distributed File System. It provides Hadoop's distributed storage. Hadoop can also work with other, simpler file systems, but HDFS is the one that provides the highest throughput for large data sets. HDFS also follows a master-slave architecture. Here the master is known as the NameNode, and the slaves are known as DataNodes. The NameNode holds the file system metadata, that is, information about the data such as the directory tree and the mapping of files to blocks, and manages the namespace. The DataNodes store the actual data blocks.
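A minimal sketch of reading and writing through Hadoop's FileSystem API, assuming fs.defaultFS points at an HDFS NameNode; the path /tmp/hello.txt is only illustrative. The NameNode records the file's metadata, while the DataNodes store its blocks:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // resolves to HDFS when fs.defaultFS points to a NameNode

        Path file = new Path("/tmp/hello.txt"); // illustrative path

        // Write: the NameNode records the metadata, DataNodes store the blocks
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello HDFS\n");
        }

        // Read the file back through the same FileSystem abstraction
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(reader.readLine());
        }
    }
}
```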
MapReduce
MapReduce in Hadoop performs distributed computation across the cluster. The computation in MapReduce is done by exploiting parallelism. MapReduce is a combination of two processing stages, Map and Reduce. The Map function maps the input data into key-value pairs (tuples), while the Reduce function takes the output of the Map stage and reduces that set of tuples into a smaller set of results. Like the other components, MapReduce also follows a master-slave architecture, between the JobTracker and the TaskTrackers. The master is the JobTracker, which manages all Hadoop jobs. A job in Hadoop is the execution of a client request. Each job is broken down into tasks, which are then distributed to the TaskTrackers, the slaves in this architecture.
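The canonical example of these two stages is word counting: the Map stage emits a (word, 1) pair for every word it sees, and the Reduce stage sums the counts per word. A minimal sketch using Hadoop's MapReduce Java API (the class names here are our own choice):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: emit (word, 1) for every word in the input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce stage: sum the counts emitted for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```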
A client first submits a request to Hadoop in the form of a job. By convention, the client supplies the input file location, the output file location, the job configuration, and the Java classes that implement the map and reduce functions. The JobTracker distributes the job as tasks to the TaskTrackers, which run the map and reduce functions and save the output to the specified output location.
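Continuing the word-count sketch above, a driver class assembles exactly this information, the input path, the output path, the job configuration and the map/reduce classes, and submits it to the cluster (the paths come from the command line here):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // classes from the sketch above
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations supplied by the client, e.g. on HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```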