Apache Hadoop is an open-source framework written in Java for the distributed processing of large-scale datasets over clusters of computers. Hadoop is mostly used to achieve high throughput when accessing large data sets. Since the framework is written in Java, it is highly portable and can run on almost any platform, adding to its list of benefits. The Hadoop architecture consists of four major components:
- Hadoop Common
- YARN
- HDFS
- MapReduce
Hadoop Common
This includes the Java libraries and utilities needed for the functioning of Hadoop. As a framework, Hadoop requires several core libraries and Java classes for its own operation; these are essential not only for basic functioning but also for optimal performance. There are also additional utilities that are useful for running particular Hadoop modules. All such libraries and utilities are grouped together under Hadoop Common.
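As an illustration, the Configuration class shipped in Hadoop Common is used by virtually every Hadoop program to pick up cluster settings. A minimal sketch (the property printed here is just an example):

```java
import org.apache.hadoop.conf.Configuration;

public class CommonConfigExample {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath
        Configuration conf = new Configuration();

        // fs.defaultFS names the default file system, e.g. hdfs://namenode:8020
        String defaultFs = conf.get("fs.defaultFS");
        System.out.println("Default file system: " + defaultFs);
    }
}
```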
YARN
YARN is an acronym for Yet Another Resource Negotiator. As the name suggests, it negotiates resources, that is, it is responsible for managing and allocating resources. YARN is the framework that allocates resources to computational tasks. It consists of two components: the ResourceManager and the NodeManagers. The ResourceManager is the master and the NodeManagers are the slaves in this architecture. There is one ResourceManager per cluster, responsible for managing resources such as memory and processors across the nodes. Each NodeManager runs on a worker node and offers that node's resources to the cluster.
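As a rough sketch of what the ResourceManager tracks, YARN's client library can be used to list the NodeManagers currently offering resources to the cluster. This assumes a yarn-site.xml on the classpath pointing at a running ResourceManager:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResources {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // picks up yarn-site.xml from the classpath
        yarnClient.start();

        // Ask the ResourceManager for a report of every running NodeManager
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability: " + node.getCapability());
        }

        yarnClient.stop();
    }
}
```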
HDFS
HDFS is short for Hadoop Distributed File System. It provides Hadoop's distributed storage. Hadoop can also work with other, simpler file systems, but HDFS is the one that provides the highest throughput for large data sets. HDFS also follows a master-slave architecture. Here the master is known as the NameNode, and the slaves are known as DataNodes. The NameNode holds the file system metadata, that is, information about the data such as the directory tree and the mapping of files to blocks, and manages the namespace. The DataNodes store the actual data blocks.
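A minimal sketch of reading and writing through Hadoop's FileSystem API, assuming fs.defaultFS points at an HDFS NameNode; the path /tmp/hello.txt is only illustrative. The NameNode records the file's metadata, while the DataNodes store its blocks:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // resolves to HDFS when fs.defaultFS points to a NameNode

        Path file = new Path("/tmp/hello.txt"); // illustrative path

        // Write: the NameNode records the metadata, DataNodes store the blocks
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello HDFS\n");
        }

        // Read the file back through the same FileSystem abstraction
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(reader.readLine());
        }
    }
}
```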
MapReduce
MapReduce in Hadoop performs distributed computation across the cluster. The computation in MapReduce is done by exploiting parallelism. MapReduce is a combination of two processing stages, Map and Reduce. The Map function maps the input data into key-value pairs (tuples), while the Reduce function takes the output of the Map stage and reduces that set of tuples into a smaller set of results. Like the other components, MapReduce also follows a master-slave architecture, between the JobTracker and the TaskTrackers. The master is the JobTracker, which manages all Hadoop jobs. A job in Hadoop is the execution of a client request. Each job is broken down into tasks, which are then distributed to the TaskTrackers, the slaves in this architecture.
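The canonical example of these two stages is word counting: the Map stage emits a (word, 1) pair for every word it sees, and the Reduce stage sums the counts per word. A minimal sketch using Hadoop's MapReduce Java API (the class names here are our own choice):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: emit (word, 1) for every word in the input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce stage: sum the counts emitted for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```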
A client first submits a request to Hadoop in the form of a job. By convention, the client supplies the input file location, the output file location, the job configuration, and the Java classes that implement the map and reduce functions. The JobTracker distributes the job as tasks to the TaskTrackers, which run the map and reduce functions and save the output to the specified output location.
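Continuing the word-count sketch above, a driver class assembles exactly this information, the input path, the output path, the job configuration and the map/reduce classes, and submits it to the cluster (the paths come from the command line here):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // classes from the sketch above
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations supplied by the client, e.g. on HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```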