
Apache Hadoop

In this article, we will explore Apache Hadoop. Apache Hadoop, commonly referred to simply as Hadoop, is an open-source framework for handling, storing, and processing large data sets, or Big Data. These data sets can range from gigabytes up to yottabytes. When handling such large volumes of data, a single large computer system easily becomes slow, inefficient, and prone to failure. Hadoop instead lets a cluster of computers handle these large data sets in a distributed fashion: it is a collection of software utilities that facilitate parallel, distributed processing and storage of data. Because of Hadoop, we can use commodity hardware arranged in clusters instead of expensive high-end hardware to handle large data sets. Fault tolerance is built into the system through replication: each piece of data is stored on multiple machines, so the failure of a single node does not cause data loss.
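As a concrete illustration of how replication works from a programmer's point of view, here is a minimal sketch using the HDFS Java client API. It assumes a running HDFS cluster reachable via the address configured in fs.defaultFS; the file path and replication factor here are purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumption: fs.defaultFS points at a running HDFS NameNode,
    // e.g. hdfs://localhost:9000 (set in core-site.xml or here).
    // Ask HDFS to keep 3 copies of each block of files we create.
    conf.setInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits files into blocks and
    // replicates each block across different DataNodes.
    Path file = new Path("/demo/hello.txt");  // illustrative path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("Hello, Hadoop!");
    }

    // Read back the replication factor recorded for the file.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor: " + status.getReplication());

    fs.close();
  }
}
```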

The Hadoop framework consists of four main parts:

-- HDFS: The acronym stands for Hadoop Distributed File System. It is responsible for storing data in the form of HDFS blocks, which are replicated across the nodes of the cluster.

-- MapReduce: The processing part of the Hadoop framework. Computations are programmed as map and reduce operations (see the word-count sketch after this list). Data locality is an important advantage of Hadoop: the code is moved to where the data resides, rather than moving the data, ensuring fast and efficient processing.

-- Hadoop Common: It provides the core Java libraries and utilities used by the other Hadoop modules.

-- YARN: It stands for Yet Another Resource Negotiator. YARN is responsible for resource allocation, management of the various nodes, and communication between them.
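To make the MapReduce part concrete, below is the classic word-count job written against the Hadoop MapReduce Java API. The mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word. This is a minimal sketch: it assumes the Hadoop libraries are on the classpath and that the input and output paths passed on the command line refer to locations in HDFS.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

When a job like this is submitted (for example with `hadoop jar`), YARN allocates containers on the cluster's nodes for the map and reduce tasks, placing them close to the HDFS blocks they read whenever possible, which is the data locality described above.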

[Figure: the Hadoop framework and its four components. Picture source: educba]