Data Locality in Hadoop
Hadoop processes very large amounts of data, and it relies on data locality to improve performance and reduce network congestion. Data locality means moving the computation to the nodes where the data is stored instead of moving the data to the code, because the code is far smaller than the data it needs to process and is much cheaper to ship across the network. Hadoop stores data in HDFS (Hadoop Distributed File System): large data sets are split into blocks that are distributed across many DataNodes. When a MapReduce job runs, the NameNode knows which DataNodes hold each block, and the job scheduler uses that metadata to launch map tasks on, or as close as possible to, the nodes that store the data. The prerequisite for data locality is therefore that Hadoop knows exactly where each block is stored.
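As a minimal sketch of the block-location metadata involved, the HDFS client API lets you ask the NameNode which DataNodes hold each block of a file; this is the same kind of information a locality-aware scheduler consults when placing map tasks. The input path below is hypothetical, and the example assumes a working Hadoop configuration is on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        // Connect to HDFS using the cluster configuration found on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input file; replace with a path that exists in your cluster.
        Path input = new Path("/data/input/events.log");
        FileStatus status = fs.getFileStatus(input);

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```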
There are multiple categories of data locality in Hadoop, illustrated by the scheduling sketch after this list:
--Data local: the block is on the same node where the mapper runs. This is the best case.
--Intra-rack (rack local): due to resource constraints on the node holding the data, the mapper runs on a different node but in the same rack.
--Inter-rack (off rack): no node in the data's rack has free resources, so the mapper runs in a different rack and the block is read across rack switches.
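The following toy sketch shows the preference order these categories imply: try a node that holds the block, then a free node in the same rack, and only then any other node. It is an illustrative simplification, not Hadoop's actual scheduler; the method and parameter names are made up for this example.

```java
import java.util.List;
import java.util.Map;

// Toy illustration of the locality preference order. Hadoop's real schedulers
// (e.g. YARN's capacity and fair schedulers) are far more sophisticated.
public class LocalityChooser {

    /**
     * blockHosts: DataNodes holding the block.
     * nodeToRack: rack assignment for every node in the cluster.
     * freeNodes:  nodes with a free slot, mapped to their rack.
     */
    static String chooseNode(List<String> blockHosts,
                             Map<String, String> nodeToRack,
                             Map<String, String> freeNodes) {
        // 1. Data local: a node that both holds the block and has a free slot.
        for (String host : blockHosts) {
            if (freeNodes.containsKey(host)) {
                return host;
            }
        }
        // 2. Rack local: a free node in the same rack as one of the block's hosts.
        for (String host : blockHosts) {
            String rack = nodeToRack.get(host);
            for (Map.Entry<String, String> free : freeNodes.entrySet()) {
                if (free.getValue().equals(rack)) {
                    return free.getKey();
                }
            }
        }
        // 3. Off rack: fall back to any free node; the block crosses rack switches.
        return freeNodes.keySet().iterator().next();
    }
}
```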
Data locality brings clear advantages, such as higher overall throughput and faster job execution. However, in very large clusters with heterogeneous node types, fully data-local execution becomes harder to achieve and scheduling becomes more complex.