Hadoop Archives :
- Hadoop Archive (HAR) is a facility that packs many small files into larger archive files, which reduces the memory wasted on the NameNode.
- The NameNode stores the metadata for all files and blocks in HDFS.
- If 1 GB of data arrives as 1,000 small files instead of one large file, the NameNode has to store metadata for all 1,000 files.
- In that manner, NameNode memory is wasted storing and managing a huge number of small objects.
- A HAR is created from a collection of files, and the archiving tool runs a MapReduce job.
- This MapReduce job processes the input files in parallel to create the archive file.
- Hadoop is designed to deal with large files, so large numbers of small files are problematic and must be handled efficiently.
- When input data arrives as a huge number of small files stored across the DataNodes, the NameNode must keep a metadata record for every one of them, which strains its memory and makes it inefficient.
- To handle this problem, Hadoop Archive was created: it packs HDFS files into archives, and we can use these archived files directly as input to MapReduce jobs.
- A Hadoop archive always has a *.har extension.
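The memory cost described above can be sketched with some back-of-the-envelope arithmetic. This is an illustration, not a measurement: the ~150 bytes per NameNode object and the 128 MB block size are commonly cited rule-of-thumb figures, assumed here for the example.

```python
# Rough illustration of why many small files strain the NameNode.
# Assumption: each file and block object costs roughly ~150 bytes
# of NameNode heap (a commonly cited rule of thumb, not an exact figure).
BYTES_PER_OBJECT = 150

def namenode_bytes(num_files, blocks_per_file=1):
    """Approximate NameNode heap used by the file and block objects."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 1 GB stored as a single file (8 blocks of 128 MB each):
one_big_file = namenode_bytes(1, blocks_per_file=8)      # 1,350 bytes

# The same 1 GB stored as 1,000 small files (1 block each):
many_small_files = namenode_bytes(1000, blocks_per_file=1)  # 300,000 bytes

print(one_big_file, many_small_files)
```

Even under these rough assumptions, the same data as many small files costs the NameNode over two hundred times more metadata memory, which is exactly the waste HAR files avoid.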
- HAR Syntax :
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Example :
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
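Once created, the archive can be read through the har:// URI scheme with the ordinary HDFS shell commands. A brief sketch, assuming the example archive above exists at /user/zoo/foo.har (the file name inside dir1 is hypothetical):

```shell
# List the top level of the archive
hdfs dfs -ls har:///user/zoo/foo.har

# List a directory that was archived
hdfs dfs -ls har:///user/zoo/foo.har/dir1

# Read a file inside the archive (hypothetical file name)
hdfs dfs -cat har:///user/zoo/foo.har/dir1/somefile.txt
```

Because the har:// paths behave like normal HDFS paths, they can also be passed directly as input paths to MapReduce jobs, as noted above.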