Explain distributed cache in Hadoop?

Distributed cache is a facility provided by the Hadoop MapReduce framework to distribute small files that an application needs during execution. These files are typically small, on the order of kilobytes to a few megabytes, and are usually text files, archives, or JAR files. Because they are small, they can be copied to fast local storage on every node so tasks can read them without going back to HDFS. An application that wants to distribute a file through the distributed cache must make sure the file is available at a URL the framework can access, either hdfs:// or http://.
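As a sketch of the driver side (assuming the Hadoop 2.x `org.apache.hadoop.mapreduce.Job` API; the class name and HDFS paths below are hypothetical examples, not paths from this article):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CacheDemoDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed-cache-demo");
        job.setJarByClass(CacheDemoDriver.class);

        // The file must already exist at a URL the framework can reach
        // (hdfs:// or http://); this path is only an example.
        job.addCacheFile(new URI("hdfs:///apps/lookup/stopwords.txt"));

        // Archives (.zip, .tar.gz, .jar) are un-archived on each node
        // after transfer:
        job.addCacheArchive(new URI("hdfs:///apps/lookup/dictionaries.zip"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

`Job.addCacheFile` and `Job.addCacheArchive` only record the URIs in the job configuration; the actual copying happens when the framework localizes the files on the worker nodes.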

Once the file is available at the given URL, the MapReduce framework copies the necessary files to all the nodes before any tasks start on those nodes. If the files provided are archives, they are automatically un-archived on the nodes after transfer.
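On the task side, each mapper or reducer can then read the localized copy from the local disk. A minimal sketch of a mapper doing this (assuming the file was added as `stopwords.txt` and is, as is the default, symlinked under its own name into the task's working directory; the mapper class and its word-filtering logic are illustrative assumptions):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheReadingMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Set<String> stopWords = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // getCacheFiles() returns the URIs registered by the driver;
        // the localized copies are read-only files on this node's disk.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            try (BufferedReader reader =
                    new BufferedReader(new FileReader("stopwords.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    stopWords.add(line.trim());
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit every word that is not in the cached stop-word list.
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty() && !stopWords.contains(word)) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}
```

Note that the file is read once per task in `setup()`, from the local disk, rather than once per record from HDFS.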

For example: suppose a Hadoop cluster has three DataNodes and we run 30 tasks in the cluster, so each node gets roughly 10 tasks. Suppose each task needs some lookup information or a particular JAR file before it can execute. To satisfy this, we can put the files containing that information, or the JAR files, into the distributed cache. Before the tasks execute, the framework copies the cache files to each slave node; the tasks (mappers or reducers) then read them locally. These cached files are read-only. By default the Hadoop distributed cache size is 10 GB; to change it, modify the corresponding property in mapred-site.xml.

At this point a question naturally arises: why is a local cache needed at all? Why not just keep the file in HDFS, which is already present on each DataNode, and have the application read it from there? There are 30 tasks here, and in real deployments there can easily be hundreds or thousands. If we kept the files only in HDFS, then to run 30 tasks the application would have to access the HDFS location 30 times and read the file each time, and HDFS is not efficient at serving small files that many times. This is why the distributed cache is used: the files are localized once per node instead of read once per task, which sharply reduces the number of reads against HDFS.
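The 10 GB default mentioned above is a configurable limit. As a sketch, assuming classic MapReduce (MRv1), where the property is `local.cache.size` in mapred-site.xml, measured in bytes (on YARN clusters the equivalent knob is `yarn.nodemanager.localizer.cache.target-size-mb` in yarn-site.xml instead):

```xml
<property>
  <!-- MRv1: maximum size in bytes of the local distributed cache;
       the default is 10737418240 bytes (10 GB). -->
  <name>local.cache.size</name>
  <value>10737418240</value>
</property>
```

When the localized files on a node grow past this limit, the framework deletes least-recently-used cache entries to reclaim space.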
