A Study on Hadoop MapReduce Techniques and Applications on Grid Computing

Ila Savant; Richa Muke; Nilay Narlawar

doi:10.17577/IJERTV2IS120981

Volume 02, Issue 12 (December 2013)

A Study on Hadoop MapReduce Techniques and Applications on Grid Computing

DOI : 10.17577/IJERTV2IS120981

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 333
Total Downloads : 543
Authors : Ila Savant, Richa Muke, Nilay Narlawar
Paper ID : IJERTV2IS120981
Volume & Issue : Volume 02, Issue 12 (December 2013)
Published (First Online): 24-12-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

A Study on Hadoop MapReduce Techniques and Applications on Grid Computing

Ila Savant	Richa Muke	Nilay Narlawar
M.Tech CSE*	B.E Computer*	B.E Computer*
IES, Bhopal	MMCOE, Pune	MMCOE, Pune

Abstract

Hadoop is an open source programming framework which provides processing of voluminous amounts of data using distributed computing environment. Hadoop creates clusters of inexpensive computers and distributes work among them. MapReduce is the core of Hadoop technology. It enables the massive scalability across hundreds to thousands of servers in a Hadoop cluster. In this paper, we focused Hadoops MapReduce techniques and their comparative study. Also we discussed MapReduce application on grid computing, image processing to deal with big data problem.

Keywords

Distributed computing, clusters, MapReduce, Grid computing.

Introduction

Almost 90% of the data produced worldwide has been created in the last few years alone, that is, 2.5 quintillion bytes per day. Recent times have seen enormous volumes of data growing at a rapid rate. Therefore, new strategies and processes are needed to process this burgeoning data. In addition, a change has been observed in the format of data produced, from conventional text based data to unstructured sources including web server logs, blogs, images and social media streams. These changing data formats have overwhelmed the existing systems. Also, the old systems had some limitations. Businesses now needed not just queries but also processing. Apache Hadoop, an open source software framework, is an optimal solution to all these problems. Hadoop uses two core functionalities, namely, HDFS (Hadoop Distributed File System) and MapReduce.

Fig.1 working of map and reduce technique

Fig.1: The input to the MapReduce process in figure is a set of colored circles. The objective is to count the number of each color. The programmers in this example are responsible for coding the map and reduce programs; the remainder of the processing is handled by the software system implementing the MapReduce programming model.

MapReduce technology was pioneered by Google for processing of their large data generated every day. It is a form of abstraction which highlights the necessary computations and hides the unnecessary and tedious details such as parallelization and load balancing. It is an implementation of the Map and Reduce functionalities used in common functional languages. Map will map the records in input, while Reduce will combine all the values having same key. This allows parallelization to a large extent.

Hadoop includes a huge storage system called the Hadoop Distributed File System (HDFS). It is capable of storing massive amounts of data and is resistant to failure of vital parts of the infrastructure. The input files are broken into parts called "blocks" by HDFS and are stored redundant across the pool of servers, thus multiple copies of data can be found in the network. In a situation where one of the nodes fails, HDFS notifies it and creates new copies of data from these redundant data. Also, multiple copies of data avoid bottlenecks in the system.
Related Work

A Hadoop cluster is composed of two parts: Hadoop Distributed File System and MapReduce. A Hadoop cluster uses Hadoop Distributed File System (HDFS) to manage its data. HDFS provides storage for the MapReduce jobs input and output data. It is designed as a highly fault-tolerant, high throughput, and high capacity distributed file system. In [2], the authors highlight on Parallel Genetic Algorithm for the automatic generation of test suites. The solution is based on Hadoop MapReduce since it is well supported to work also in the cloud and on graphic cards, thus being an ideal candidate for high scalable parallelization of Genetic Algorithms. The amount of images being uploaded to the internet is rapidly increasing, with Facebook users uploading over 2.5 billion new photos

each month, however, applications that make use of this data are severely lacking. In [1], it is proposed that an open-source Hadoop Image Processing Interface (HIPI) that aims to create an interface for computer vision with MapReduce technology. HIPI abstracts the highly technical details of Hadoops system and is flexible enough to implement many techniques in current computer vision literature.
MapReduce Techniques
1. Data placement in Heterogeneous cluster In heterogeneous clusters, the capacity of computation of the nodes can be significantly different. A node with a high computing capacity can process data in a short span of time compared to the slow processing node. This algorithm has two parts: first, to distribute files among heterogeneous clusters, second, is used to reorganize the fragments and to solve the data skew problems.
  1. Initial data placement
    
    [4] Starts with dividing the large input size in fragments of even size. Then the placement algorithm distributes these fragments among the nodes according to their capacities. It distributes fragments in such a way that all the nodes complete the given task at the same time. Also the disk storage of high capacity nodes is more. Hence, the high capacity nodes are expected to do more work. It has been observed that the computing capacity and along with the response time is stable for certain Hadoop applications. This is because the response time is directly proportional to the input data.
  2. Data redistribution
    
    But the distribution using the above algorithm might be disrupted due to the following reasons: new data blocks might be added or deleted from the existing inputs, or new nodes can be added in the existing clusters. Therefore, in this algorithm it distributes the data according to computing ratio of the nodes. Here, in the algorithm, the data distribution server collects the information regarding network topology and disk storage capacity. Next, it creates two lists: first where the total number of local fragments exceeds its computing capacity and the second, where the number of local fragments is less than the capacity of the node. The former list is called over utilized node list, while the second is called as underutilized node list. Now, the data redistribution server moves the fragments from the over utilized list to the underutilized list. The nodes in over- utilized lists are source node whereas the nodes in the underutilized lists are destination nodes during migration of fragment files. This process is repeated until the total number of local fragments in each node matches its speed as measured by computing ratio.
    The goal of implementation in [8] is to provide a Hadoop platform comparable to that of a dedicated cluster for users to run on the grid. It should be necessary to change their MapReduce code in order to run on the proposed adaptation of Hadoop. Therefore, the authors have made no API changes to MapReduce, only underlying changes in order to better fit the grid usage model. When the grid job begins, it starts the tasktracker on the remote worker node. The tasktracker is in charge of managing the execution of Map and Reduce tasks on the worker node. When it begins, it contacts the jobtracker on the central server which marks the node available for processing. The tasktrackers report their status to the jobtracker and accept task assignments from it. In the current version of HOG, they have followed Apache Hadoops FIFO job scheduling policy with predictive execution enabled. At any time, a task has at most two copies of execution in the system. The communication between the tasktracker and the jobtracker is based on HTTP. In the HOG system, the HTTP requests and responses are over the WAN which has high latency and long transmission time compared with the LAN of a cluster. Because of this increased communication latency, it is expected that the startup and data transfer initiations will be increased.
Comparative Study Of Techniques

Table 1: comparative study
Conclusion

Big data and associated technologies can bring significant benefits to the business, but as the use of these technologies grows, it will become difficult for organizations to tightly control all of the many forms of data used for analysis and investigation. MapReduce can be a good approach on grid computing and image retrieval to deal with such a big data problem.
References

Chris Sweeney Liu Liu Sean Arietta Jason Lawrence, HIPI: A Hadoop Image Processing Interface for Image- based MapReduce Tasks, University of Virginia.
Linda Di Geronimo, Filomena Ferrucci, Alfonso Murolo, A Parallel Genetic Algorithm Based on Hadoop MapReduce for the Automatic Generation of JUnit Test Suites,2012 IEEE Fifth International

Conference on Software Testing, Verification and Validation.
Avrilia Floratou University of WisconsinMadison Jignesh M. Patel University of WisconsinMadison Eugene J. Shekita IBM Almaden Research Center, Column Oriented Storage Techniques for MapReduce.
Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin, Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters, Department of Computer Science and Software Engineering Auburn University.
Apache Hadoop Map Reduce, http://hadoop.apache.org/mapreduce/
Max Grossman, Department of Computer Science Rice University,HadoopCL: MapReduce on Distributed Heterogeneous Platforms through Seamless Integration of Hadoop and OpenCL, 2013 IEEE 27th International Symposium on Parallel & Distributed Processing Workshops and PhD Forum.
Tomai, A. Rashkovska and M. Depolli, Joef Stefan Institute/Depatment of Communication Systems, Using Hadoop MapReduce in a Multicluster Environment, MIPRO 2013, May 20-24, 2013, Opatija, Croatia.
Chen He, Derek Weitzel, David Swanson, Ying Lu, Computer Science and Engineering, University of Nebraska Lincoln HOG: Distributed Hadoop MapReduce on the Grid, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.
GuoweiWang School of Computer Science and Technology Henan Polytechnic University A Two- phase Execution Engine of Reduce Tasks In Hadoop MapReduce, 2012 International Conference on Systems and Informatics (ICSAI 2012).

A Study on Hadoop MapReduce Techniques and Applications on Grid Computing

Leave a Reply