A Review on Parallel Genetic Algorithm Models for Map Reduce in Big Data

Kiranjit Pattnaik; B.   S.   P Mishra

doi:10.17577/IJERTV5IS080400

Volume 05, Issue 08 (August 2016)

A Review on Parallel Genetic Algorithm Models for Map Reduce in Big Data

DOI : 10.17577/IJERTV5IS080400

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 78
Total Downloads : 168
Authors : Kiranjit Pattnaik, B. S. P Mishra
Paper ID : IJERTV5IS080400
Volume & Issue : Volume 05, Issue 08 (August 2016)
DOI : http://dx.doi.org/10.17577/IJERTV5IS080400
Published (First Online): 29-08-2016
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

A Review on Parallel Genetic Algorithm Models for Map Reduce in Big Data

Kiranjit Pattnaik

School of Computer Engineering KIIT University

Bhubaneswar-751024, Odisha, India.

Bhabani Shankar Prasad Mishra School of Computer Engineering KIIT University

Bhubaneswar-751024, Odisha, India.

Abstract – Map Reduce is the basic software framework used in the field of Big Data because of its high scalability. The use of parallel genetic algorithms in map reduce gives more accurate close to optimal value and due to its parallel nature huge volume of data can be handled properly. This survey aims to collect and organize most of the recent publications on the parallel genetic algorithms using map reduce. This paper presents a comparative view of different parallel models of genetic algorithm for map reduce and also provides a detail information on different platform and tools for map reduce.

Keywords – Parallel Genetic Algorithm, Map Reduce, Big Data, Map Reduce using Global Parallelization, Map Reduce using Coarse Grained Parallelization, Map Reduce using Fine Grained Parallelization.

INTRODUCTION

Big data is a new generation of architectures & technologies, which extracts data from very huge volume of different varieties of data and by implementing high velocity capture, discovery and/or analysis.

Basically there exist 3 Vs of Big Data ([1], [2]) as
- Volume – How much data (Yottabytes, Pettabytes)
- Velocity – How fast the data is processed
- Variety – Different types of data
  
  But advance research has shown that only these 3 Vs of Big Data is not sufficient enough to describe the Big Data. So, 2 more Vs of Big Data has been summarized.
- Veracity – Uncertainty of data (Data incompleteness)
- Value – Turn Big Data into values else useless (business perspective)
  
  Big data is a combination of the new and old data- management technologies that has been developed over time. To gain the true insights in the organizations Big Data allows organizations to manage, manipulate and store the huge volume of data at the required time and precise speed.
  
  Map Reduce [3] deals with a large volume of data. Evolutionary algorithms can be used in Map Reduce for further optimization of the data. EAs like Genetic Algorithm can be applied to large search space due to its efficient use of operation like selection, crossover and mutation. As GA operates one time consuming so, parallelization of GA can be a solution to it. The parallel GA does parallel search from multiple points in the space and has higher efficiency and efficacy than the sequential
  
  GAs. It works on the coding of the problem and gives a better search even if no parallel hardware is used. The parallel GAs yield alternate solutions to a problem and is basically robust in nature and has access to easy parallelization as islands or neighborhood as well as has an easy cooperation with other search procedures.
  
  This paper explains how different PGA models can be combined in Map Reduce ([4], [5]) technique for the abstraction of huge data set. The rest of the paper is organized as below. Section 2 describes about the Map Reduce technique. Section 3 describes about different PGA models. PGA based Map Reduce is described in Section 4. Comparison of different PGA models and conclusion is described in Section 5 and 6 respectively.
MAP REDUCE

Map Reduce is a framework in which huge data sets are processed parallely using a large number of computers or also termed as nodes. The nodes are denoted as a cluster or grid depending if the nodes are all on the same local network using similar hardware or are shared across distributed network using heterogeneous hardware. The data is stored either in an unstructured file system or in a structured database.

Map Reduce is called the heart of Hadoop. Huge scalability is provided across hundreds or thousands of servers in a Hadoop cluster. The Hadoop programs perform two distinct tasks of Map Reduce. The first being the Map task, where a set of data is taken as input and then changed into another set of data for further processing, where each individual component is reduced into key/value pairs. The second is the Reduce task whose input is the output produced by the Map task and then it combines those key/value pairs into another set of key/value pairs. The reduce task always comes into action after the Map task. The basic phases involved in map reduce are:
- Map: The map function is applied by each worker to the local data, and writes the output to a temporary storage. Here the master node sees that if duplicate copies of the input data is present then only one of them is processed.
- Shuffle: The data produced by the map function is distributed across the worker nodes based on the output keys such that data belonging to a particular key is located on the same worker node.
- Reduce: The reduce function is applied on each worker node to process the output data from the shuffle phase as per key/value in parallel and produce the final output.
Map Function : Map (key 1, value 1) -> list (key 2, value 2)

Reduce Function : Reduce (key 2, list (value 2)) -> list (key 3, value 3)

Figure 1: Two Phases of Map Reduce
1. Divide the input into m parts where m is number of mappers and send it to each mapper.

2. Execute the map function on the input for each mapper.

3. Combine the similar sets key/value pairs and send it to the shuffle/sort phase.

4. Sort the intermediate key/value pairs as per the key.

5. Execute each list of intermediate key/value pairs by the reduce function (if there exists r reduce phases then divide it into r part and execute the reduce function)

6. Summarize or combine the output of the reduce function to form the output
PARALLEL GENETIC ALGORITHM (PGA) Genetic algorithms [6] are stochastic search algorithms centered on ethics of natural selection and recombination. It helps to find the optimal solution to the problem by handling the population of the required solutions. The evaluation of the population is done and the best solutions are selected out of it to mate & reproduce to become candidates for th next generation. After a series of generations, the good traits lead the population, which results in the better quality of the solutions.

The Darwinian evolution is the basic mechanism of the genetic algorithms in which the bad traits are excluded from the population as they appear in individuals which cannot last the process of selection. On the other hand the good traits survive and are mated with others to form better individuals. There also exists an operator in GA termed Mutation which is considered a secondary whose purpose is to guarantee that diversity in the population is not lost, so that the GA can carry on its search. Crossover and Mutation are the two operators in simple GAs which are centered on the natural genetics to explore the search space. The genetic algorithm unlike other optimization techniques apply various genetic operators on the population and are modified for the next generation until a global optimum is found out. For nontrivial problems high computational resource i.e. larger memory and more number of search is required. With this goal many new advances are achieved continuously by designing or using new operators, termination criteria, different hybrid algorithms, etc.

Parallel Genetic Algorithms (PGAs) are not only just the parallel versions of the sequential genetic algorithms but having a parallel algorithm whose behavioral results is better than the sum of the individual behaviors of its constituent sub-algorithms.

3.1 Types of PGA Models

There are different types of parallel genetic algorithm (PGA) models and the parallelization is done on the population i.e. some use single population while others use each individuals or the population being sub divided into various subpopulations. Of these some methods can achieve massive parallel computer architectures while others to multicomputer with some powerful processing elements. Basically there exist three models of parallel genetic algorithm ([7], [8], [9]) and they are

Master Slave Model GA (Global Parallelization)
Grid Model GA (Fine Grained Parallelization)
Island Model GA (Coarse Grained Parallelization)

In a Master Slave GA or the Global Parallelization there exists a single population, but the fitness evaluation is scattered among several different processors. In this model, the whole population is considered by the selection and crossover operators. In the Global Parallelization the population is managed by the master process and the individuals are forwarded to the slave processors for evaluation. After the evaluation is complete, the results are collected by the master and the genetic operators are

applied to yield the next generation. This architecture can be implemented on both shared memory multiprocessors as well as distributed memory machines including networked computers. The architecture of the Master Slave Model is shown in Figure 2 :

Figure 2: Architecture of the Master Slave Model

In the Grid model or the Fine Grained parallelization all the operations are done on each individual rather than on the whole population. This model consist of one spatially structured population and is suitable for massive parallel computers. In this model selection and mating are limited to a small neighborhood, but the neighborhoods overlap allowing some communication among all the other individuals. The ideal case would be having only one individual for each available processing element. The architecture of the Grid Model is shown in Figure 3 :

Figure 3: Architecture of the Grid Model

The Coarse Grained parallel GAs are also very much similar to the master slave model in structure but it mainly works with the sub-populations i.e. the population is sub divided into various groups and each group is then mapped separately and executed and it follows a new method termed migration which works with each individual chromosomes and the best one is migrated to the next step. In this model on each subpopulation the GA operations are performed and the subpopulations interchange data information by permitting individuals to allocate from one island to another as per the given criteria. The architecture if the Island Model is shown in the Figure 4 :

Figure 4: Architecture of the Island Model
PGA BASED MAP REDUCE

There are three basic models of PGA like Global Parallelization, Coarse Grained Parallelization and the Fine Grained Parallelization. In the Map Reduce framework the map and the reduce phase is user defined and it can be modified as per the requirements. In such a scenario the Parallel Genetic Algorithms is used in its map and the

reduce phase accordingly following the specific parallelization method.
As final level of parallelization, a fine-grained model can be achieved with a slight variation in the last proposal. In this model, each chromosome is assigned to a single Mapper and then to a randomly selected Reducer. This can be obtained by acting on the way the Mapper module produces key values, which now are pseudo-randomly generated. (Pseudo-Random function is a way to implement the neighboring required by the fine-grained model.)

Rohit Konkedar et. al. (2012) [14] proposed A MapReduce based Hybrid Genetic Algorithm using Island Approach for solving Time Dependent Vehicle Routing Problem in which a distributed Island model PGA is implemented where the execution of its own algorithm and the maintenance of the sub population is carried out by each node where probabilistic selection and replacement based on fitness is used. Crossover and mutation is performed on each island thereby building a diverse population. The proposed model architecture is shown in Figure 8 :

Figure 8: Architecture of the Map Reduce using Hybrid Parallelization

Steps involved in the proposed algorithm :

1. Generate the initial population using Random search and the Savings Algorithm in which the route is constructed and listed in decreasing order as well as by using the NNC Algorithm which is an improved algorithm of the nearest neighbor algorithm for the constructing the route. The population size is maintained to be around 200 per reducer.

2. Map Phase : In this phase fitness evaluation and 2- opt local optimization is carried out. The 2-opt local optimization is a local search algorithm in which the neighborhood exploration is enhanced by maintaining a certain probability by removing the two edges from the route and connecting the two paths created. Island number is assigned randomly to each individual with results as key and value pairs, in which key is the island number and value is the mixture of string representation of individual and the fitness in which the routes are separated by #.In the following map phases key is the island number of the previous phase and same operations are performed on the individuals.

3. Reduce Phase : The individual would go to that reducer partition which is decided by the island number. In the reducer, the individuals are ordered in ascending order as per the fitness value and two individuals are chosen for the crossover operation by using the ranking method. Maximal routes preservation and mutation is done by the Maximal route preserving crossover (MRPC). With certain probability migration is carried out by changing the key which is the island number of each individual. Individuals are produced in form of key and value pair, where the island number is termed the key and the child generated from crossover would be the value. The individual having maximum fitness during the selection process for elitism is emitted. The maximum fitness of each reducer is written in separate files, which are grouped after each map reduce phases and then are sorted and read for testing the termination norms. Until the termination norm is satisfied these steps are repeated.

DISCUSSION

Various different parallel genetic algorithm models using map reduce has been stated before and now a comparison is made between these models along with the environments used.

Comparison

Verma et. al. (2009) provided another approach for using MapReduce for parallel GAs and highlighted several concerns and limitations of the proposal described in Gin et. al. (2008). The main concern was about scalability of parallel GAs. The previous proposal max limit was 32 nodes, due to the essential serial organization imposed by the use of a single coordinator, which performs mutation, crossover, and evaluation of the convergence criteria. Furthermore, they highlighted that the extension proposed in Gin et. al. (2008) could easily be implemented in the traditional MapReduce model using a Combiner instead of the proposed local reduce, as shown in Dean et. al. (2008). Finally they pointed out that using the same value as keys

produced by the mapper, local reducer and the reducer, MRPGA does not hire any characteristics of the Map Reduce model (i.e., the shuffle and the group by keys). Differently, Verma et al. (2009) exploited the traditional MapReduce model by implementing GA in Hadoop to solve the onemax problem. They investigated scalability and convergence of the proposed GA and found that larger problems could be resolved by adding more computational resources without changing algorithm implementation.

The model presented by A. Verma et. al. (2009) had a big input output imprint because full population is saved to HDFS after each generation of GA and due to this huge performance degradation is caused but there exists no species problem whereas in the model presented by Dino Keco et. al. (2012) most of operations performed has been moved from reduce phase to map phase. This change reduces amount of input output imprint because instead of HDFS all the processing data is kept in the memory, but species problem arises in this model as we have different population for each node.

In the third model an extension of Map Reduce is done with the addition of a second reduce phase and a special optimization on the merge phase is carried out for the second reduce operation. For handling heterogeneity and failures this extension of PGA on map reduce is beneficial.

The Hybrid approach proposed by Rohit Konkedar et. al. (2012) applies the local search algorithm for further optimizations and also generates initial population of very good quality. Large scale Vehicle Routing Problem with hundreds or thousands of customer nodes can be solved easily with very less time and full utilization of resource which is impossible in single machine.

The coarse grained model for the Automatic generation of Junit Test cases using PGA based Map Reduce proposed by Federica Sarro et. al. (2012) showed that using this technique on a small or less number of clusters the results obtained saved 50% of the using by using PGA and Map Reduce.

In the paper by Chao Jin et. al. (2008) on an Extension of Map Reduce for Parallelization of Genetic Algorithm it was observed that this extension benefits from the Map Reduce model on handling heterogeneity and failures.
Environment and Tools

Table 1: Tools and Environments used in different papers for experimentation.

	Name of the Author( s)	Year of publicati on	Process or	Operati ng System	No. of nodes Hadoo p Cluster	R A M (in G B)
Scaling Genetic Algorithm using Map Reduce	A. Verma et. al. [11]	2009	Dual Intel Quad Core	Red Hat Enterpri se	52	16
Parallelization of Genetic Algorithms using Hadoop Map/Reduce	Dino Keco et. al. [10]	2012	i7 Quad Core	Cent OS 5.6	10	4
A parallel Genetic Algorithm Based on Hadoop Map Reduce for the Automatic Generation of Junit Test Suites	Federic a Sarro et. al. [12]	2012	Intel Core i3 2100	Window s 7 Home Premiu m SP1 64 bit	3	4
A Map	Rohit	2012	Intel	Ubuntu	7	3.
Reduce based	Kondek		Core	11.04		1
Hybrid	ar et. al.		i3-2100
Genetic	[14]		(Dual
Algorithm			Core)
using Island
Approach for
solving Time
Dependent
Vehicle
Routing
Problem
MRPGA : An	Chao	2008	Pentiu	Window	33	1
Extension of	Jin et.		m 4	s XP
Map Reduce	al. [13]
for
Parallelizing
Genetic
Algorithms

CONCLUSION

In this paper we reviewed some of the most recent representative publications on different models of parallel genetic algorithm using map reduce in the Hadoop platform for big data. This survey focused on the use of different models of parallel genetic algorithm for map reduce and the review started by giving an introduction on big data and the map reduce technology vastly used in big data and classified our work into three categories based on the three different parallel genetic algorithm models namely the global parallelization, coarse grained parallelization and the fine grained parallelization. Contributions by the associated authors in each of these categories using map reduce framework was discussed and the merits and demerits of each of the parallel model using map reduce was discussed and compared with each other. The use of different parallel models of genetic algorithm in

map reduce would help to analyze and execute the huge volume of data in big data efficiently in a parallel manner.

REFERENCES

Judith Hurwitz, Alan Nugent, Dr. Fern Halper, Marcia Kaufman, Big Data for Dummies, John Wiley & Sons, Inc., 2013, ch. 1.
Md. Rezaul Karim, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, Ho-Jin Choi, An efficient Distributed Programming Model for Mining Useful Patterns in Big Datasets, IETE Technical Review, vol. 30, pp. 53-63, 2013.
Cairong Yan, Yongfeng Cairong, Guangwei XU, Runtimes and Optimizations for Map Reduce, IETE Technical Review, vol. 30, pp. 506-515, 2013.
Jeffrey Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM-50th anniversary issue: 1958-2008, vol. 51, pp. 107- 113, 2008.
Xiao Yang, Jialing Sun, An Analytical performance model of MapReduce, IEEE Conference on Cloud Computing and Intelligence Systems (CCIS), pp. 306-310, 2011.
Frenzel,J.F, Genetic Algorithms, Potentials IEEE, vol. 12, pp. 21-24, 2002.
B.S.P. Mishra, S. Dehuri, Parallel Computing Environments, IETE Technical Review, Vol. 28, Iss. 3, pp. 240-247, 2011.
B.S.P. Mishra, S. Dehuri, R. Mall, A. Ghosh, Parallel Single and Multiple Objectives Genetic Algorithms: A Survey, International Journal of Applied Evolutionary Computation, vol. 2, 2011.
Erick CantÃº-Paz, A survey of Parallel Genetic Algorithms, Calculateurs Parallelles, Reseaux et Systems Repartis 10(2), pp. 141 171, 1998.
Dino Keco, Abdulhamit Subasi, Parallelization of genetic algorithms using Hadoop Map/Reduce, South East Europe Journal of Soft Computing, 2012.
Abhishek Verma, XavierLlor'a, David E. Goldberg, Roy H. Campbell, Scaling Genetic Algorithms using MapReduce, Ninth International Conference on Intelligent Systems Design and Applications, ISDA 09, pp. 13-18, 2009.
Linda Di Geronimo, Filomena Ferrucci, Alfonso Murolo, Federica Sarro, A Parallel Genetic Algorithm Based on Hadoop MapReduce for the Automatic Generation of JUnit Test Suites, 2012 IEEE Fifth Conference on Software Testing, Verification and Validation, pp. 785 793, 2012.
Chao Jin, Vecchiola, C., Buyya, R., MRPGA: An Extension of MapReduce for parallelizing Genetic Algorithms, IEEE Fourth International Conference on eScience, 2008.eScience08, pp. 214-221, 2008.
Rohit Kondekar, Akash Gupta, Gulshan Saluja, Richa Maru, Ankit Rokde, Parag Deshpande, A MapReduce Based Hybrid Genetic Algorithm Using Island Approach for Solving Time Dependent Vehicle Routing Problem, 2012 International Conference on Computer and Information Science (ICCIS), vol. 1, pp. 263 269, 2012.

A Review on Parallel Genetic Algorithm Models for Map Reduce in Big Data

Leave a Reply