 Open Access
 Total Downloads : 162
Authors : VVD Prasad Challuri, BLVV Kumar, K Purushotham Naidu, M Santosh
 Paper ID : IJERTV6IS060431
 Volume & Issue : Volume 06, Issue 06 (June 2017)
 DOI : http://dx.doi.org/10.17577/IJERTV6IS060431
Published (First Online): 04-07-2017
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Evaluation of Optimised Apriori Algorithm on HDFS using MapReduce in Hadoop Distributed Mode
1 VVD Prasad Chelluri, 2 BLVV Kumar, 3 K. Purushotham Naidu, 4 M. Santhosh
1,2,3 Assistant Professor
Dept. of Computer Science Engineering,
Gayatri Vidya Parishad College of Engineering for Women, Visakhapatnam, India
Abstract — With the revolutionary change in data analytics, techniques are required that can scale with current data-processing methods. To keep pace with this rapid progress in information evaluation, calibration and storage patterns, the development and implementation of large-scale data-processing algorithms is gaining importance. In data mining, association rule mining and classification are well-utilised methodologies for identifying significant relations in large-scale analytics. The Apriori algorithm is one such crucial algorithm for mining the frequent itemsets that form the basis for finding association rules among the data; analysing frequent itemsets is a crucial step in finding the rules and associations between them, and stands as a primary foundation for decision making. With the advent of Hadoop MapReduce, parallel processing and efficient memory utilisation have become practical. This paper evaluates the potential of the Apriori algorithm implemented as one-phase and k-phase Apriori algorithms in the MapReduce framework, and further implements an Optimised Apriori Algorithm (OAA) that exploits the full benefits of MapReduce. The results show that the Optimised Apriori Algorithm yields better efficiency and reduced time complexity.
Index Terms: Apriori Algorithm, Optimised Apriori Algorithm, MapReduce.
INTRODUCTION
Recent trends in storage, processing and networking technologies have led to rapid growth of huge volumes of data in both scientific and commercial domains. Organizations are increasingly inclined to make better use of this data, and of data-processing techniques, for efficient decision making. Since the data is voluminous, it requires appropriate computing environments and frameworks to increase the precision that directly influences decision making in real-time scenarios. The Hadoop framework is one such large-scale distributed batch-processing infrastructure for parallel processing of voluminous data, otherwise called Big Data, that flows over a huge cluster of commodity computers. Hadoop is an open-source Apache project that implements Google's file system as the Hadoop Distributed File System (HDFS) and Google's processing framework as the Hadoop MapReduce programming model. All the algorithms in this paper were implemented on Hadoop using the MapReduce paradigm. MapReduce is a parallel programming model designed for processing large volumes of data by breaking a job into independent tasks across a large number of machines. MapReduce programming is inherited from list-processing languages such as LISP, and consists of two functions, Mapper and Reducer, which run on the machines of a Hadoop cluster. The input and output of these functions are <key, value> pairs. The Mapper reads input pairs <k1, v1> from HDFS and produces a list of intermediate pairs <k2, v2>. An optional Combiner function can be applied to reduce the communication cost of transferring intermediate outputs from mappers to reducers. Generally, the mapper's output pairs are sorted locally, grouped on the same key, and given as input to the combiner to compute local sums.
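The Mapper/Reducer flow described above can be sketched in plain Java. This is a minimal single-JVM simulation using only the JDK collections, not the Hadoop API; the class and method names are illustrative assumptions.

```java
import java.util.*;

// Minimal single-JVM sketch of the Mapper -> Reducer flow described
// above (an illustration with hypothetical names, not the Hadoop API).
public class MapReduceSketch {

    // Mapper: one transaction line -> a list of <item, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String item : line.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(item, 1));
        }
        return out;
    }

    // Reducer (also usable as a combiner): group pairs by key, sum values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted, like shuffle
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String t : new String[] {"bread milk", "bread butter", "milk"}) {
            intermediate.addAll(map(t));            // map phase
        }
        System.out.println(reduce(intermediate));    // prints {bread=2, butter=1, milk=2}
    }
}
```

In a real cluster the grouping and sorting between `map` and `reduce` is performed by Hadoop's shuffle phase across machines; here a single `TreeMap` stands in for it.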
With its efficient and rapid processing capabilities, Hadoop has become a predominant tool for data mining and knowledge discovery, extracting useful, hidden and previously unknown patterns from large databases. Among the many areas of data mining used for decision making, association rule mining is one such concept, and Apriori, proposed by R. Agrawal and R. Srikant, is the basic and most popular algorithm for association rule mining, finding frequent itemsets based on candidate generation. The candidate set is a superset of the frequent itemsets. The name Apriori derives from the Apriori property, which states that all non-empty subsets of a frequent itemset must also be frequent. The core step of the algorithm is the generation of candidate k-itemsets Ck from the frequent (k-1)-itemsets Lk-1. [1][2][3] There have been wide variations in implementations of the Apriori algorithm. In this paper we implement an Optimised Apriori Algorithm (OAA) and evaluate its performance against the one-phase and K-phase Apriori algorithms; the results show that OAA performs much better than the other two. [4][5][6]
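The candidate-generation step just described (build Ck from Lk-1 by self-join, then prune using the Apriori property) can be sketched as follows. Itemsets are represented as sorted lists of item names; the method name `apGen` and the representation are illustrative assumptions, not the authors' code.

```java
import java.util.*;

// Sketch of Apriori candidate generation: join frequent (k-1)-itemsets
// with themselves, then prune candidates that have an infrequent
// (k-1)-subset, per the Apriori property.
public class AprioriGen {

    static List<List<String>> apGen(List<List<String>> lkMinus1) {
        Set<List<String>> prev = new HashSet<>(lkMinus1);
        List<List<String>> candidates = new ArrayList<>();
        for (int i = 0; i < lkMinus1.size(); i++) {
            for (int j = i + 1; j < lkMinus1.size(); j++) {
                List<String> a = lkMinus1.get(i), b = lkMinus1.get(j);
                int k = a.size();
                // Self-join: all but the last item equal, last items ordered.
                if (a.subList(0, k - 1).equals(b.subList(0, k - 1))
                        && a.get(k - 1).compareTo(b.get(k - 1)) < 0) {
                    List<String> c = new ArrayList<>(a);
                    c.add(b.get(k - 1));
                    // Prune: every (k-1)-subset obtained by dropping one
                    // item from c must itself be frequent.
                    boolean ok = true;
                    for (int d = 0; d < c.size(); d++) {
                        List<String> sub = new ArrayList<>(c);
                        sub.remove(d);
                        if (!prev.contains(sub)) { ok = false; break; }
                    }
                    if (ok) candidates.add(c);
                }
            }
        }
        return candidates;
    }
}
```

For example, from L2 = {AB, AC, BC, BD} the join produces ABC and BCD, and the prune step discards BCD because its subset CD is not frequent.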
Related Work
Apriori Algorithm: One-Phase and K-Phase
As outlined earlier, the basic Apriori algorithm lacks efficiency when dealing with voluminous data sets, and most of the optimised Apriori techniques using MapReduce have improved its efficiency through multiple techniques. Our work therefore evaluates the performance of the Optimised Apriori Algorithm (OAA) against two basic Apriori models: the one-phase Apriori algorithm and the K-phase Apriori algorithm. [4][6][7][8]
In the one-phase Apriori algorithm, only a single MapReduce phase is used to find all frequent k-itemsets. Although this has little implementation complexity, its time complexity is very high, making it inefficient; our experimental results underline the same effect. The following algorithm explains the implementation of the one-phase algorithm.
Map Task: // one for each split
Input: Si // split i, line = transaction
Output: <key, 1> pairs, where key is an element of the candidate itemsets
  For each transaction t in Si
    Map(line_offset, t) // map function
      For each itemset I in t /* I = all possible subsets of t */
        Out(I, 1);
      End foreach
    End map
  End foreach
End

Reduce Task:
Input: <key2, value2> pairs and minimum_support_count, where key2 is an element of the candidate itemsets and value2 is its occurrence in each split
Output: <key, value> pairs, where key is an element of the frequent itemsets and value is its occurrence in the whole data set
  Reduce(key2, value2) // reduce function
    sum = 0;
    While (value2.hasNext())
      sum += value2.getNext();
    End while
    If (sum >= min_sup_count)
      Out(key2, sum);
    End if
  End reduce
End

Fig. 1. One-Phase Apriori Algorithm
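The one-phase flow can be simulated on a single machine: the map step emits every non-empty subset of each transaction with count 1, and the reduce step keeps the subsets whose total count reaches the minimum support count. This is an illustrative sketch (the method name is hypothetical), and it also makes the inefficiency visible — the number of emitted pairs grows exponentially with transaction length.

```java
import java.util.*;

// Single-machine sketch of the one-phase algorithm above: emit all
// non-empty subsets of each transaction (map), sum counts and filter
// by min_sup_count (reduce). Illustrative only; a real run would
// distribute the splits across mappers.
public class OnePhaseSketch {

    static Map<Set<String>, Integer> frequentItemsets(
            List<List<String>> transactions, int minSupCount) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (List<String> t : transactions) {
            int n = t.size();
            for (int mask = 1; mask < (1 << n); mask++) {  // all subsets of t
                Set<String> subset = new TreeSet<>();
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) subset.add(t.get(i));
                }
                counts.merge(subset, 1, Integer::sum);     // Out(I, 1)
            }
        }
        counts.values().removeIf(c -> c < minSupCount);    // reduce filter
        return counts;
    }
}
```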

Map Task: // one for each split
Input: Si // split i, line = transaction
Output: <key, 1> pairs, where key is an element of the candidate 1-itemsets
  For each transaction t in Si
    Map(line_offset, t) // map function
      For each item I in t // I = token
        Out(I, 1);
      End foreach
    End map
  End foreach
End

Reduce Task:
Input: <key2, value2> pairs and minimum_support_count, where key2 is an element of the candidate 1-itemsets and value2 is its occurrence in each split
Output: <key3, value3> pairs, where key3 is an element of the frequent 1-itemsets and value3 is its occurrence in the whole data set
  Reduce(key2, value2) // reduce function
    sum = 0;
    While (value2.hasNext())
      sum += value2.getNext();
    End while
    If (sum >= min_sup_count)
      Out(key2, sum); // collected in Lk
    End if
  End reduce
End
Fig. 2. Algorithm for K-Phase Apriori where k = 1
Map Task: // one for each split
Input: Si, Lk-1
Output: <key, 1> pairs, where key is an element of the candidate k-itemsets
  Read Lk-1 from the Distributed Cache.
  Ck = ap_gen(Lk-1) // self-join
  For each transaction t in Si
    Map(line_offset, t) // map function
      Ct = subset(Ck, t);
      For each candidate C in Ct
        Out(C, 1);
      End foreach
    End map
  End foreach
End

Reduce Task:
The same reduce task as in the previous phase.
Fig. 3. Algorithm for K-Phase Apriori where k >= 2

In the K-Phase Apriori Algorithm (where k is the maximum length of the frequent itemsets), the algorithm needs k phases (MapReduce jobs) to find all frequent k-itemsets: phase one finds the frequent 1-itemsets, phase two the frequent 2-itemsets, and so on. The pseudocode of this algorithm is shown in Figures 2 and 3.
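The k-phase iteration can be sketched as a driver loop in plain Java, where each loop iteration stands in for one MapReduce job: phase 1 counts single items, and each later phase generates candidates from the previous level and counts them against every transaction. The class and helper names are illustrative assumptions; the join here is a naive union-based version of ap_gen without the prune step.

```java
import java.util.*;

// Sketch of the k-phase driver: each while-loop iteration stands in
// for one MapReduce job that counts candidate k-itemsets and keeps
// those reaching min_sup_count.
public class KPhaseSketch {

    static List<Set<Set<String>>> run(List<Set<String>> transactions,
                                      int minSupCount) {
        List<Set<Set<String>>> allLevels = new ArrayList<>();
        // Phase 1: frequent 1-itemsets.
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> t : transactions)
            for (String item : t)
                counts.merge(new TreeSet<>(Arrays.asList(item)), 1, Integer::sum);
        Set<Set<String>> lk = filter(counts, minSupCount);
        // Phases 2..k: candidates from L(k-1), counted over all transactions.
        while (!lk.isEmpty()) {
            allLevels.add(lk);
            Set<Set<String>> candidates = selfJoin(lk);
            counts = new HashMap<>();
            for (Set<String> t : transactions)
                for (Set<String> c : candidates)
                    if (t.containsAll(c)) counts.merge(c, 1, Integer::sum);
            lk = filter(counts, minSupCount);
        }
        return allLevels;
    }

    static Set<Set<String>> filter(Map<Set<String>, Integer> m, int min) {
        Set<Set<String>> out = new HashSet<>();
        m.forEach((k, v) -> { if (v >= min) out.add(k); });
        return out;
    }

    // Naive join: union pairs of (k-1)-itemsets that differ in one item.
    static Set<Set<String>> selfJoin(Set<Set<String>> lk) {
        Set<Set<String>> out = new HashSet<>();
        int k = lk.iterator().next().size() + 1;
        for (Set<String> a : lk)
            for (Set<String> b : lk) {
                Set<String> u = new TreeSet<>(a);
                u.addAll(b);
                if (u.size() == k) out.add(u);
            }
        return out;
    }
}
```

Each pass over the transactions corresponds to one MapReduce job's I/O, which is exactly the cost the OAA below tries to cut down to two phases.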
Proposed Algorithm: Optimised Apriori Algorithm (OAA)
Taking into account the real-time behaviour of the one-phase and k-phase Apriori algorithms, we have implemented an Optimised Apriori Algorithm that needs only two MapReduce phases to find all frequent k-itemsets.
Figure 4 explains the data flow of the proposed OAA, where each input split is assigned a mapper that applies the map function. Unlike the one-phase and k-phase Apriori, in OAA the value parameter of the <key, value> pair takes the entire split as input rather than a single-line transaction, and the minimum support count is taken to be the number of transactions in the input split multiplied by the minimum support threshold. The map's output is a list of intermediate key/value pairs, optionally grouped by key via a combiner and stored on the map worker, where the key is an element of the partial frequent k-itemsets and the value is its partial count. When all map tasks are finished, the reduce task (executed by a reduce worker) is started. The map outputs are shuffled (fetched) to the reduce worker, which calls a reduce function. The output of the reduce function is a list (Lp) of key/value pairs, where the key is an element of the partial frequent k-itemsets and the value equals one, stored in HDFS. Figure 5 shows the pseudocode of this phase.
Figure 4. Data Flow of the Optimised Apriori Algorithm (OAA)

Map Task: // one for each split
Input: Si // split i, line = transaction
Output: <key, value> pairs, where key is an element of the partial frequent k-itemsets and value is its partial occurrence
  Map(object, Si) // map function
    L = apply_Apriori_on(Si); /* Partial_min_sup_count is used */
    For each itemset I in L
      Out(I, partial_count);
    End foreach
  End map
End

Reduce Task:
Input: <key2, value2> pairs, where key2 is an element of the partial frequent k-itemsets and value2 is its occurrence in each split
Output: <key3, 1> pairs, where key3 is an element of the global candidate frequent k-itemsets
  Reduce(key2, value2) // reduce function
    Out(key2, 1); // collected in Lp
  End reduce
End

Fig. 5. Pseudocode of Phase One of the Optimised Apriori Algorithm (OAA)

In phase two (dashed arrows in Figure 6), one extra input is added to the data flow of the previous phase, which is a file (copied from the Hadoop Distributed Cache, which in the
standalone mode resides in the local file system) that contains all partial frequent k-itemsets. The map function of this phase counts the occurrence of each element of the partial frequent k-itemsets in the split and outputs a list of key/value pairs, where the key is an element of the partial frequent k-itemsets and the value is the total occurrence of this key in the split. The reduce function outputs a list (Lg) of key/value pairs, where the key is an element of the global frequent k-itemsets (a subset of the partial frequent k-itemsets) and the value is its occurrence in the whole data set. Figure 6 shows the pseudocode of this phase. [7][8]
Map Task: // one for each split
Input: Si, Lp
Output: <key, value> pairs, where key is an element of Lp and value is its partial occurrence in the split
  Read Lp from the Distributed Cache.
  Map(object, Si) // map function
    For each itemset I in Lp
      count = count_I_in_Si(I, Si);
      Out(I, count);
    End foreach
  End map
End

Reduce Task:
Input: <key2, value2> pairs, where key2 is an element of the global candidate k-itemsets and value2 is its occurrence in each split
Output: <key3, value3> pairs, where key3 is an element of the global frequent k-itemsets and value3 is its global occurrence in the whole data set
  Reduce(key2, value2) // reduce function
    sum = 0;
    While (value2.hasNext())
      sum += value2.getNext();
    End while
    If (sum >= min_sup_count)
      Out(key2, sum); // collected in Lg
    End if
  End reduce
End
Fig. 6. Pseudocode of Phase Two of the Optimised Apriori Algorithm (OAA)
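The two OAA phases can be sketched end to end on a single machine: phase one mines each split locally with a partial support count (split size times the threshold) and unions the results into the candidate list Lp; phase two counts every Lp candidate over the whole data set and keeps the globally frequent ones. All names here are illustrative assumptions, and local mining enumerates subsets for brevity rather than running a full local Apriori.

```java
import java.util.*;

// Two-phase OAA sketch following Figures 5 and 6. Phase one: partial
// frequent itemsets per split -> candidate list Lp. Phase two: count
// each Lp candidate over all splits, keep the globally frequent ones.
public class OaaSketch {

    // Phase one map: partial frequent itemsets of one split.
    static Set<Set<String>> localFrequent(List<List<String>> split,
                                          double threshold) {
        int partialMinSup = (int) Math.ceil(split.size() * threshold);
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (List<String> t : split)
            for (int mask = 1; mask < (1 << t.size()); mask++) {
                Set<String> s = new TreeSet<>();
                for (int i = 0; i < t.size(); i++)
                    if ((mask & (1 << i)) != 0) s.add(t.get(i));
                counts.merge(s, 1, Integer::sum);
            }
        Set<Set<String>> lp = new HashSet<>();
        counts.forEach((k, v) -> { if (v >= partialMinSup) lp.add(k); });
        return lp;
    }

    static Map<Set<String>, Integer> globalFrequent(
            List<List<List<String>>> splits, double threshold) {
        // Phase one: union of partial frequents = candidate list Lp.
        Set<Set<String>> lp = new HashSet<>();
        int total = 0;
        for (List<List<String>> split : splits) {
            lp.addAll(localFrequent(split, threshold));
            total += split.size();
        }
        // Phase two: count each candidate over the whole data set.
        int minSup = (int) Math.ceil(total * threshold);
        Map<Set<String>, Integer> global = new HashMap<>();
        for (Set<String> c : lp) {
            int sum = 0;
            for (List<List<String>> split : splits)
                for (List<String> t : split)
                    if (t.containsAll(c)) sum++;
            if (sum >= minSup) global.put(c, sum);
        }
        return global;
    }
}
```

The correctness of the two-phase scheme rests on the observation that any globally frequent itemset must be locally frequent in at least one split, so Lp never misses a true answer.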
Experimental Setup and Result Evaluation
The experimental setup consists of four clusters, each with four nodes; every node runs Ubuntu 14.04 LTS with Hadoop 2.6.0 (MapReduce) and Java 1.8.0_121.
The data set used is T1014TD100K, generated by IBM's Quest Synthetic Data Generator. The total number of transactions is 2,500,000, with each transaction containing 20 items on average; the total number of items is 8,000, and the average length of the frequent itemsets is 4. [9][10][11]
The following tables summarise the performance of the three algorithms: one-phase, K-phase, and the Optimised Apriori Algorithm (OAA).
Transactions   500,000   1,000,000   2,000,000   2,500,000
One-Phase       1.06       1.66        1.78        1.96
K-Phase         0.32       0.64        0.89        1.36
OAA             1.326      2.2         2.64        3.01

Performance evaluation of the three algorithms with an increasing number of transactions
Transactions   500,000   1,000,000   2,000,000   2,500,000
One-Phase       1.002      1.64        1.96        2.2
K-Phase         0.94       1.3         1.62        1.7
OAA             0.17       0.26        0.78        0.94

Performance evaluation of the three algorithms at a minimum support threshold of 60
CONCLUSION:
It is evident from the experimental results that the Optimised Apriori Algorithm yields better results in all aspects when compared to the one-phase and K-phase algorithms. Although all the algorithms perform better when implemented in a distributed environment, the optimisation has been centred on the Hadoop platform; the algorithm nevertheless has evident flexibility to accommodate further optimisation that can reduce the number of iterations, making it markedly more efficient. Further, an implementation of the Optimised Apriori Algorithm in Apache Spark is anticipated to outperform the existing algorithm, since Apache Spark provides memory-based computation, reducing the cost of iteration and I/O access.
REFERENCES:

[1] Sunil Kumar Khatri and Diksha Deo, "Implementation of Enhanced Apriori Algorithm with Map Reduce for Optimizing Big Data," IJIT – International Journal of Information Technology, Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi, India, Vol. 7, No. 2, July–December 2015, ISSN 0973-5658.
[2] A. L. Sayeth Saabith, Elankovan Sundararajan, and Azuraliza Abu, "Parallel Implementation of Apriori Algorithms on the Hadoop-MapReduce Platform – An Evaluation of Literature," Journal of Theoretical and Applied Information Technology, Vol. 85, No. 3, 31 March 2016.
[3] Anjan K Koundinya, Srinath N K, K A K Sharma, Kiran Kumar, Madhu M N, and Kiran U Shanbag, "Map/Reduce Design and Implementation of Apriori Algorithm for Handling Voluminous Datasets," Advanced Computing: An International Journal (ACIJ), Vol. 3, No. 6, November 2012.
[4] Reshma Gummadi, Sudhir Tirumalasetty, and Sreenivasa Reddy Edara, "An Enhanced Apriori Algorithm for Finding Frequency of Items over Big Transactional Data based on MapReduce Framework," IJCTA, Vol. 9, No. 17, 2016, pp. 8541–8548, International Science Press.
[5] Sudhakar Singh, Rakhi Garg, and P. K. Mishra, "Review of Apriori Based Algorithms on MapReduce Framework."
[6] Jongwook Woo, "Apriori-Map/Reduce Algorithm," Computer Information Systems Department, California State University, Los Angeles, CA.
[7] Sanjay Rathee (Indian Institute of Technology Mandi, Himachal Pradesh, India), Manohar Kaul (Technische Universität Berlin, Germany), and Arti Kashyap (Indian Institute of Technology Mandi, Himachal Pradesh, India), "R-Apriori: An Efficient Apriori based Algorithm on Spark."
[8] Ning Li et al., "Parallel Implementation of Apriori Algorithm Based on MapReduce," International Journal of Networked and Distributed Computing, Vol. 1, No. 2, April 2013, pp. 89–96, Atlantis Press.
[9] JA Nosille, "Performance Evaluation of Apriori Algorithm on a Hadoop Cluster," Department of Automation and Applied Informatics, Budapest University of Technology and Economics, Budapest, Hungary.
[10] K Murali Gopal and Ranjit Patnaik, "Performance Analysis of Association Rule Mining Using Hadoop," (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 7, No. 6, 2016, pp. 2442–2444, Gandhi Institute of Engineering and Technology, Gunupur.
[11] "Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster," International Journal of Computer Applications, October 2015.