Efficient Techniques for Mining Association Rules: A Comparative Study

S. Ramaiah; N. Jaya Krishna; Y. Suresh

doi:10.17577/IJERTCONV2IS15010

NCDMA - 2014 (Volume 2 - Issue 15)


Paper ID : IJERTCONV2IS15010
Published (First Online): 30-07-2018
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


S. Ramaiah, Asst. Professor, Dept. of CSE, KMMITS

N. Jaya Krishna, Asst. Professor, Dept. of CSE, SREC

Y. Suresh, Asst. Professor, Dept. of CSE, KMMITS

Tirupathi

Abstract: The amount of data stored for analysis is steadily increasing with the advancement of information technology, and large volumes of data now reside in databases. Data mining therefore plays a major role in analyzing these databases to extract the required data, predict unknown patterns, and form association rules. Association rule mining is one of the important descriptive tasks in data mining; it can be defined as discovering meaningful patterns from a large collection of data. Mining frequent itemsets is a fundamental part of association rule mining. Many algorithms have been proposed over the last decades, including horizontal-layout, vertical-layout, and projected-layout techniques. Most of them have drawbacks: the Apriori algorithm suffers from repeated database scans and a large number of generated candidates, while the FP-Tree algorithm builds a tree and consumes more memory. The Mining Frequent Itemset (MFI) algorithm can also be used to generate associations. Because many transactional databases in the medical retail industry contain the same set of transactions many times, the present work applies this idea to compare three popular algorithms: the Apriori algorithm, the FP-Tree algorithm, and the Mining Frequent Itemset algorithm. The work observes that, with respect to database scans, candidate set generation, conditional FP-trees, and conditional pattern trees, MFI performs better than the Apriori and FP-Tree algorithms.

Keywords: data mining, association rules, candidate generation, Apriori, conditional FP-tree, MFI.

INTRODUCTION

Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention to production control and science exploration.

Data mining can be viewed as a result of the natural evolution of information technology. The abundance of data, coupled with the need for powerful data analysis tools, has been described as a "data rich but information poor" situation. The fast-growing amount of data, collected and stored in large and numerous data repositories, has far exceeded human ability to comprehend it without powerful tools. Consequently, important decisions are often made based not on the information-rich data stored in data repositories but on a decision maker's intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data. In addition, consider expert system technologies, which typically rely on users or domain experts to manually input knowledge into knowledge bases. This procedure is prone to biases and errors, and is extremely time-consuming and costly. Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research. The widening gap between data and information calls for the systematic development of data mining tools that turn data into knowledge.
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based pattern analysis. As an example, consider time-series data: suppose the stock market data of the last several years is available from the New York Stock Exchange and one would like to invest in shares of high-tech industrial companies. A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to one's decision making regarding stock investments.
Data mining has become an essential technology for businesses and researchers in many fields. The number and variety of applications has been growing steadily for several years, and this growth is predicted to continue. Business areas that have integrated data mining into their processes include banking, insurance, retail, and telecom. More recently it has been implemented in pharmaceutics, health, government, and all sorts of e-businesses.

One study describes a scheme to generate a whole set of trading strategies that take into account application constraints, for example timing, current position, and pricing [6]. The authors highlight the importance of developing a suitable back-testing environment that enables the gathering of sufficient evidence to convince end users that the system can be used in practice. They use an evolutionary computation approach that favors trading models with higher stability, which is essential for success in this application domain.

The Apriori algorithm is used as a recommendation engine in an e-commerce system. Based on each visitor's purchase history, the system recommends related, potentially interesting products. It is also used as the basis for a CRM system, as it allows the company itself to follow up on customers' purchases and to recommend other products by e-mail [4].

A government application is proposed by [5]. The problem is connected to the management of the risk associated with social security clients in Australia, and it is formulated as a sequence mining task. The actionability of the model obtained is an essential concern of the authors. They concentrate on the difficult issue of performing an evaluation that takes both technical and business interestingness into account.
PROBLEM STATEMENT

The problem of mining frequent itemsets arises in large transactional databases when association rules among the transactional data are needed to grow a business. Mining frequent itemsets is the crucial task in finding association rules between items. In the retail industry, everyone wants to improve the business, so it is very important to find the items that are sold or purchased most frequently. Once the selling or purchasing trend of customers is known, better service can be provided to them, which in turn improves the business. In a retail sales database it is very common for two or more items to be sold or purchased together many times, so the database contains the same set of items many times. Using this observation, the aim is to overcome the limitations of the existing approaches and to mine frequent itemsets from a large transactional database without candidate generation, with the help of existing techniques. Frequent itemset (or pattern) mining is widely studied in the data mining field because of its broad applications in mining association rules, correlations, graph patterns, constraint-based frequent patterns, sequential patterns, and many other data mining tasks. To find the frequent items we use basic, popular algorithms. Efficient algorithms for mining frequent itemsets are crucial for mining association rules as well as for many other data mining tasks. We find the frequent patterns based on threshold values: the user-defined minimum support, the support count in the database, and the minimum confidence.
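The two thresholds mentioned above can be made concrete with a small sketch. It assumes the usual definitions (support count of an itemset is the number of transactions containing it; confidence of A -> B is support(A and B together) divided by support(A)). The drug names, the threshold values, and the function names are hypothetical and are not taken from the paper's dataset.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    base = support_count(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical medical-retail transactions (illustrative only).
transactions = [
    ["paracetamol", "cough_syrup"],
    ["paracetamol", "antacid"],
    ["paracetamol", "cough_syrup", "vitamin_c"],
    ["antacid"],
]
min_support_count = 2   # assumed user-defined threshold
min_confidence = 0.6    # assumed user-defined threshold

pair = {"paracetamol", "cough_syrup"}
is_frequent = support_count(transactions, pair) >= min_support_count
rule_conf = confidence(transactions, {"paracetamol"}, {"cough_syrup"})
is_strong = is_frequent and rule_conf >= min_confidence
```

In this toy data the pair occurs in two of the four transactions and paracetamol in three, so the rule paracetamol -> cough_syrup has confidence 2/3 and passes both assumed thresholds.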

In this paper, the database considered is one month of transactions from a medical retailer, used to find frequent items and their associations. Market basket databases are the most commonly used databases for finding frequent items. Given a medical transactional database and a minimum support threshold, the problem is to find the complete set of frequent itemsets, so that relations in customer behavior across the various items (drugs) can be found and used to grow the business.

For the Apriori algorithm, the medical retailer transactional database is used as input to find the frequent items. Beyond the frequent items themselves, the study also records the number of database scans required and the numbers of conditional trees and conditional pattern bases required, and it examines how the Apriori algorithm runs on this database.

The FP-Tree algorithm is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth. It uses an extended prefix-tree structure, named the frequent pattern tree (FP-Tree), to store compressed, crucial information about frequent patterns. We use the medical retailer transactions as input and obtain the trees and the frequent patterns.

The Mining Frequent Itemset algorithm is used to find association rules as well as frequent patterns. We use the medical retailer transaction database as the input to the Mining Frequent Itemset algorithm.

We compare the basic algorithms (Apriori, FP-Tree, and MFI) on the medical retailer transactional database, examining how the MFI algorithm finds frequent items and associations relative to the Apriori and Frequent Pattern (FP-Tree) algorithms.
APRIORI, FP-TREE AND MFI ALGORITHMS

APRIORI ALGORITHM

The first algorithm for mining all frequent itemsets and strong association rules was the AIS algorithm by [10]. The algorithm was later improved and renamed Apriori. Apriori is the most classical and important algorithm for mining frequent itemsets; it is used to find all frequent itemsets in a given database DB. The key idea of the Apriori algorithm is to make multiple passes over the database. It employs an iterative approach known as breadth-first search (level-wise search) through the search space, where k-itemsets are used to explore (k+1)-itemsets. The algorithm depends on the Apriori property, which states that all nonempty subsets of a frequent itemset must also be frequent [12]. It is also described by the anti-monotone property: if an itemset cannot pass the minimum support test, all its supersets will fail the test as well [10, 12]. Therefore, if an itemset is infrequent, then all its supersets are also infrequent; conversely, if an itemset is frequent, all its subsets are frequent. This property is used to prune infrequent candidates. In the beginning, the set of frequent 1-itemsets is found. The set of itemsets that contain one item and satisfy the support threshold is denoted by L1. In each subsequent pass, we begin with a seed set of itemsets found to be large in the previous pass. This seed set is used to generate new potentially large itemsets, called candidate itemsets, and the actual support for these candidates is counted during the pass over the data. At the end of the pass, we determine which of the candidate itemsets are actually large (frequent), and they become the seed for the next pass. Thus L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The feature first introduced by [12] in the Apriori algorithm is used by many algorithms for frequent pattern generation. The basic steps to mine the frequent itemsets are as follows [10]:
1. Generate and test: First find the frequent 1-itemsets L1 by scanning the database and removing from the candidate set C1 all elements that do not satisfy the minimum support criterion.
2. Join step: To obtain the next-level candidates Ck, join the previous frequent itemsets Lk-1 with themselves (a self-join, also described as the Cartesian product of Lk-1 with itself). This step generates new candidate k-itemsets from the Lk-1 found in the previous iteration. Let Ck denote the set of candidate k-itemsets and Lk the set of frequent k-itemsets.
3. Prune step: Ck is a superset of Lk, so its members may or may not be frequent, but all frequent k-itemsets are included in Ck. This step eliminates some of the candidate k-itemsets using the Apriori property. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation. To shrink the size of Ck, the Apriori property is used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent either and can be removed from Ck. Steps 2 and 3 are repeated until no new candidate set is generated.
  
To illustrate this, suppose there are n frequent 1-itemsets and the minimum support is 1. According to [13], Apriori then generates $n^2 + \binom{n}{2}$ candidate 2-itemsets, $\binom{n}{3}$ candidate 3-itemsets, and so on. The total number of candidates therefore grows very quickly: with 1000 items, 1,499,500 candidate 2-itemsets and 166,167,000 candidate 3-itemsets are produced [13].
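The figures quoted above can be checked with a few lines of arithmetic. This is only a sanity check of the numbers as stated in the text, using the standard-library `math.comb` (Python 3.8 or later); the variable names are illustrative.

```python
from math import comb

n = 1000                      # number of frequent 1-itemsets in the example

pairs = comb(n, 2)            # number of distinct 2-item combinations
triples = comb(n, 3)          # number of distinct 3-item combinations
quoted_2_itemsets = n**2 + pairs   # the n^2 + C(n, 2) expression from the text

print(pairs, triples, quoted_2_itemsets)
```

With n = 1000 the 3-itemset figure equals C(1000, 3), and the 2-itemset figure equals 1000^2 + C(1000, 2), which matches the expression given in the text.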
  
There is no doubt that the Apriori algorithm successfully finds the frequent itemsets in a database. But as the dimensionality of the database increases with the number of items:
- More search space is needed and the I/O cost increases.
- The number of database scans increases and more candidates are generated, which increases the computational cost.

Many variations of the Apriori algorithm have therefore been proposed to reduce these limitations, which arise as the database grows.

These variants improve the Apriori algorithm by:
- Reducing the number of passes over the transaction database
- Shrinking the number of candidates
- Facilitating support counting of candidates
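The generate-and-test, join, and prune steps of this section can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the paper: the join is written as a simple union of any two frequent (k-1)-itemsets whose union has k items (rather than the prefix-based self-join), and the function name `apriori` and the toy database are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Generate and test: count single items (C1), keep the frequent ones (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # One pass over the database counts the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


db = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
result = apriori(db, min_count=2)
```

On this toy database the candidate {a, b, c} is removed in the prune step because its subset {b, c} is not frequent, so it is never counted against the database.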
FP-TREE ALGORITHM

The FP-Tree algorithm [4, 14] is based on a recursive divide-and-conquer strategy. First, the set of frequent 1-itemsets and their counts is discovered. Then, starting from each frequent pattern, its conditional pattern base is constructed, followed by its conditional FP-tree (which is a prefix tree). This is repeated until the resulting FP-tree is empty or contains only a single path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern). The items in each transaction are processed in L order, i.e. the items are sorted in descending order of frequency to form a list L.

Conditional pattern base

A sub-database that consists of the set of prefix paths in the FP-tree co-occurring with a suffix pattern. For example, for an itemset X, the set of prefix paths of X forms the conditional pattern base of X, i.e. the paths that co-occur with X [14].
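A small sketch may help make the definition concrete. It computes the conditional pattern base of one item directly from frequency-ordered transactions, which yields the same prefix paths (with counts) that would be read off an FP-tree; building the tree itself is omitted here. The function name, the tie-breaking by item name, and the toy data are assumptions.

```python
from collections import Counter


def conditional_pattern_base(transactions, item, min_count):
    """Return {prefix_path (tuple): count} for `item`.

    Each transaction is reduced to its frequent items, sorted in descending
    order of frequency (ties broken by name); the prefix path is everything
    that precedes `item` in that order.
    """
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i for i, c in counts.items() if c >= min_count}
    if item not in frequent:
        return {}

    base = Counter()
    for t in transactions:
        ordered = sorted(set(t) & frequent, key=lambda i: (-counts[i], i))
        if item in ordered:
            prefix = tuple(ordered[:ordered.index(item)])
            if prefix:  # an empty prefix contributes no path
                base[prefix] += 1
    return dict(base)


db = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
base_c = conditional_pattern_base(db, "c", min_count=2)
```

Here item c occurs after a in one transaction and after a, b in another, so its conditional pattern base has two prefix paths, each with count 1.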

Steps of FP-tree:
1. Create the root of the tree and label it null.
2. Scan the database D to find the frequent 1-itemsets, then process the items of each transaction in decreasing order of frequency.
3. Create a new branch for each transaction, with the corresponding support count.
4. If the same node is encountered in another transaction, simply increment the support count of the common node by 1.
5. Maintain a header table in which each item points to its occurrences in the tree through a chain of node-links.
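The construction steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the header table stores a plain Python list of nodes per item instead of a linked chain of node-links, ties in frequency are broken by item name, and the names `Node` and `build_fp_tree` are hypothetical.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, its count, a parent link, and children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: find the frequent 1-itemsets and their counts.
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = Node(None, None)              # step 1: null root
    header = {i: [] for i in frequent}   # header table: item -> its nodes

    # Second scan: insert each transaction in descending frequency order.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:            # step 3: start a new branch
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1             # step 4: shared prefix, bump count
            node = child
    return root, header


db = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
root, header = build_fp_tree(db, min_count=2)
```

The total count over all nodes of an item in the header table equals that item's support count in the database, which is a quick consistency check on the tree.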
After this process, the FP-tree is mined by creating conditional (sub-)pattern bases:
1. Start from each frequent node and construct its conditional pattern base.
2. Then construct its conditional FP-tree and perform mining on that tree.
3. Join the suffix pattern with the frequent patterns generated from the conditional FP-tree to achieve FP-growth.
4. The union of all frequent patterns found in the steps above gives the required set of frequent itemsets.
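The mining steps above can be illustrated with a compact pattern-growth sketch. To stay short it does not materialize the conditional FP-tree as a tree: each conditional pattern base is kept as a list of (prefix path, count) pairs and mined recursively, which follows the same suffix-growing logic but is not the compressed data structure the algorithm normally uses. The function name and toy data are assumptions.

```python
from collections import Counter


def pattern_growth(weighted_db, min_count, suffix=()):
    """Yield (frozenset(itemset), count) for every frequent itemset.

    `weighted_db` is a list of (items, count) pairs; the top-level call
    passes each transaction with count 1.
    """
    counts = Counter()
    for items, n in weighted_db:
        for item in set(items):
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    # L order: descending frequency, ties broken by item name.
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    sorted_db = [
        (tuple(sorted((i for i in set(items) if i in rank), key=rank.get)), n)
        for items, n in weighted_db
    ]

    for item in reversed(order):          # least frequent item first
        pattern = (item,) + suffix        # join the item with the suffix
        yield frozenset(pattern), frequent[item]
        # Conditional pattern base: prefix paths that precede `item`.
        cond = [(items[:items.index(item)], n)
                for items, n in sorted_db if item in items]
        cond = [(prefix, n) for prefix, n in cond if prefix]
        yield from pattern_growth(cond, min_count, pattern)


db = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
patterns = dict(pattern_growth([(tuple(t), 1) for t in db], min_count=2))
```

No candidate itemsets are generated and tested here; every itemset that is emitted is already known to be frequent within its conditional pattern base.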
MINING FREQUENT ITEMSET (MFI) ALGORITHM

The MFI algorithm is used to avoid candidate generation. The conventional FP-growth algorithm needs to generate a large number of conditional pattern bases and then compute the conditional FP-trees. For a large database this is inefficient and requires a large amount of main memory [13]. The MFI (Mining Frequent Itemset) algorithm is used to mine the frequent itemsets. It uses the frequency in the improved FP-tree together with the Stable count. Association rule mining has two steps: frequent itemset generation, and then finding the valid association rules based on the user-defined support.

This algorithm takes a simple approach to generating the desired frequent itemsets, which are then used to generate strong association rules. Its input parameters are the user-defined support S, the frequency in the Stable count C, the frequency in the FP-tree F, and the root R. Like the existing FP-growth algorithm, it considers all possible combinations of items. Its decisions are based on the user-defined support, the support in the database, and the frequency in the FP-tree.

Steps of Mining Frequent Itemset:
1. Take as input parameters the FP-tree frequency, the user-defined support, the Stable count frequency, and the root node.
2. Compare each item node in the FP-tree with the root node.
3. If the support of the item and the frequency of the item in the FP-tree are the same, assign that frequency to the frequency of the frequent itemset.
4. Generate the itemsets from the item and all possible combinations of the item with the nodes of higher frequency in the FP-tree.
5. If the support of the item is greater than the frequency of the item, add the Stable count to the frequency of the frequent itemset.
6. Generate the itemsets from the item and all possible combinations of the item with all intermediate nodes up to the most frequent item node in the FP-tree.
7. If the support of the item is less than the frequency of the item, generate the itemsets from the item and all possible combinations of the item with its parent node in the FP-tree.
  
If the frequency of the item in the FP-tree is equal to the user-defined support, then the frequency of the frequent itemsets for that item is the FP-tree frequency. The frequent items should be only those items that have a greater or equal frequency in the FP-tree, because association rules are only concerned with items that satisfy the minimum support. In the case where the support is greater than the FP-tree frequency, the frequency of the frequent itemsets is the sum of the FP-tree frequency and the Stable count. The main reason for adding the Stable count is that the FP-tree represents the most frequent items, so some elements (spare items) are stored in Stable even though they occur frequently in the main transaction dataset. The frequent items should be all intermediate nodes of the path up to the most frequent item node, since all the nodes on the same path represent relations between nodes, and for strong association rules all possible relations need to be counted. In the last case, where the support is less than the FP-tree frequency, the frequency of the frequent itemset is the FP-tree frequency. As the support is less than the frequency, there is no need to add the Stable count. The frequent itemsets involve only the immediate parent nodes in the FP-tree, because every parent node in the improved FP-tree has a greater frequency than its child node.
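The three cases described above can be summarized in a short sketch. It is only a literal transcription of the rule as stated in the text, not the authors' implementation; how the Stable count and the improved FP-tree are actually built is not specified here, so the function names, parameter names, and return labels are all hypothetical.

```python
def mfi_frequency(support, fp_frequency, stable_count):
    """Frequency assigned to the frequent itemsets of one item.

    support       -- user-defined support S for the item
    fp_frequency  -- frequency F of the item in the (improved) FP-tree
    stable_count  -- frequency C of the item in the Stable count
    """
    if support > fp_frequency:
        # Support exceeds the tree frequency: add the Stable count.
        return fp_frequency + stable_count
    # Support equal to or less than the tree frequency: use it as is.
    return fp_frequency


def mfi_combination_scope(support, fp_frequency):
    """Which FP-tree nodes the item is combined with, per the three cases."""
    if support == fp_frequency:
        return "nodes with higher frequency"
    if support > fp_frequency:
        return "intermediate nodes up to the most frequent item node"
    return "immediate parent node"


# Illustrative values only.
equal_case = mfi_frequency(support=3, fp_frequency=3, stable_count=2)
greater_case = mfi_frequency(support=5, fp_frequency=3, stable_count=2)
lesser_case = mfi_frequency(support=2, fp_frequency=3, stable_count=2)
```

Only the case where the support exceeds the FP-tree frequency draws on the Stable count; the other two cases differ solely in which tree nodes take part in the generated combinations.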
Experimental Results

CONCLUSIONS

In this paper, database scans, candidate set generation, conditional FP-trees, and conditional pattern base consumption are considered as the factors for comparing the algorithms and for finding the frequent itemsets. The performance of the algorithms depends strongly on the support levels and on the features of the data sets. It is found that, for a transactional database in which many transaction items are repeated many times as a superset, the Mining Frequent Itemset algorithm is best suited for mining frequent itemsets. In the Apriori algorithm, more database scans are needed to generate the candidate sets. In the FP-Tree algorithm, items are organized as a tree for finding frequent itemsets, and the complete set of frequent items is generated. The MFI (Mining Frequent Itemset) algorithm generates the association rules without candidate generation.
REFERENCES

[1] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. Int'l Conf. Very Large Data Bases (VLDB), Sept. 1995, pages 432-443.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proc. of the 1993 ACM SIGMOD Conference, Washington DC, USA.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. Int'l Conf. Very Large Data Bases (VLDB), Sept. 1994, pages 487-499.
[4] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. In Proc. ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD), May 1997, pages 255-264.
[5] C. Borgelt. An Implementation of the FP-growth Algorithm. Proc. Workshop Open Software for Data Mining, pages 1-5. ACM Press, New York, NY, USA, 2005.
[6] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD), 2000.
[7] J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In Proc. ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD), San Jose, CA, May 1995, pages 175-186.
[8] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-mine: Hyper-structure mining of frequent patterns in large databases. In Proc. Int'l Conf. Data Mining (ICDM), November 2001.
[9] C. Borgelt. Efficient Implementations of Apriori and Eclat. In Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations, CEUR Workshop Proceedings 90, Aachen, Germany, 2003.
[10] H. Toivonen. Sampling large databases for association rules. In Proc. Int'l Conf. Very Large Data Bases (VLDB), Sept. 1996, Bombay, India, pages 134-145.
[11] Yiwu Xie, Yutong Li, Chunli Wang, and Mingyu Lu. The Optimization and Improvement of the Apriori Algorithm. In Proc. Int'l Workshop on Education Technology and Training & International Workshop on Geoscience and Remote Sensing, 2008.
[12] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2006.
[13] S. P. Latha and N. Ramaraj. Algorithm for Efficient Data Mining. In Proc. IEEE Int'l Conf. on Computational Intelligence and Multimedia Applications, 2007, pp. 66-70.
[14] Q. Lan, D. Zhang, and B. Wu. A New Algorithm For Frequent Itemsets Mining Based On Apriori And FP-Tree. In Proc. Int'l Conf. on Global Congress on Intelligent System, 2009, pp. 360-364.
[15] W. Liu, J. Chen, S. Qu, and W. Wan. An Improved Apriori Algorithm. In Proc. IEEE International Conference, 2008, pp. 221-224.
[16] M. El-Hajj and O. R. Zaiane. Inverted matrix: Efficient discovery of frequent items in large datasets in the context of interactive mining. In Proc. Int'l Conf. on Data Mining and Knowledge Discovery (ACM SIGKDD), August 2003.
[17] M. El-Hajj and O. R. Zaiane. COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation. Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, USA, CEUR Workshop Proceedings, vol. 90, pp. 112-119, 2003.
[18] Y. G. Sucahyo and R. P. Gopalan. "CT-ITL: Efficient Frequent Item Set Mining Using a Compressed Prefix Tree with Pattern Growth". Proceedings of the 14th Australasian Database Conference, Adelaide, Australia, 2003.
