 Open Access
 Total Downloads : 192
 Authors : Rajeev Kumar Gupta, Prof. Roshni Dubey
 Paper ID : IJERTV2IS60539
 Volume & Issue : Volume 02, Issue 06 (June 2013)
Published (First Online): 19-06-2013
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
An Improved Association Rule Mining With Correlation Technique
Rajeev Kumar Gupta Prof. Roshni Dubey
SRIT Jabalpur SRIT Jabalpur
Abstract — Constructing classifiers that work accurately and perform efficiently on large databases is one of the key tasks of data mining [17] [18]. Moreover, a training dataset typically produces a massive number of rules, and it is very hard to store, retrieve, prune, and sort such a huge number of rules efficiently before applying them to a classifier [1]. In this situation the FP approach is a good choice, but its problem is that it generates redundant conditional FP-trees. A frequent-pattern tree (FP-tree) is a type of prefix tree that allows the detection of frequent itemsets without candidate itemset generation [14]; it is intended to overcome the flaws of earlier mining methods and follows the divide-and-conquer strategy. In this paper we adopt the idea of [17] to deal with large databases: we integrate the concept of positive and negative association rule mining with frequent-pattern (FP) based classification. Our method performs well and produces unique rules without ambiguity.
Keywords — Association, FP, FP-tree, Negative, Positive.

INTRODUCTION
Mining with association rules discovers appealing links or relationships among item sets in huge amounts of data [4]. Association mining uses various techniques, such as Apriori and frequent-pattern rules. Although Apriori employs pruning while generating itemsets, it examines the whole transaction database on every scan, so scanning speed gradually decreases as the data size grows [4].
The second well-known algorithm is the Frequent Pattern (FP) growth algorithm, which takes a divide-and-conquer approach: it computes the frequent items and organizes them in a frequent-pattern tree.
Compared with the Apriori algorithm, FP-growth is much more efficient [13], but the problem with traditional FP-growth is that it produces a huge number of conditional FP-trees [3].
Constructing classifiers that work accurately and perform efficiently on large databases is one of the key tasks of data mining [17] [18]. Moreover, a training dataset typically produces a massive number of rules, and it is very hard to store, retrieve, prune, and sort such a huge number of rules efficiently before applying them to a classifier [1]. To eliminate such problems, the authors of [17] proposed a new method based on the positive and negative concept of association rule mining; they argue that customary classification techniques are based only on positive association rules and ignore the value of negative association rules.
In this paper we adopt the same idea as [17] to deal with large databases: we integrate the positive and negative rule mining concept with frequent-pattern (FP) based classification. Our method performs well and produces unique rules without ambiguity.
The rest of the paper is organized as follows. Section 2 gives background on the association data mining technique and explores the ideas of FP-trees and positive/negative rule theory. Section 3 discusses previous work in the same field. Section 4 describes the proposed method and the algorithms adopted. Section 5 presents the results obtained by the proposed method, and Section 6 concludes the paper.

BACKGROUNDS & RELATED TERMINOLOGY

Association
Association rules were proposed by Rakesh Agrawal [1]; they use "if-then" rules to express extracted information in the form of transaction statements [3]. Such rules are created from the dataset and obtained with the help of the support and confidence of each rule, which describe the rate (frequency) of occurrence of the rule.
According to [2], association mining can be stated as follows. Let I = {i1, i2, ..., in} be a set of items, and let D = {T1, T2, ..., Tj, ..., Tm}, the task-relevant data, be a set of transactions in a database, where each transaction Tj (j = 1, 2, ..., m) satisfies Tj ⊆ I. Each transaction is assigned an identifier, called a TID (transaction id). Let A be a set of items; a transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊆ I, B ⊆ I and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B); this is taken to be the probability P(A ∪ B). The rule has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B; this is taken to be the conditional probability P(B|A). That is,
confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A) = c,
support(A ⇒ B) = P(A ∪ B) = s.
Popular association rule mining aims to find strong association rules that satisfy both a user-specified minimum support threshold and minimum confidence threshold (min_support and min_confidence). If support(X) ≥ min_support, X is a frequent itemset; a frequent k-itemset is denoted Lk. If support(A ⇒ B) ≥ min_support and confidence(A ⇒ B) ≥ min_confidence, then A ⇒ B is a strong rule. Several theorems follow:

If A ⊆ B, then support(A) ≥ support(B).

If A ⊆ B and A is a non-frequent itemset, then B is a non-frequent itemset.

If A ⊆ B and B is a frequent itemset, then A is a frequent itemset.
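These definitions can be made concrete with a small sketch (the transactions and item names below are invented for illustration, not taken from the paper's dataset):

```python
# A toy illustration of support and confidence, and of the anti-monotone
# property: if A is a subset of B, then support(A) >= support(B).

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support(A u B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s_a = support({"bread"}, transactions)             # 0.75
s_ab = support({"bread", "milk"}, transactions)    # 0.50
c = confidence({"bread"}, {"milk"}, transactions)  # 0.50 / 0.75 = 2/3
```

Here {bread} ⊆ {bread, milk}, and indeed support({bread}) ≥ support({bread, milk}), matching the first theorem above.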


Frequent Pattern (FP) Tree
A frequent-pattern tree (FP-tree) is a type of prefix tree
[3] that allows the detection of frequent itemsets without candidate itemset generation [14]. It is intended to overcome the flaws of earlier mining methods. FP-trees follow the divide-and-conquer strategy. The root of the FP-tree is labeled NULL; the children of the root are sets of data items. Conventionally, an FP-tree node contains three fields: item-name, node-link and count. To avoid numerous conditional FP-trees during mining, the author of [3] proposed a new association rule mining technique using an improved FP-tree with a table concept, in conjunction with a mining-frequent-itemset (MFI) method, to eliminate redundant conditional FP-trees.

Positive and Negative FP Rule Mining
The author of [15] cleverly explains the concept of positive and negative association rules. According to [15], two indicators are used to decide the positive or negative character of a rule. First, the correlation is found according to the value of
corr(P, Q) = s(P ∪ Q) / (s(P) · s(Q)),
which is used to delete contradictory association rules that emerge during the mining process.
There are three possible cases for corr(P, Q) [16]:

If corr(P, Q) > 1, then P and Q are positively correlated;

If corr(P, Q) = 1, then P and Q are independent of each other;

If corr(P, Q) < 1, then P and Q are negatively correlated.


Support and confidence are the two important indicators for measuring both positive and negative association rules.
Itemsets that meet the user-specified minimum support (min_support) are called frequent itemsets, and association rule mining concentrates on finding, among the frequent itemsets, the rules that satisfy the user-specified minimum confidence level (min_conf).
Negative association rules contain non-existing items (for example ¬P, ¬Q), so direct calculation of their support and confidence is more difficult.
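One standard way around this difficulty, used in the negative-rule literature, is to express the measures of a negative rule through positive supports via the identities s(¬Q) = 1 − s(Q) and s(P ∧ ¬Q) = s(P) − s(P ∪ Q). A minimal sketch, with invented numbers:

```python
# Measures of a negative rule P => not-Q computed from positive supports
# only, via s(P and not-Q) = s(P) - s(P and Q).  The numbers are invented.

def neg_rule_measures(s_p, s_pq):
    """Return (support, confidence) of the rule P => not-Q,
    given s_p = sup(P) and s_pq = sup(P and Q)."""
    support = s_p - s_pq          # transactions containing P but not Q
    confidence = support / s_p
    return support, confidence

sup_n, conf_n = neg_rule_measures(s_p=0.6, s_pq=0.12)   # 0.48, 0.8
```

No extra database scan is needed: both quantities come from supports that positive-rule mining already computes.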


LITERATURE SURVEY
Data mining is used to deal with the size of data stored in databases and to extract the desired information and knowledge [3]. Data mining has various techniques for performing data extraction, and the association technique is the most effective among them. It discovers hidden or desired patterns in large amounts of data and finds correlation relationships among different data attributes over a large set of items in a database. Since its introduction, this method has gained a lot of attention. The author of [3] analyzed that association analysis [1] [5] [6] [7] is the discovery of hidden patterns or clauses that occur repeatedly together in a supplied data set; association rules find relations and connections among the given data and data sets.
An association rule [1] [5] [8] [9] is a law that expresses certain relationships among objects or items; such association rules are calculated from the data with the help of the concept of probability.
Association mining using the Apriori algorithm performs well, but on a large database it is slow because it must scan the full database on every pass over the transactions, as surveyed in [4].
The author of [3] surveyed previous research on data mining using association rules, covering previously proposed algorithms such as Apriori [10], DHP [11], and FP-growth [12].
Apriori [6] employs a bottom-up, breadth-first approach to discover large itemsets; the problem with this algorithm is that it cannot be applied directly to mine complex data [3]. The second well-known algorithm is the Frequent Pattern (FP) growth algorithm, which takes a divide-and-conquer approach and computes the frequent items into a frequent-pattern tree.
Compared with Apriori, FP-growth is much more efficient [13], but traditional FP-growth produces a huge number of conditional FP-trees [3].

IMPROVED ASSOCIATION RULE MINING WITH CORRELATION TECHNIQUE
Existing work is based on the Apriori algorithm for finding frequent patterns to generate association rules and then applying class-label association rules; this work instead uses an FP-tree with FP-growth to find the frequent patterns. The Apriori algorithm takes more time on large data sets, whereas FP-growth finds frequent patterns in transactions time-efficiently.
In this paper we propose a new dimension in the data mining technique: we integrate the concept of positive and negative association rules into the frequent-pattern (FP) method. Negative and positive rules work better than traditional association rule mining, and FP handles large databases well. Our proposed method works as follows.

Positive and Negative class association rules based on FP tree
This algorithm has two stages: rule generation and classification. In the first stage, the algorithm calculates the whole set of positive and negative class association rules such that sup(R) ≥ min_support and conf(R) ≥ min_confidence for the given thresholds. Furthermore, the algorithm prunes some contradictory rules and selects only a subset of high-quality rules for classification.
In the second stage, classification, for a given data object the algorithm extracts the subset of rules found in the first stage that match the data object and predicts the class label of the data object by analyzing this subset of rules.

Generating Rules
To find rules for classification, the algorithm first mines the training dataset to find the complete set of rules passing certain support and confidence thresholds. This is a typical frequent-pattern or association rule mining task. The algorithm adopts the FP-Growth method to find frequent itemsets; FP-Growth is a fast frequent-itemset mining algorithm. The algorithm also uses the correlation between itemsets to find positive and negative class association rules. The correlation between itemsets can be defined as:
corr(X, Y) = sup(X ∪ Y) / (sup(X) · sup(Y)), where X and Y are itemsets.
When corr(X, Y) > 1, X and Y have positive correlation. When corr(X, Y) = 1, X and Y are independent.
When corr(X, Y) < 1, X and Y have negative correlation. Also, when corr(X, Y) > 1, we can deduce that corr(X, ¬Y) < 1 and corr(¬X, Y) < 1.
So, we can use the correlation between an itemset X and a class label ci to judge the class association rules.
When corr(X, ci) > 1, we can deduce that there exists the positive class association rule X ⇒ ci.
When corr(X, ci) < 1, we can deduce that there exists the negative class association rule X ⇒ ¬ci.
So, the first step is to generate all the frequent itemsets by making multiple passes over the data. In the first pass, it counts the support of individual itemsets and determines whether it is frequent. In each subsequent pass, it starts with the seed set of itemsets found to be frequent in the previous pass. It uses this seed set to generate new possibly frequent itemsets, called candidate itemsets. The actual supports for these candidate itemsets are calculated during the pass over the data. At the end of the pass, it determines which of the candidate itemsets are actually frequent.
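The level-wise generate-and-count procedure just described can be sketched as follows (a generic candidate-generation loop for illustration only; the algorithm's actual frequent-itemset mining uses the FP-Growth method given next):

```python
from itertools import combinations

# Level-wise frequent-itemset generation: seed with frequent 1-itemsets,
# then repeatedly build candidate (k+1)-itemsets from the frequent
# k-itemsets of the previous pass and keep those meeting min_sup.

def frequent_itemsets(transactions, min_sup):
    n = len(transactions)
    sup = lambda s: sum(s <= t for t in transactions) / n
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup]
    result = list(current)
    while current:
        # candidate itemsets: unions of two frequent sets one item apart
        seeds = {a | b for a, b in combinations(current, 2)
                 if len(a | b) == len(a) + 1}
        current = [c for c in seeds if sup(c) >= min_sup]
        result.extend(current)
    return result
```

By the anti-monotone property, any itemset with a non-frequent subset never appears among the seeds, which is what makes the pass-by-pass pruning sound.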
The algorithm for generating frequent itemsets is as follows:

Definition FPtree: A frequentpattern tree (or FP tree) is a tree structure defined below.

It consists of one root labeled as null, a set of itemprefix subtrees as the children of the root, and a frequentitemheader table.

Each node in the item-prefix subtree consists of three fields: item-name, count, and node-link, where item-name registers which item this node represents, count registers the number of transactions represented by the portion of the path reaching this node, and node-link links to the next node in the FP-tree carrying the same item-name, or null if there is none.

Each entry in the frequentitemheader table consists of two fields, (1) itemname and (2) head of nodelink (a pointer pointing to the first node in the FPtree carrying the itemname).
Based on this definition, we have the following FPtree construction algorithm.


Algorithm for FPtree construction
Input: A transaction database DB and a minimum support threshold.
Output: FPtree, the frequentpattern tree of DB.
Method: The FPtree is constructed as follows.

Scan the transaction database DB once. Collect F, the set of frequent items, and the support of each frequent item. Sort F in support descending order as FList, the list of frequent items.

Create the root of an FP-tree, T, and label it as null. For each transaction Trans in DB do the following:

Select the frequent items in Trans and sort them according to the order of FList. Let the sorted frequent-item list in Trans be [p | P], where p is the first element and P is the remaining list. Call insert_tree([p | P], T).

The function insert_tree([p | P], T) is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N with its count initialized to 1, its parent link linked to T, and its node-link linked to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.
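The insert_tree procedure can be sketched as follows (the class and field names are illustrative, chosen to mirror the item-name, count, and node-link fields of the definition above):

```python
class FPNode:
    """FP-tree node with item-name, count, parent link and children index."""
    def __init__(self, item=None, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}            # item-name -> child FPNode

def insert_tree(items, node, header):
    """Insert one sorted frequent-item list [p | P] under `node`."""
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child is None:                 # no child N with N.item-name = p
        child = FPNode(p, node)
        node.children[p] = child
        header.setdefault(p, []).append(child)   # node-link structure
    child.count += 1
    insert_tree(rest, child, header)

# Build a tiny tree from two transactions already sorted in FList order.
root, header = FPNode(), {}
insert_tree(["f", "c", "a", "m"], root, header)
insert_tree(["f", "c", "b"], root, header)
# shared prefix f->c carries count 2; branches a->m and b carry count 1
```

The header table keeps, per item, the list of nodes carrying that item-name, which is what the mining phase later traverses via node-links.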

Then, the next step is to generate positive and negative class association rules. The algorithm first finds the rules contained in F that satisfy the min_sup and min_conf thresholds; then it determines whether each rule belongs to the set of positive class correlation rules P_AR or to the set of negative class correlation rules N_AR.

The algorithm for generating positive and negative class association rules is as follows:


Algorithm for generating positive and negative class association rules
Input: training dataset T, min_sup, min_conf
Output: P_AR, N_AR

(1) P_AR = NULL; N_AR = NULL;
(2) for (any frequent itemset X in F and ci in C)
{
    if (sup(X ∪ ci) > min_sup and conf(X ⇒ ci) > min_conf)
    {
        if (corr(X, ci) > 1)
        {
            P_AR = P_AR ∪ {X ⇒ ci};
        }
        else if (corr(X, ci) < 1)
        {
            N_AR = N_AR ∪ {X ⇒ ¬ci};
        }
    }
}
(3) return P_AR and N_AR;
In this algorithm, the FP-Growth method generates the set of frequent itemsets F. In F, some itemsets pass the given support and confidence thresholds, and the correlation between itemsets and class labels is used as the criterion to judge whether a rule is positive. Lastly, P_AR and N_AR are returned.
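The partition into P_AR and N_AR can be sketched as follows (rule candidates are represented here as hypothetical tuples of precomputed supports; a real implementation would take these values from the FP-Growth output):

```python
# Partition candidate class association rules into positive (P_AR) and
# negative (N_AR) sets via corr(X, c) = sup(X u c) / (sup(X) * sup(c)).
# The candidate tuples below are hypothetical precomputed supports.

def partition_rules(candidates, min_sup, min_conf):
    """Each candidate is (itemset, class_label, sup(X), sup(c), sup(X u c))."""
    p_ar, n_ar = [], []
    for itemset, label, s_x, s_c, s_xc in candidates:
        if s_xc < min_sup or s_xc / s_x < min_conf:
            continue                          # fails the thresholds
        corr = s_xc / (s_x * s_c)
        if corr > 1:
            p_ar.append((itemset, label))     # positive rule X => c
        elif corr < 1:
            n_ar.append((itemset, label))     # negative rule X => not-c
    return p_ar, n_ar

cands = [
    (("a", "b"), "c1", 0.4, 0.5, 0.35),  # corr = 1.75 -> positive
    (("d",), "c2", 0.5, 0.8, 0.30),      # corr = 0.75 -> negative
]
p_ar, n_ar = partition_rules(cands, min_sup=0.2, min_conf=0.5)
```

Candidates with corr exactly 1 fall into neither set, matching the pseudocode, since an independent itemset and class label support no rule in either direction.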




Classification
After P_AR and N_AR are selected for classification, the algorithm is ready to classify new objects. Given a new data object, the algorithm collects the subset of rules matching the new object. In this section, we discuss how to determine the class label based on the subset of rules.
First, the algorithm finds all the rules matching the new object and generates the PL set, which includes all the matching positive rules from P_AR sorted by descending support. The algorithm also generates the NL set, which includes all the matching negative rules from N_AR, likewise sorted by descending support. Second, the algorithm compares the positive rules in PL with the negative rules in NL and decides the class label of the data object.
The classification algorithm is as follows:
a) Algorithm for classification
Input: data object, P_AR, N_AR
Output: the class label Cd of the data object

(1) PL = Sort(P_AR); NL = Sort(N_AR); i = j = 1;
(2) p_rule = GetElem(PL, i); n_rule = GetElem(NL, j);
(3) while (i <= PL_Length and j <= NL_Length)
{
    if (RuleCompare(p_rule, n_rule))
    {
        if (p_rule > n_rule)
        {
            Cd = the label of p_rule; break;
        }
        if (p_rule = n_rule)
        {
            Cd = the label of p_rule; break;
        }
        if (p_rule < n_rule)
        {
            j++;
        }
    }
    if (!RuleCompare(p_rule, n_rule))
    {
        if (p_rule > n_rule)
        {
            Cd = the label of p_rule; break;
        }
        if (p_rule = n_rule)
        {
            i++; j++;
        }
        if (p_rule < n_rule)
        {
            i++;
        }
    }
}
(4) return Cd;
In the classification algorithm, the function Sort(P_AR) returns PL with its itemsets sorted by descending support values, and the function GetElem(PL, i) returns the i-th rule in PL; the functions Sort(N_AR) and GetElem(NL, j) behave analogously.
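The comparison loop can be sketched as follows (rules are reduced to (support, label) pairs, and the tie-breaking is a simplified reading of the pseudocode above, not the paper's exact implementation):

```python
# Classify by comparing matched positive and negative rules, each list
# sorted by descending support.  Rules are reduced to (support, label)
# pairs; the full rule bodies are omitted for brevity.

def classify(pos_rules, neg_rules):
    pl = sorted(pos_rules, key=lambda r: r[0], reverse=True)
    nl = sorted(neg_rules, key=lambda r: r[0], reverse=True)
    i = j = 0
    while i < len(pl) and j < len(nl):
        p_sup, p_label = pl[i]
        n_sup, _ = nl[j]
        if p_sup >= n_sup:
            return p_label            # positive rule wins or ties
        j += 1                        # stronger negative rule: skip it
    return pl[0][1] if pl else None   # fall back to the top positive rule

label = classify([(0.6, "yes"), (0.4, "no")], [(0.7, "no"), (0.3, "yes")])
```

In the example, the strongest negative rule (support 0.7) outranks the strongest positive one (0.6), so the loop advances past it; the positive rule then beats the next negative rule (0.3) and its label is returned.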


RESULTS AND PERFORMANCE MEASUREMENT
The proposed enhanced FP system with positive and negative rules has been implemented using Java technologies. The following results have been measured by the system.

Settings
File name = data.num
Support (default 20%) = 20.0
Confidence (default 80%) = 80.0
Reading input file: data.num
Number of records = 95
Number of columns = 38
Min support = 19.0 (records)
Generation time = 0.0 seconds (0.0 mins)
FP tree storage = 2192 (bytes)
FP tree updates = 694
FP tree nodes = 97

FP Tree
(1) 9:90 (ref to null)
(1.1.1.1.1) 1:72 (ref to 1:4)
(1.1.1.1.1.1) 32:65 (ref to 32:3)
And so on.

GENERATING ARs:
Generation time = 0.17 seconds (0.0 mins)
T-tree storage = 8824 (bytes)
Number of frequent sets = 626
[1] {9} = 90
[2] {19} = 90
[3] {19 9} = 85
[4] {23} = 90
And so on (approximately 624 generated).

ASSOCIATION RULES
(1) {1 32 5} ⇒ {19} 100.0%
(2) {9 1 32 5} ⇒ {19} 100.0%
.
.
.
(102) {9 23 32 14 37} ⇒ {27} 100.0%
(103) {9 27 32 14 37} ⇒ {23} 100.0%
(104) {9 32 14 37} ⇒ {23 27} 100.0%
And so on (approximately 7855 generated).
E. Positive Class Itemset Rules
{9 27 1 32 14} ⇒ {19}
{9 27 1 32 14} ⇒ {23}
{19 27 1 32 14} ⇒ {23}
{9 19 27 1 32 14} ⇒ {23}
{19 23 27 1 32 14} ⇒ {9}
And so on.

F. Negative Class Itemset Rules
{9 14 37} ⇒ ¬{23 27}
{9 19 23 14 37} ⇒ ¬{27}
{9 19 14 37} ⇒ ¬{23 27}
{9 1 14 37} ⇒ ¬{23}
{9 19 1 14 37} ⇒ ¬{23}
And so on.
The results show that the proposed system works more efficiently than the existing positive-and-negative approach using the Apriori technique. We have evaluated that it can handle very large data sets and mine them efficiently; a current experiment shows that it can handle 129941 KB of data. This figure was chosen by us, and the system can handle and generate even more mined data.


CONCLUSION

In this paper we have proposed a new hybrid approach for the data mining process. Data mining has been a focus of research over the last decade due to the enormous amount of data and information in the modern day, and association is a hot topic among the various data mining techniques. In this article we proposed a hybrid approach to deal with large data: the proposed system enhances the frequent-pattern (FP) association technique by integrating positive and negative rules into it. The traditional FP method performs well but generates redundant trees, which degrades efficiency; generating positive and negative rules helps achieve better efficiency in association mining, and this concept has been applied in the proposed method. Results show that the proposed method performs well and can handle very large data sets.
REFERENCES

Agrawal R., Imielinski T., Swami A., "Mining Association Rules between Sets of Items in Large Databases," Proc. of the ACM SIGMOD International Conference on Management of Data, Washington DC, 1993, pp. 207-216.

QI Zhenyu, XU Jing, GONG Dawei and TIAN He, "Traversing Model Design Based on Strong-association Rule for Web Application Vulnerability Detection," IEEE International Conference on Computer Engineering and Technology, 2009.

J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, 2001.

Ashok Savasere, E. Omiecinski and Shamkant Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.

Arun K. Pujari, Data Mining Techniques, Universities Press (India) Pvt. Ltd., 2001.

J. S. Park, M.-S. Chen and P. S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proceedings of the ACM SIGMOD, San Jose, CA, May 1995, pp. 175-186.

S. Brin, R. Motwani, J. D. Ullman and S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proceedings of the ACM SIGMOD, Tucson, AZ, May 1997, pp. 255-264.

J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proceedings of the ACM SIGMOD, Dallas, TX, May 2000, pp. 1-12.

Qin Ding and Gnanasekaran Sundaraj, "Association Rule Mining from XML Data," Proceedings of the Conference on Data Mining, DMIN'06.

C. Silverstein, S. Brin, and R. Motwani, "Beyond Market Baskets: Generalizing Association Rules to Dependence Rules," Data Mining and Knowledge Discovery, 2(1), 1998, pp. 39-68.

Yanguang Shen, Jie Liu and Jing Shen, "The Further Development of Weka Base on Positive and Negative Association Rules," IEEE International Conference on Intelligent Computation Technology and Automation (ICICTA), 2010.

Yanguang Shen, Jie Liu, Fangping Li, "Application Research on Positive and Negative Association Rules Oriented Software Defects," 2009 International Conference on Computational Intelligence and Software Engineering (CISE 2009), Wuhan, China, December 11-13, 2009.

Luo Junwei and Luo Huimin, "Algorithm for Classification Based on Positive and Negative Class Association Rules," 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), 2010.

Wenmin Li, Jiawei Han and Jian Pei, "CMAR: Accurate and Efficient Classification Based on Multiple Class Association Rules," Proceedings of the 2001 IEEE International Conference on Data Mining, IEEE Press, Dec. 2001, pp. 123-131.
A.B.M. Rezbaul Islam and Tae-Sun Chung, "An Improved Frequent Pattern Tree Based Association Rule Mining Technique," IEEE International Conference on Information Science and Applications (ICISA), 2011.

LUO Xian-Wen and WANG Wei-Qing, "Improved Algorithms Research for Association Rule Based on Matrix," IEEE International Conference on Intelligent Computing and Cognitive Informatics, 2010.

R. Srikant, Quoc Vu and R. Agrawal, "Mining Association Rules with Item Constraints," IBM Research Centre, San Jose, CA 95120, USA.

R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proceedings of the VLDB Conference, pp. 487-499, Santiago, Chile, 1994.