 Open Access
 Authors : Manish K. Gupta, Rupak Kumar, Nitin K. Singh
 Paper ID : IJERTV2IS50723
 Volume & Issue : Volume 02, Issue 05 (May 2013)
Published (First Online): 21-05-2013
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Generating Effective Pattern from Skewed DataSet
1. Manish K. Gupta, M.Tech Scholar, AFSET, Faridabad, India
2. Rupak Kumar, Sr. Lecturer, AFSET, Faridabad, India
3. Nitin K. Singh, IFTM University, Moradabad, India
Abstract
Most existing algorithms rely mainly on the spatial position of data items and therefore generate patterns that are of little use, which makes them ineffective for mining interesting patterns. This problem persists in conventional algorithms because they have no predefined knowledge of the affinity among the data items. In this work a solution is proposed that is based entirely on the strong affinity between data items. The proposed algorithm finds optimized patterns, i.e. interesting patterns, using a minimum threshold value defined by the user, and applies this threshold to extract the optimized patterns from the item sets.
Keywords: Support, Candidate pattern, o-conf, Maximal optimized pattern, Data items.

1. Introduction
A skewed dataset is one consisting of a large number of data items that belong to widely different levels of support. If conventional clustering algorithms are used to mine associated patterns from such a dataset, they are not effective. An example of a skewed dataset is the large range of commodities in a shopping mall: these commodities may belong to entirely different price levels or to the same price level. Most conventional clustering algorithms rely entirely on a support-based pruning strategy, and on highly skewed data this strategy proves ineffective for the following two reasons.

If the minimum threshold value is lowered, the number of extracted patterns increases. Many of these are spurious patterns containing data items from different support levels. Such spurious patterns are called cross-support patterns, and the data items they contain are only weakly correlated with each other. Moreover, using a lower minimum threshold also increases the computational and memory requirements substantially.

If, conversely, a large value is used for the minimum threshold, there is a high chance that interesting patterns whose support is less than the minimum threshold will be missed.
These problems lead to the need for a measure that can mine at substantially lower support levels while removing spurious associations among items with substantially different support levels.
In general, clustering algorithms have no knowledge of the affinity between data items. They simply follow their own notion of similarity and consequently prune patterns into clusters whose data items often have little affinity with one another.
The proposed solution is to have a measure that can efficiently identify useful patterns even at low levels of support and can be used to automatically remove spurious patterns during the association mining process. To achieve this goal, the cross-support property is used to eliminate spurious patterns containing data items with substantially different support levels.

Thus, to solve this problem, optimized clustering is an approach. In this approach, patterns are preserved such that the data items belonging to a particular pattern always belong to a particular cluster. Optimized patterns are patterns whose data items have high affinity with each other. High affinity here means that the presence of any data item in an optimized pattern strongly implies the presence of every other data item belonging to that same optimized pattern.

2. Technique for maximal optimized patterns
As we already know, clustering is the process of assigning objects to clusters such that objects or data items belonging to the same cluster have high similarity, or high affinity, with each other; it thus provides patterns for the purpose of discovering knowledge. If some of the data items belonging to a particular cluster move to another cluster because of the clustering algorithm being used, it becomes very difficult to gather knowledge from such clusters. Furthermore, it is much easier to gather knowledge from well-understood patterns than to interpret the data items directly.

The optimized confidence (or o-confidence) of an itemset I = {i1, i2, ..., im}, denoted o-conf(I), is a measure that reflects the overall affinity among the items within the itemset. It is defined as min{conf{i1 → i2, ..., im}, conf{i2 → i1, i3, ..., im}, ..., conf{im → i1, i2, ..., im-1}}, where conf is the conventional definition of association rule confidence.
The scope of o-confidence can be understood with the help of the following example. Consider an itemset I = {Bread, Butter, Milk}. Assume that supp({Bread}) = 0.1, supp({Butter}) = 0.1, supp({Milk}) = 0.06, and supp({Bread, Butter, Milk}) = 0.06, where supp is the support of an itemset. Then
conf{Bread → Butter, Milk} = supp({Bread, Butter, Milk}) / supp({Bread}) = 0.6
conf{Butter → Bread, Milk} = supp({Bread, Butter, Milk}) / supp({Butter}) = 0.6
conf{Milk → Bread, Butter} = supp({Bread, Butter, Milk}) / supp({Milk}) = 1
Hence, o-conf(I) = min{conf{Bread → Butter, Milk}, conf{Butter → Bread, Milk}, conf{Milk → Bread, Butter}} = 0.6.
A candidate pattern drawn from the itemset I is an optimized pattern if and only if o-conf(I) >= Tc, where Tc is the minimum threshold confidence provided by the user. Further, if for any optimized pattern there exists a subset that is itself an optimized pattern, that subset pattern should be removed from the set of all optimized patterns. The reason for this is the property of all-confidence [10].
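The o-confidence computation above can be sketched in a few lines of code. The following is a minimal illustration (the function name and data layout are assumptions, not from the paper): it takes a table of support values and reproduces the numbers from the Bread/Butter/Milk example.

```python
def o_conf(itemset, supp):
    """o-confidence of an itemset: supp(itemset) divided by the largest
    single-item support, which equals the minimum confidence over the
    rules {i} -> itemset minus {i}."""
    items = frozenset(itemset)
    return supp[items] / max(supp[frozenset([i])] for i in items)

# Support values taken from the example above.
supp = {
    frozenset(["Bread"]): 0.1,
    frozenset(["Butter"]): 0.1,
    frozenset(["Milk"]): 0.06,
    frozenset(["Bread", "Butter", "Milk"]): 0.06,
}
print(o_conf(["Bread", "Butter", "Milk"], supp))  # ~0.6, up to floating point
```

Only the joint support and the single-item supports are needed, which is why the measure adds no extra scan over the data.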
2.1 Properties of the o-confidence measure
The o-confidence measure has four important properties, namely the anti-monotone property, the cross-support property, the strong affinity property and the all-confidence property.
2.1.1 Anti-Monotone
The o-confidence measure possesses the anti-monotone property. This property states that if, for a pattern P, the value of o-confidence is greater than the threshold value Tc, then for all subsets of P the value of o-confidence will also be greater than Tc. How does the o-confidence measure use this property? This can be explained with the following example. Suppose supp({milk}) = 0.4, supp({sugar}) = 0.6, supp({milk, sugar}) = 0.3, and the minimum o-confidence threshold is 0.6. Then the o-confidence of the candidate pattern {milk, sugar} is supp({milk, sugar}) / max{supp({milk}), supp({sugar})} = 0.3 / 0.6 = 0.5, which is less than the minimum o-confidence of 0.6. Thus the candidate pattern {milk, sugar} is not an optimized pattern. Moreover, all candidate patterns having {milk, sugar} as a subset are pruned; for example, {milk, sugar, biscuit} is not an optimized pattern. Note that the pruning here is done on the basis of the o-confidence threshold: if the threshold were reduced to 0.45, then {milk, sugar} would be an optimized pattern.
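The pruning step this property enables can be sketched as follows. This is an illustrative fragment, not the paper's code: once a candidate pattern falls below the o-confidence threshold, every superset is discarded without evaluating its o-confidence.

```python
def prune_supersets(candidates, failed):
    """Anti-monotone pruning: a candidate containing a pattern that already
    fell below the o-confidence threshold cannot itself pass, so it is
    discarded without being evaluated."""
    return [c for c in candidates if not any(f <= c for f in failed)]

# {milk, sugar} failed the 0.6 threshold (o-conf = 0.3 / 0.6 = 0.5),
# so {milk, sugar, biscuit} is pruned as well.
failed = [frozenset({"milk", "sugar"})]
candidates = [frozenset({"milk", "sugar", "biscuit"}),
              frozenset({"milk", "biscuit"})]
print(prune_supersets(candidates, failed))  # keeps only {milk, biscuit}
```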

2.1.2 Strong Affinity
The o-confidence measure also possesses the strong affinity property: it ensures that all the data items contained in a pattern have strong affinity, i.e. strong association, with each other. This can be understood from the following consideration. Suppose the o-confidence of an itemset D is 90%. Then, whenever any data item belonging to D occurs in a transaction, there is at least a 90% chance that the remaining data items of D also occur in the same transaction.

2.1.3 Cross-support patterns
The o-conf measure helps minimize cross-support patterns, which are in fact spurious patterns. It is always very difficult to choose the right threshold value when mining a large collection of data. If we set a very high threshold, we may miss many interesting patterns. Conversely, if we set a very low threshold, finding interesting associated patterns is still not easy, for two reasons: first, the computational and memory requirements of existing analysis algorithms increase considerably; second, the number of extracted patterns also increases substantially. The o-conf measure helps eliminate patterns consisting of data items that are not of interest. Moreover, o-conf involves no extra computational cost, as it depends only on the support values of the individual data items and their combinations. This can be understood from the following definition. Suppose Tc is the given threshold and P = {I1, I2, ..., In} is a pattern. P is a cross-support pattern with respect to Tc if there exist two data items I1 and I2 in P such that supp({I1}) / supp({I2}) < Tc, where 0 < Tc < 1.
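The cross-support check just defined reduces to comparing the smallest and largest single-item supports in a pattern, since the ratio supp({I1})/supp({I2}) is minimized by that pair. A minimal sketch (item names and supports are hypothetical):

```python
def is_cross_support(pattern, supp, tc):
    """A pattern is cross-support w.r.t. tc if some pair of its items has a
    support ratio below tc, i.e. min supp / max supp < tc."""
    s = [supp[i] for i in pattern]
    return min(s) / max(s) < tc

supp = {"caviar": 0.01, "milk": 0.50}  # highly skewed single-item supports
print(is_cross_support(["caviar", "milk"], supp, tc=0.6))  # True: 0.01/0.5 < 0.6
```

A True result means the pattern mixes items from substantially different support levels and can be pruned as spurious.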

2.1.4 All-Confidence
Omiecinski proposed the concept of all-confidence [10] as an alternative to support. The all-confidence of an itemset is the minimum confidence over all association rules that can be extracted from that itemset. Omiecinski's all-confidence possesses the desirable anti-monotone property.
The all-confidence measure for an itemset P = {i1, i2, ..., im} is given by min({conf(A → B) | A, B ⊂ P, A ∪ B = P, A ∩ B = ∅}), which is equal to:
supp({i1, i2, ..., im}) / max{1<=k<=m} {supp({ik})}
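The equality between the rule-based minimum and the closed form can be checked numerically. The sketch below (toy transactions, illustrative names) enumerates every rule A → P \ A over a small transaction set and compares the minimum rule confidence with supp(P) divided by the largest single-item support.

```python
from itertools import combinations

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]

def supp(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

P = ("a", "b")
# Minimum confidence over all rules A -> P \ A, A a non-empty proper subset.
rule_min = min(
    supp(P) / supp(A)
    for k in range(1, len(P))
    for A in combinations(P, k)
)
closed_form = supp(P) / max(supp((i,)) for i in P)
print(rule_min == closed_form)  # True
```

The two agree because conf(A → P \ A) = supp(P)/supp(A), and supp(A) is maximized (so the confidence minimized) at the highest-support single item, support being anti-monotone.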

2.2 Algorithm for maximal optimized patterns
Input
I: item set stored in the database, containing the list of transactions with their items and corresponding supports
Min_threshold: minimum threshold value of o-confidence
** Note: the value of Min_threshold is provided by the user.
Variables
Optimized: optimized pattern set
Maximal_Optimized: maximal optimized pattern set
Optimized_Pattern_Evaluation(): function for evaluating the optimized pattern set
Maximal_Optimized_Pattern_Evaluation(): function for evaluating the maximal optimized pattern set
Method
I: Extracting maximal optimized patterns
Optimized = Optimized_Pattern_Evaluation(I, Min_threshold)
{
1. For each element in I, access the support value.
2. Create candidate patterns with items belonging to different levels of support.
3. Perform pruning based on the anti-monotone property.
4. Perform pruning based on the cross-support property.
5. Retain the optimized patterns (i.e. Optimized) with o-confidence >= Min_threshold.
}
Maximal_Optimized = Maximal_Optimized_Pattern_Evaluation(Optimized)
{
1. Find an optimized pattern X such that X is a subset of Y, with both X and Y in Optimized.
2. Optimized = Optimized - X.
3. Repeat steps 1-2 until there exists no X that is a subset of some Y, both belonging to Optimized.
4. Set Maximal_Optimized = Optimized.
}
II: Performing clustering
3. Explanation of the Algorithm

I provides the complete set of data items stored in the database on which the mining is to be performed, and Min_threshold defines the minimum threshold confidence specified by the user.

Optimized and Optimized_Pattern_Evaluation() are, respectively, the variable for storing optimized patterns and the function for calculating them. Similarly, Maximal_Optimized and Maximal_Optimized_Pattern_Evaluation() are, respectively, the variable for storing maximal optimized patterns and the function for calculating them.

Optimized_Pattern_Evaluation() is the function used for calculating the optimized patterns. It takes the item set and the Min_threshold value as input parameters and uses them to produce optimized patterns, with candidate patterns created across different levels of support. After this, pruning is performed on these candidate patterns using the anti-monotone and cross-support properties. The resulting set of optimized patterns may contain redundant patterns, i.e. one pattern may be a subset of another. These patterns, which provide redundant information, need to be removed. They are removed by the function Maximal_Optimized_Pattern_Evaluation(), which takes Optimized as its input parameter.
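Putting the two functions together, the algorithm can be sketched end to end. This is a minimal brute-force sketch under stated assumptions, not the paper's implementation: candidate generation enumerates all itemsets rather than pruning incrementally, o-confidence is computed directly from transaction supports, and the maximal step drops any pattern subsumed by another.

```python
from itertools import combinations

def maximal_optimized_patterns(transactions, min_threshold):
    """Sketch of Optimized_Pattern_Evaluation followed by
    Maximal_Optimized_Pattern_Evaluation (brute force, for illustration)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def supp(s):
        return sum(1 for t in transactions if s <= t) / n

    # Optimized_Pattern_Evaluation: keep candidates with o-conf >= threshold.
    optimized = []
    for k in range(2, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            joint = supp(s)
            if joint == 0:
                continue
            o_conf = joint / max(supp(frozenset([i])) for i in s)
            if o_conf >= min_threshold:
                optimized.append(s)

    # Maximal_Optimized_Pattern_Evaluation: drop patterns that are proper
    # subsets of another optimized pattern (they carry redundant information).
    return [p for p in optimized if not any(p < q for q in optimized)]

transactions = [{"bread", "butter"}, {"bread", "butter", "milk"},
                {"bread", "butter"}, {"milk"}]
print(maximal_optimized_patterns(transactions, 0.8))
# the only maximal optimized pattern at this threshold: {bread, butter}
```

At threshold 0.8 only {bread, butter} survives (o-conf = 0.75/0.75 = 1.0); lowering the threshold to 0.3 admits the full triple, which then subsumes all pairs.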

4. Conclusion
This algorithm can be used even on highly skewed item sets and proves beneficial by removing spurious patterns and generating strong-affinity patterns. If, on the other hand, conventional clustering algorithms are used for mining associated patterns, they are not effective: most clustering algorithms defined so far rely purely on a support-based pruning strategy, and this strategy proves ineffective when used on highly skewed data sets.
References
[1] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[2] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In: Proc. Conf. on the Management of Data (SIGMOD'00, Dallas, TX). ACM Press, New York, NY, USA, 2000.
[3] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 2003.
[4] C. Borgelt. An Implementation of the FP-growth Algorithm. In: Proc. Workshop on Open Source Software for Data Mining (OSDM'05 at KDD'05, Chicago, IL). ACM Press, New York, NY, USA, 2005.
[5] C. Borgelt. SaM: Simple Algorithms for Frequent Item Set Mining. IFSA/EUSFLAT 2009 Conference, 2009.
[6] E. Berglund and J. Sitte. The parameterless self-organizing map algorithm. IEEE Transactions on Neural Networks, vol. 17, no. 2, 2006, pp. 305-316.
[7] M. Van Hulle. Faithful Representations and Topographic Maps: From Distortion- to Information-Based Self-Organization. John Wiley, New York, 2000.
[8] T. Kohonen. Self-Organizing Maps. 3rd ed., Springer, Berlin, 2001.
[9] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., vol. 21, no. 6, 1953, pp. 1087-1092.
[10] R.A. Fisher. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, vol. 7, 1936, pp. 179-188.