 Open Access
 Total Downloads : 973
 Authors : Anand M. Baswade, Kalpana D. Joshi, Prakash S. Nalwade
 Paper ID : IJERTV1IS10227
 Volume & Issue : Volume 01, Issue 10 (December 2012)
Published (First Online): 28-12-2012
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Comparative Study of K-Means and Weighted K-Means for Clustering
Anand M. Baswade M.Tech. Student of CSE Dept. SGGSIE&T, Nanded, India
Kalpana D. Joshi M.Tech. Student of CSE Dept. SGGSIE&T, Nanded, India
Prakash S. Nalwade Associate Professor SGGSIE&T, Nanded, India
Abstract
k-means [1] is one of the most important algorithms for clustering. A major problem of using k-means type algorithms in data mining is the selection of variables (attributes). The k-means algorithm cannot select variables automatically because it treats all variables equally in the clustering process, which can result in poor clustering. To obtain good clustering results it is important to identify a subset of all the variables, so that this subset can be used for clustering. A new k-means type clustering algorithm called W-k-means [2] can automatically calculate variable weights, so the algorithm can be used for variable selection in data mining applications where large and complex real data are often involved. This paper gives detailed information about the k-means and W-k-means algorithms and their performance.
Keywords: Data mining, clustering, feature evaluation and selection.

Introduction
Data clustering is a process of partitioning a set of objects into clusters such that objects in the same cluster are more similar to each other than to objects in different clusters according to some defined criteria. In other words, the data are grouped into clusters according to their internal character: the elements in each cluster should be as similar as possible, and the difference between clusters should be as big as possible. The k-means clustering algorithms [1],[3] are widely used in real-world applications such as marketing research and data mining to cluster very large data sets due to their efficiency. Among clustering algorithms, the k-means algorithm can be applied in many fields, including image and audio data compression, preprocessing for system modelling with radial basis function networks, and task decomposition in heterogeneous neural network structures.
There are some well-known weaknesses of the k-means algorithm, such as the choice of the number of clusters and the selection of the initial centroids, but in this paper we discuss another problem of k-means; its solution, W-k-means, is also discussed here.
A major problem of using the k-means algorithms in data mining is the selection of variables. The k-means type algorithms cannot select variables automatically because they treat all variables equally in the clustering process. In practice, selection of variables for a clustering problem such as customer segmentation is often made based on an understanding of the business problem and the data to be used. Tens or hundreds of variables are usually extracted or derived from the database in the initial selection, which forms a very high-dimensional space. It is well known that an interesting clustering structure usually occurs in a subspace defined by a subset of the initially selected variables. To find the clustering structure, it is important to identify that subset of variables [4],[5]. The newly proposed algorithm, W-k-means, helps to find out which variables are most important and which are least important by calculating the weight of each variable during each iteration. The variables having the least weight can then be removed, and clustering is performed on the remaining variables to achieve good clustering results.
The variable weights produced by W-k-means measure the importance of variables in clustering. The small weights reduce or eliminate the effect of insignificant (or noisy) variables. The weights can be used for variable selection in data mining applications where large and complex real data are often involved.
In this paper we have studied the most important data mining algorithm, k-means, and an extension of k-means that adds weights to the variables, W-k-means, for clustering.
k-Means Algorithm
The traditional k-means algorithm is based on decomposition and is the most widely used clustering method in the data mining field. The concept is to use k as a parameter and divide the n objects into k clusters, so that similarity within a cluster is relatively high and similarity between clusters is relatively low, minimizing the total distance between the values in each cluster and the cluster centre. The centre of each cluster is the mean value of the objects in the cluster, and similarity is computed with respect to this mean value. The distance between objects is calculated using the Euclidean distance: the closer the distance, the bigger the similarity of two objects, and vice versa.
Let X = {X_1, X_2, ..., X_n} be the n objects,
X_i = {x_{i,1}, x_{i,2}, ..., x_{i,m}} be the m variable values of object X_i,
Z = {Z_1, Z_2, ..., Z_k} be the k centroids, and
u_{i,l} ∈ {0, 1}, where U = [u_{i,l}] is an n × k matrix and u_{i,l} = 1 if object X_i is allocated to cluster l.

k-Means Algorithm
Input: k, Data.
Output: n objects partitioned into k clusters.
Step 1. Choose k random positions in the input space.
Step 2. Assign the cluster centres Z_l to those positions.
Step 3. For each data point X_i, compute the distance

    d(X_i, Z_l) = Σ_{j=1}^{m} (x_{i,j} − z_{l,j})²

to each centroid, and assign X_i to the cluster with the minimum distance:

    u_{i,l} = 1 if d(X_i, Z_l) ≤ d(X_i, Z_t) for 1 ≤ t ≤ k,
    u_{i,t} = 0 for t ≠ l.

Step 4. For each cluster l, move the position of Z_l to the mean of the points in that cluster:

    z_{l,j} = Σ_{i=1}^{n} u_{i,l} x_{i,j} / Σ_{i=1}^{n} u_{i,l}, for 1 ≤ l ≤ k and 1 ≤ j ≤ m.

Step 5. Stopping criteria:
– No (or minimal) reassignment of data points to different clusters, i.e. the u_{i,l} remain unchanged, OR
– No (or minimal) change of centroids, i.e. the Z_l remain unchanged.

Performance analysis

Advantages:
– k-means is a classical algorithm for clustering problems; it is simple, fast, and easy to implement and understand.
– The complexity of the k-means algorithm is O(ntk), where n is the number of objects, t is the number of iterations, and k is the number of clusters.

Disadvantages:
– k-means can only be used when the average value is defined. This may not suit some applications, such as clustering mobile objects or data with categorical attributes.
– In the k-means algorithm the user needs to specify the number of clusters k.
– It is sensitive to the initial centroids; changing the initial centroids can lead to different clustering results.
– k-means does not fit non-convex clusters or clusters with big differences in size. Besides, it is sensitive to noisy data and isolated points, as a small amount of such data can have a huge effect on the mean values. In other words, the k-means algorithm is unable to handle noisy data and outliers.

The above listed disadvantages are discussed in many research papers, but there is one more major problem with k-means: the k-means type algorithms treat all variables equally in deciding the cluster memberships of objects. This is not desirable in many applications, such as data mining, where data often contain a large number of diverse variables. A cluster structure in a given data set is often confined to a subset of variables rather than the entire variable set, and the inclusion of other variables can only obscure the discovery of the cluster structure by a clustering algorithm. To overcome this major problem a new k-means type algorithm, W-k-means, was designed.
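The k-means steps above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not the authors' MATLAB implementation; the function and parameter names are ours.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n, m) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: place the k initial centroids at k distinct random data points.
    Z = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: squared Euclidean distance of every object to every centroid,
        # then assign each object to its nearest centroid.
        d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 5: no reassignments, so the memberships are unchanged
        labels = new_labels
        # Step 4: move each centroid to the mean of the objects assigned to it.
        for l in range(k):
            if np.any(labels == l):
                Z[l] = X[labels == l].mean(axis=0)
    return labels, Z
```

Note that every object contributes equally along every variable in the distance `d`; this is exactly the equal-treatment-of-variables behaviour that W-k-means addresses.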




W-k-Means Algorithm
The new k-means type clustering algorithm called W-k-means [2] can automatically calculate variable weights. The variable weights produced by the algorithm measure the importance of each variable in clustering, so the algorithm can be used for variable selection in data mining applications where large and complex real data are often involved: select the higher-weighted variables and remove the lower-weighted variables for good clustering results.

Let W = {w_1, w_2, ..., w_m} be the weights for the m variables, such that Σ_{j=1}^{m} w_j = 1.

The objective of W-k-means is to minimize:

    P(U, Z, W) = Σ_{l=1}^{k} Σ_{i=1}^{n} Σ_{j=1}^{m} u_{i,l} w_j^β d(x_{i,j}, z_{l,j})

where β is a user-chosen exponent and D_j = Σ_{l=1}^{k} Σ_{i=1}^{n} u_{i,l} d(x_{i,j}, z_{l,j}) denotes the sum of the within-cluster distances for variable j.

When β = 1, w_j is equal to 1 for the smallest value of D_j and the other weights are equal to 0. Although the objective function is minimized, the clustering is made by the selection of only one variable, which may not be desirable for high-dimensional clustering problems.
When 0 < β < 1, the larger D_j, the larger w_j, and so the larger w_j^β. This is against the variable weighting principle, so we cannot choose 0 < β < 1.
When β > 1, the larger D_j, the smaller w_j, and the smaller w_j^β; the effect of a variable with large D_j is reduced.
When β < 0, the larger D_j, the larger w_j. However, w_j^β becomes smaller and gives less weight to the variable in the distance calculation because of the negative β.
From the above analysis, we can choose β < 0 or β > 1 in the W-k-means algorithm.

Since the W-k-means algorithm is an extension of the k-means algorithm that adds one new step to calculate the variable weights in the iterative process, it does not seriously affect the scalability of the k-means type algorithms in clustering large data; therefore, it is suitable for data mining applications. The principle for variable weighting is to assign a larger weight to a variable that has a smaller sum of within-cluster distances and a smaller weight to a variable that has a larger sum of within-cluster distances.

W-k-Means Algorithm
Step 1. Randomly choose an initial Z^0 = {Z_1, ..., Z_k} and randomly generate a set of initial weights W^0 = {w_1^0, ..., w_m^0} (with Σ_{j=1}^{m} w_j = 1). Determine U^0 such that P(U^0, Z^0, W^0) is minimized. Set t = 0.
Step 2. Let Ẑ = Z^t and Ŵ = W^t, and solve problem P(U, Ẑ, Ŵ) to obtain U^{t+1}:

    u_{i,l} = 1 if Σ_{j=1}^{m} w_j^β d(x_{i,j}, z_{l,j}) ≤ Σ_{j=1}^{m} w_j^β d(x_{i,j}, z_{t,j}) for 1 ≤ t ≤ k,
    u_{i,t} = 0 for t ≠ l.

If P(U^{t+1}, Ẑ, Ŵ) = P(U^t, Ẑ, Ŵ), output (U^{t+1}, Ẑ, Ŵ) and stop; otherwise go to Step 3.
Step 3. Let Û = U^{t+1} and Ŵ = W^t, and solve problem P(Û, Z, Ŵ) to obtain Z^{t+1}:

    z_{l,j} = Σ_{i=1}^{n} u_{i,l} x_{i,j} / Σ_{i=1}^{n} u_{i,l}, for 1 ≤ l ≤ k and 1 ≤ j ≤ m.

If P(Û, Z^{t+1}, Ŵ) = P(Û, Z^t, Ŵ), output (Û, Z^{t+1}, Ŵ) and stop; otherwise go to Step 4.
Step 4. Let Û = U^{t+1} and Ẑ = Z^{t+1}, and solve problem P(Û, Ẑ, W) to obtain W^{t+1}:

    w_j = 0 if D_j = 0,
    w_j = 1 / Σ_{t=1}^{h} [D_j / D_t]^{1/(β−1)} if D_j ≠ 0,

where h is the number of variables with D_t ≠ 0. If P(Û, Ẑ, W^{t+1}) = P(Û, Ẑ, W^t), output (Û, Ẑ, W^{t+1}) and stop; otherwise set t = t + 1 and go to Step 2.

After getting the weights of the variables we can select the higher-weighted variables and neglect the lower-weighted variables for better clustering results. MATLAB was used for implementing the k-means and W-k-means algorithms, and data sets were obtained from the UCI Machine Learning Repository [6] for comparing the results of the two algorithms.
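The extra step that distinguishes W-k-means from k-means is the weight update of [2]. A sketch of that single step in Python with NumPy follows (names are illustrative; β > 1 is assumed, and d is the squared Euclidean distance as above):

```python
import numpy as np

def update_weights(X, Z, labels, beta=8.0):
    """One W-k-means weight update (after Huang et al. [2]), beta > 1 assumed.

    X is (n, m) data, Z is (k, m) centroids, labels[i] is the cluster of
    object i. D[j] is the total within-cluster distance along variable j;
    variables with small D[j] receive large weights and vice versa."""
    # D_j = sum over objects i of (x_{i,j} - z_{c(i),j})^2, c(i) = cluster of i
    D = ((X - Z[labels]) ** 2).sum(axis=0)
    w = np.zeros_like(D)
    nz = D != 0  # variables with D_j = 0 keep weight 0
    exp = 1.0 / (beta - 1.0)
    # w_j = 1 / sum_t (D_j / D_t)^(1/(beta-1)), summed over D_t != 0
    w[nz] = 1.0 / np.array([((Dj / D[nz]) ** exp).sum() for Dj in D[nz]])
    return w
```

By construction the nonzero weights sum to 1, and a variable whose within-cluster spread is zero relative to the others absorbs all the weight, matching the β analysis above.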


Conclusions
A new k-means type algorithm called W-k-means can automatically weight variables based on the importance of the variables in clustering. By knowing the weights of the variables we can select those variables which lead to better clustering results. This capability is very useful in data mining applications where large and complex data sets are often involved. The weights can be used to identify the important variables for clustering, and the variables which may contribute noise to the clustering process can be removed from the data in future analysis.
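The variable-removal idea in the conclusion can be sketched as a simple filter on the learned weights. The threshold rule below (keep variables weighted above the uniform level 1/m) is our illustrative choice, not one prescribed by the paper:

```python
import numpy as np

def select_variables(X, w, threshold=None):
    """Keep only the columns of X whose learned weight w_j is large enough.

    If no threshold is given, keep variables weighted above 1/m, i.e. above
    what each variable would get if all m variables mattered equally
    (an illustrative rule, not one from the paper)."""
    w = np.asarray(w, dtype=float)
    if threshold is None:
        threshold = 1.0 / len(w)
    keep = w >= threshold          # boolean mask over the m variables
    return X[:, keep], keep
```

Clustering would then be re-run on the reduced matrix, as the paper suggests, so that low-weight (noisy) variables no longer influence the cluster memberships.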

References

[1] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proc. Fifth Berkeley Symp. Math. Statistics and Probability, pp. 281-297, 1967.
[2] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated Variable Weighting in k-Means Type Clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668, May 2005.
[3] Z. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304, 1998.
[4] E. Fowlkes, R. Gnanadesikan, and J. Kettenring, "Variable Selection in Clustering," J. Classification, vol. 5, pp. 205-228, 1988.
[5] J. H. Friedman and J. J. Meulman, "Clustering Objects on Subsets of Attributes," J. Royal Statistical Soc. B, 2002.
[6] UCI Machine Learning Repository.