 Open Access
 Total Downloads : 973
 Authors : Anand M. Baswade, Kalpana D. Joshi, Prakash S. Nalwade
 Paper ID : IJERTV1IS10227
 Volume & Issue : Volume 01, Issue 10 (December 2012)
Published (First Online): 28-12-2012
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Comparative Study of K-Means and Weighted K-Means for Clustering
Anand M. Baswade M.Tech. Student of CSE Dept. SGGSIE&T, Nanded, India
Kalpana D. Joshi M.Tech. Student of CSE Dept. SGGSIE&T, Nanded, India
Prakash S. Nalwade Associate Professor SGGSIE&T, Nanded, India
Abstract
k-means [1] is one of the most important algorithms for clustering. A major problem of using k-means type algorithms in data mining is the selection of variables (attributes). The k-means algorithm cannot select variables automatically because it treats all variables equally in the clustering process, which can result in poor clustering. To obtain good clustering results it is important to identify a subset of all the variables, so that this subset can be used for clustering. A new k-means type clustering algorithm called W-k-means [2] can automatically calculate variable weights, so the algorithm can be used for variable selection in data mining applications where large and complex real data are often involved. This paper gives detailed information about the k-means and W-k-means algorithms and their performance.
Keywords: Data mining, clustering, feature evaluation and selection.

Introduction
Data clustering is a process of partitioning a set of objects into clusters such that objects in the same cluster are more similar to each other than to objects in different clusters according to some defined criteria. In other words, the data are grouped into clusters according to their internal character: the elements in each cluster should be as similar as possible, and the difference between clusters should be as big as possible. The k-means clustering algorithms [1],[3] are widely used in real-world applications such as marketing research and data mining to cluster very large data sets due to their efficiency. Among clustering algorithms, the k-means algorithm can be applied in many fields, including image and audio data compression, preprocessing for system modelling with radial basis function networks, and task decomposition in heterogeneous neural network structures.
There are some well-known weaknesses of the k-means algorithm, such as the choice of the number of clusters and the selection of the initial centroids, but in this paper we discuss another problem of k-means; its solution, W-k-means, is also discussed here.
A major problem of using the k-means algorithms in data mining is the selection of variables. The k-means type algorithms cannot select variables automatically because they treat all variables equally in the clustering process. In practice, selection of variables for a clustering problem such as customer segmentation is often made based on an understanding of the business problem and the data to be used. Tens or hundreds of variables are usually extracted or derived from the database in the initial selection, which forms a very high-dimensional space. It is well known that an interesting clustering structure usually occurs in a subspace defined by a subset of the initially selected variables. To find the clustering structure, it is important to identify that subset of variables [4],[5]. The newly proposed algorithm, W-k-means, helps to find out which variables are most important and which are least important by calculating the weight of each variable during each iteration. The variables having the least weight can then be removed, and clustering is performed on the remaining variables to achieve good clustering results.
The variable weights produced by W-k-means measure the importance of variables in clustering. The small weights reduce or eliminate the effect of insignificant (or noisy) variables. The weights can be used for variable selection in data mining applications where large and complex real data are often involved.
In this paper we have studied the most important data mining algorithm, k-means, and an extension of k-means that adds weights to the variables, W-k-means, for clustering.
k-Means Algorithm
The traditional k-means algorithm is based on decomposition and is the most widely used clustering method in the data mining field. The concept is to use k as a parameter and divide the n objects into k clusters, so that similarity within a cluster is relatively high and similarity between clusters is relatively low, minimizing the total distance between the values in each cluster and the cluster centre. The centre of each cluster is the mean value of the objects in the cluster, and similarity is computed with respect to this mean value. The distance between objects is calculated using the Euclidean distance: the closer the distance, the bigger the similarity of two objects, and vice versa.
Let X = {X_1, X_2, ..., X_n} be the n objects,
X_i = {x_{i,1}, x_{i,2}, ..., x_{i,m}} be the m variable values of object X_i,
Z = {Z_1, Z_2, ..., Z_k} be the k centroids, and
u_{i,l} ∈ {0, 1}, where U = [u_{i,l}] is an n × k matrix and u_{i,l} = 1 if object X_i is allocated to cluster l.

k-Means Algorithm
Input: k, Data.
Output: n objects partitioned into k clusters.
Step 1. Choose k random positions in the input space.
Step 2. Assign the cluster centres Z_l to those positions.
Step 3. For each data point X_i, compute the distance

    d(X_i, Z_l) = Σ_{j=1}^{m} (x_{i,j} − z_{l,j})²

to each centroid, and assign X_i to the cluster with the minimum distance:

    u_{i,l} = 1 if d(X_i, Z_l) ≤ d(X_i, Z_t) for 1 ≤ t ≤ k,
    u_{i,t} = 0 for t ≠ l.

Step 4. For each cluster l, move the position of Z_l to the mean of the points in that cluster:

    z_{l,j} = Σ_{i=1}^{n} u_{i,l} x_{i,j} / Σ_{i=1}^{n} u_{i,l}, for 1 ≤ l ≤ k and 1 ≤ j ≤ m.

Step 5. Stopping criteria:
– No (or minimal) reassignment of data points to different clusters, i.e. the u_{i,l} remain unchanged, OR
– No (or minimal) change of centroids, i.e. the Z_l remain unchanged.

Performance analysis

Advantages:
– k-means is a classical algorithm for clustering problems; it is simple, fast, and easy to implement and understand.
– The complexity of the k-means algorithm is O(ntk), where n is the number of objects, t is the number of iterations, and k is the number of clusters.

Disadvantages:
– k-means can only be used when the average value is defined. This may not suit some applications, such as clustering mobile objects or data with categorical attributes.
– In the k-means algorithm the user needs to specify the number of clusters k.
– It is sensitive to the initial centroids; changing the initial centroids can lead to different clustering results.
– k-means does not fit non-convex clusters or clusters with big differences in size. Besides, it is sensitive to noisy data and isolated points, as a small amount of such data can have a huge effect on the mean values. In other words, the k-means algorithm is unable to handle noisy data and outliers.

The above listed disadvantages are discussed in many research papers, but there is one more major problem with k-means: the k-means type algorithms treat all variables equally in deciding the cluster memberships of objects. This is not desirable in many applications, such as data mining, where data often contain a large number of diverse variables. A cluster structure in a given data set is often confined to a subset of variables rather than the entire variable set, and the inclusion of other variables can only obscure the discovery of the cluster structure by a clustering algorithm. To overcome this major problem a new k-means type algorithm, W-k-means, was designed.
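The k-means steps above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not the authors' MATLAB implementation; the function and parameter names are ours.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n, m) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: place the k initial centroids at k distinct random data points.
    Z = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: squared Euclidean distance of every object to every centroid,
        # then assign each object to its nearest centroid.
        d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 5: no reassignments, so the memberships are unchanged
        labels = new_labels
        # Step 4: move each centroid to the mean of the objects assigned to it.
        for l in range(k):
            if np.any(labels == l):
                Z[l] = X[labels == l].mean(axis=0)
    return labels, Z
```

Note that every object contributes equally along every variable in the distance `d`; this is exactly the equal-treatment-of-variables behaviour that W-k-means addresses.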




W-k-Means Algorithm
The new k-means type clustering algorithm called W-k-means [2] can automatically calculate variable weights. The variable weights produced by the algorithm measure the importance of each variable in clustering, so the algorithm can be used for variable selection in data mining applications where large and complex real data are often involved: select the higher-weighted variables and remove the lower-weighted variables for good clustering results.

Let W = {w_1, w_2, ..., w_m} be the weights for the m variables, such that Σ_{j=1}^{m} w_j = 1.

The objective of W-k-means is to minimize:

    P(U, Z, W) = Σ_{l=1}^{k} Σ_{i=1}^{n} Σ_{j=1}^{m} u_{i,l} w_j^β d(x_{i,j}, z_{l,j})

where β is a user-chosen exponent and D_j = Σ_{l=1}^{k} Σ_{i=1}^{n} u_{i,l} d(x_{i,j}, z_{l,j}) denotes the sum of the within-cluster distances for variable j.

When β = 1, w_j is equal to 1 for the smallest value of D_j and the other weights are equal to 0. Although the objective function is minimized, the clustering is made by the selection of only one variable, which may not be desirable for high-dimensional clustering problems.
When 0 < β < 1, the larger D_j, the larger w_j, and so the larger w_j^β. This is against the variable weighting principle, so we cannot choose 0 < β < 1.
When β > 1, the larger D_j, the smaller w_j, and the smaller w_j^β; the effect of a variable with large D_j is reduced.
When β < 0, the larger D_j, the larger w_j. However, w_j^β becomes smaller and gives less weight to the variable in the distance calculation because of the negative β.
From the above analysis, we can choose β < 0 or β > 1 in the W-k-means algorithm.

Since the W-k-means algorithm is an extension of the k-means algorithm that adds one new step to calculate the variable weights in the iterative process, it does not seriously affect the scalability of the k-means type algorithms in clustering large data; therefore, it is suitable for data mining applications. The principle for variable weighting is to assign a larger weight to a variable that has a smaller sum of within-cluster distances and a smaller weight to a variable that has a larger sum of within-cluster distances.

W-k-Means Algorithm
Step 1. Randomly choose an initial Z^0 = {Z_1, ..., Z_k} and randomly generate a set of initial weights W^0 = {w_1^0, ..., w_m^0} (with Σ_{j=1}^{m} w_j = 1). Determine U^0 such that P(U^0, Z^0, W^0) is minimized. Set t = 0.
Step 2. Let Ẑ = Z^t and Ŵ = W^t, and solve problem P(U, Ẑ, Ŵ) to obtain U^{t+1}:

    u_{i,l} = 1 if Σ_{j=1}^{m} w_j^β d(x_{i,j}, z_{l,j}) ≤ Σ_{j=1}^{m} w_j^β d(x_{i,j}, z_{t,j}) for 1 ≤ t ≤ k,
    u_{i,t} = 0 for t ≠ l.

If P(U^{t+1}, Ẑ, Ŵ) = P(U^t, Ẑ, Ŵ), output (U^{t+1}, Ẑ, Ŵ) and stop; otherwise go to Step 3.
Step 3. Let Û = U^{t+1} and Ŵ = W^t, and solve problem P(Û, Z, Ŵ) to obtain Z^{t+1}:

    z_{l,j} = Σ_{i=1}^{n} u_{i,l} x_{i,j} / Σ_{i=1}^{n} u_{i,l}, for 1 ≤ l ≤ k and 1 ≤ j ≤ m.

If P(Û, Z^{t+1}, Ŵ) = P(Û, Z^t, Ŵ), output (Û, Z^{t+1}, Ŵ) and stop; otherwise go to Step 4.
Step 4. Let Û = U^{t+1} and Ẑ = Z^{t+1}, and solve problem P(Û, Ẑ, W) to obtain W^{t+1}:

    w_j = 0 if D_j = 0,
    w_j = 1 / Σ_{t=1}^{h} [D_j / D_t]^{1/(β−1)} if D_j ≠ 0,

where h is the number of variables with D_t ≠ 0. If P(Û, Ẑ, W^{t+1}) = P(Û, Ẑ, W^t), output (Û, Ẑ, W^{t+1}) and stop; otherwise set t = t + 1 and go to Step 2.

After getting the weights of the variables we can select the higher-weighted variables and neglect the lower-weighted variables for better clustering results. MATLAB was used for implementing the k-means and W-k-means algorithms, and data sets were obtained from the UCI Machine Learning Repository [6] for comparing the results of the two algorithms.
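The extra step that distinguishes W-k-means from k-means is the weight update of [2]. A sketch of that single step in Python with NumPy follows (names are illustrative; β > 1 is assumed, and d is the squared Euclidean distance as above):

```python
import numpy as np

def update_weights(X, Z, labels, beta=8.0):
    """One W-k-means weight update (after Huang et al. [2]), beta > 1 assumed.

    X is (n, m) data, Z is (k, m) centroids, labels[i] is the cluster of
    object i. D[j] is the total within-cluster distance along variable j;
    variables with small D[j] receive large weights and vice versa."""
    # D_j = sum over objects i of (x_{i,j} - z_{c(i),j})^2, c(i) = cluster of i
    D = ((X - Z[labels]) ** 2).sum(axis=0)
    w = np.zeros_like(D)
    nz = D != 0  # variables with D_j = 0 keep weight 0
    exp = 1.0 / (beta - 1.0)
    # w_j = 1 / sum_t (D_j / D_t)^(1/(beta-1)), summed over D_t != 0
    w[nz] = 1.0 / np.array([((Dj / D[nz]) ** exp).sum() for Dj in D[nz]])
    return w
```

By construction the nonzero weights sum to 1, and a variable whose within-cluster spread is zero relative to the others absorbs all the weight, matching the β analysis above.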


Conclusions
A new k-means type algorithm called W-k-means can automatically weight variables based on the importance of the variables in clustering. By knowing the weights of the variables we can select those variables which lead to better clustering results. This capability is very useful in data mining applications where large and complex data sets are often involved. The weights can be used to identify the important variables for clustering, and the variables which may contribute noise to the clustering process can be removed from the data in future analysis.
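The variable-removal idea in the conclusion can be sketched as a simple filter on the learned weights. The threshold rule below (keep variables weighted above the uniform level 1/m) is our illustrative choice, not one prescribed by the paper:

```python
import numpy as np

def select_variables(X, w, threshold=None):
    """Keep only the columns of X whose learned weight w_j is large enough.

    If no threshold is given, keep variables weighted above 1/m, i.e. above
    what each variable would get if all m variables mattered equally
    (an illustrative rule, not one from the paper)."""
    w = np.asarray(w, dtype=float)
    if threshold is None:
        threshold = 1.0 / len(w)
    keep = w >= threshold          # boolean mask over the m variables
    return X[:, keep], keep
```

Clustering would then be re-run on the reduced matrix, as the paper suggests, so that low-weight (noisy) variables no longer influence the cluster memberships.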

References

[1] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proc. Fifth Berkeley Symp. Math. Statistics and Probability, pp. 281-297, 1967.
[2] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated Variable Weighting in k-Means Type Clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668, May 2005.
[3] Z. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304, 1998.
[4] E. Fowlkes, R. Gnanadesikan, and J. Kettenring, "Variable Selection in Clustering," J. Classification, vol. 5, pp. 205-228, 1988.
[5] J. H. Friedman and J. J. Meulman, "Clustering Objects on Subsets of Attributes," J. Royal Statistical Soc. B, 2002.
[6] UCI Machine Learning Repository.