 Open Access
 Total Downloads : 796
 Authors : Anand Sutariya, Kiran Amin
 Paper ID : IJERTV2IS1385
 Volume & Issue : Volume 02, Issue 01 (January 2013)
 Published (First Online): 30012013
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
An Improvement in Kmeans Clustering Algorithm
Anand Sutariya1, Prof. Kiran Amin2
1PG Student, U.V.Patel College of Engineering, Ganpat University, Mehsana, Gujarat
2Head, CE Dept., U.V.Patel College of Engineering, Ganpat University, Mehsana, Gujarat
Abstract
Data Mining is the process of extracting information from large data sets through the use of algorithms and techniques drawn from the field of Statistics, Machine Learning and Data Base Management Systems. Cluster analysis is one of the major data analysis methods and the kmeans clustering algorithm is widely used for many practical applications. But the original kmeans algorithm is computationally expensive and the quality of the resulting clusters heavily depends on the selection of initial centroids. Several methods have been proposed for improving the performance of the k means clustering algorithm. But still there are many problems in original kmeans algorithm. So, we have proposed the improved algorithm of kmeans for improving the performance of the algorithm.

Introduction
Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.
Clustering is the grouping of similar objects and a cluster of a set is a partition of its elements that is chosen to minimize some measure of dissimilarity [1]. Unlike classification which is a supervised learning technique, clustering is a type of unsupervised learning. In clustering, objects in the dataset are grouped into clusters, such that groups are very different from each other and the objects in the same group are very similar to each other. In this case, clusters are not predefined which means that result clusters are not known before the execution of clustering algorithm. These clusters are extracted from the dataset by grouping the objects in it. For some algorithms, number of desired clusters is supplied to the algorithm, whereas some others determine the number of groups themselves for the best clustering result. Clustering of a dataset gives
information on both the overall dataset and characteristics of objects in it [2].

Partitioning Methods
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different. There are various kinds of other criteria for judging the quality of partitions.

Hierarchical Methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottomup approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the topdown approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

Kmeans Algorithm
Kmeans is a data mining algorithm which performs clustering. As mentioned previously, clustering is dividing a dataset into a number of groups such that similar items fall into same groups [1]. Clustering uses unsupervised learning technique which means that result clusters are not known before the execution of clustering algorithm unlike the case in classification. Some clustering algorithms takes the number of desired clusters as input while some others decide the number of result clusters themselves.

means algorithm uses an iterative procedure in order to cluster database [3]. It takes the number of desired clusters and the initial means as inputs and produces final means as output. Mentioned initial and final means are the means of clusters. If the algorithm is required to produce K clusters then there will be K initial means and K final means. In completion, K means algorithm produces K final means which answers why the name of algorithm is Kmeans.
After termination of Kmeans clustering, each object in dataset becomes a member of one cluster. This cluster is determined by searching throughout the means in order to find the cluster with nearest mean to the object. Shortest distanced mean is considered to be the mean of cluster to which examined object belongs. Kmeans algorithm tries to group the items in dataset into desired number of clusters. To perform this task it makes some iteration until it converges. After each iteration, calculated means are updated such that they become closer to final means. And finally, the algorithm converges and stops performing iterations.
Steps of Algorithm:
Input:
D = {d1, d2,……,dn} //set of n data items.
k // Number of desired clusters Output: A set of k clusters.
Steps:

Arbitrarily choose k dataitems from D as initial centroids;

Repeat

Assign each item di to the cluster which has the closest centroid;
Calculate new mean for each cluster; Until convergence criteria is met.



Related Work
Fang Yuan et al. [4] proposed a systematic method for finding the initial centroids. The centroids obtained by this method are consistent with the distribution of data. Hence it produced clusters with better accuracy, compared to the original k means algorithm. However, Yuans method does not suggest any improvement to the time complexity of the kmeans algorithm.
Sun Shibao et al. [5] proposed an improved kmeans algorithm based on weights. This is a new partitioning clustering algorithm, which can handle the data of numerical attribute, and it also can handle the data of symbol attribute. Meanwhile, this method reduces the impact of isolated points and the noise, so it enhances the efficiency of clustering. However, this method has no improvement on the complexity of time.
Kmeans algorithm is a popular partition algorithm in cluster analysis, which has some limitations when there are some restrictions in computing resources and time, especially for huge size dataset. YuFang Zhang et al. [6] proposed the improved Kmeans algorithm is a solution to handle large scale data. However, this method has no improvement on the complexity of time.
Kmeans algorithm has higher efficiency and scalability and converges fast when dealing with large data sets. However it also has many deficiencies: the number of clusters K needs to be initialized, the initial cluster centers are arbitrarily selected, and the algorithm is influenced by the noise points. In view of the shortcomings of the traditional Kmeans clustering algorithm, Juntao Wang et al. [7] proposed a improved Kmeans algorithm using noise data filter. The algorithm developed densitybased detection methods based on characteristics of noise data where the discovery and processing steps of the noise data are added to the original algorithm. However, this method has no improvement on the complexity of time.
The kmeans algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. Zhexu Huang [10] proposed two algorithms which extend the kmeans algorithm to categorical domains and domains with mixed numeric and categorical values. However, this method has no improvement on the complexity of time as well as final cluster.

Proposed Solution
Based on above related work we can figure out that all the algorithms are suffers in term of time complexity as well as accuracy.
So, we have proposed new Improved Kmeans algorithm which can handle the above problem very well.
Improved Kmeans Algorithm Steps of Algorithm:
Input:
D = {d1, d2,……,dn} // set of n data items k // Number of desired clusters
Output: A set of k clusters. Steps:
Phase 1: Determine the initial centroids of the clusters by using Phase 1.
Phase 2: Assign each data point to the appropriate clusters by using Phase 2
In the first phase, the initial centroids are determined systematically so as to produce clusters with better accuracy. The second phase assigns each data point to the appropriate clusters. The two phases of the improved method are described below as Phase 1 and Phase 2.
Phase 1: Finding the initial centroids.
Input:
D = {d1, d2,……,dn} // set of n data items k // Number of desired clusters
Output: A set of k initial centroids. Steps:

Set m = 1;

Compute the distance between each data point and all other data points in the set D;

Find the closest pair of data points from the set D and form a datapoint set Am (1<= m <= k) which contains these two data points, Delete these two data points from the set D;

Find the data point in D that is closest to the data point set Am, Add it to Am and delete it from D;

Repeat step 4 until the number of data points in Am reaches 0.75*(n/k);

If m<k, then m = m+1, find another pair of data points from D between which the distance is the shortest, form another data point set Am and delete them from D, Go to step 4;

For each datapoint set Am (1<=m<=k) find the arithmetic mean of the vectors of data points in Am, these means will be the initial centroids.
Phase 1 [4] describes the method for finding initial centroids of the clusters. Initially, compute the distances between each data point and all other data points in the set of data points. Then find out the closest pair of data points and form a set A1 consisting of these two data points, and delete them from the data point set

Then determine the data point which is closest to the set A1, add it to A1 and delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. At that point go back to the second step and form another datapoint set A2. Repeat this till k such sets of data points are obtained. Finally the initial centroids are obtained by averaging all the vectors in each datapoint set. The Euclidean distance is used for determining the closeness of each data point to the cluster centroids. The distance between one vector X = (x1, x2, ….xn) and another vector Y = (y1, y2,
.yn) is obtained as,
The distance between a data point X and a data point set D is defined as
d(X, D) = min (d (X, Y), where Y D).
The initial centroids of the clusters are given as input to the second phase, for assigning datapoints to appropriate clusters. The steps involved in this phase are outlined as Phase 2.
Phase 2: Assigning data points to clusters.
Input:
D = {d1, d2,……,dn} // set of n datapoints.
C = {c1, c2,…….,ck} // set of k centroids
Output: A set of k clusters Steps:

Compute the distance of each datapoint di (1<=i<=n) to all the centroids cj (1<=j<=k) as d(di, cj);

For each datapoint di, find the closest centroid cj and assign di to cluster j.

Set ClusterId[i] = j; // j: Id of the closest cluster

Set Nearest Dist[i] = d(di, cj);

For each cluster j (1<=j<=k), recalculate the centroids;

Repeat

For each datapoint di,

Compute its distance from the centroid of the present nearest cluster;

If this distance is less than or equal to the present nearest distance, the datapoint stays in the cluster;
Else

For every centroid cj (1<=j<=k) Compute the distance d(di, cj); Endfor;

Assign the datapoint di to the cluster with the nearest centroid cj.

Set ClusterId[i] = j;

Set Nearest_Dist[i] = d(di, cj);
Endfor;



For each cluster j (1<=j<=k), recalculate the centroids; until the convergence criteria is met.

The first step in Phase 2 is to determine the distance between each datapoint and the initial centroids of all the clusters. The datapoints are then assigned to the
clusters having the closest centroids. This results in an initial grouping of the datapoints. For each datapoint, the cluster to which it is assigned (ClusterId) and its distance from the centroid of the nearest cluster (Nearest_Dist) are noted. Inclusion of datapoints in various clusters may lead to a change in the values of the cluster centroids. For each cluster, the centroids are recalculated by taking the mean of the values of its datapoints. Up to this step, the procedure is almost similar to the original kmeans algorithm except that the initial centroids are computed systematically.
The next stage is an iterative process which makes use of a heuristic method to improve the efficiency. During the iteration, the datapoints may get redistributed to different clusters. The method involves keeping track of the distance between each datapoint and the centroid of its present nearest cluster. At the beginning of the iteration, the distance of each data point from the new centroid of its present nearest cluster is determined. If this distance is less than or equal to the previous nearest distance, that is an indication that the data point stays in that cluster itself and there is no need to compute its distance from other centroids. These results in the saving of time required to compute the distances to k1 cluster centroids. On the other hand, if the new centroid of the present nearest cluster is more distant from the datapoint than its previous centroid, there is a chance for the data point getting included in another nearer cluster. In that case, it is required to determine the distance of the data point from all the cluster centroids. Identify the new nearest cluster and record the new value of the nearest distance. The loop is repeated until no more datapoints cross cluster boundaries, which indicates the convergence criterion. The heuristic method described above results in significant reduction in the number of computations and thus improves the efficiency.

Experiment and Results
Improved algorithm of the Kmeans has been designed and implemented in this project for the purpose of improvement of Kmeans algorithm in execution time and to get the better accuracy. Original Kmeans algorithm has also been implemented for the purpose of comparison with Improved Kmeans algorithm in time and accuracy. Both implementations have been tested on the same environment which is JAVA Programming Language.
The multivariate and sequential Libras Movemet dataset has 360 instances and 90 attributes, iris dataset has 150 instances and 4 attributes, wine quality dataset has 6497 instances and 12 attributes, blood transfusion
dataset has 748 instances and 5 attributes and TAE(Teaching Assistant Evaluation) dataset has 151 instances and 6 attributes. All these datasets taken from the UCI repository of machine learning database [9], is used for testing the efficiency and accuracy of the improved algorithm. The same data set is given as input to the original kmeans algorithm and the improved algorithm.
The results of experiments are tabulated in Table 1 and Table 2. In Table 1 we compare both the algorithm for efficiency in term of time for all the datasets. In Table 2 we compare both the algorithms for accuracy for all datasets.
Table 1. Comparison of efficiency of two algorithms
Algorithms Datasets
Kmeans
Improved Kmeans
Centroid Selected randomly
Centroid computed by program
Libras Movement
Time taken in (ms)
478
261
Iris
20
10
TAE
20
02
Wine quality
426
112
Blood Transfusion
35
05
Table 2. Comparison of accuracy of two algorithms
outperforms the standard Kmeans algorithm in terms of efficiency and accuracy.
Figure 1. Comparison of efficiency of two algorithms
Figure 2. Comparison of accuracy of two algorithms
Algorithms
Datasets
Kmeans
Improved Kmeans
Centroid Selected randomly
Centroid computed by program
Libras Movement
Accuracy (in %)
59
61
Iris
66
90
TAE
58
61
Wine quality
26
30
Blood Transfusion
59
64
Algorithms
Datasets
Kmeans
Improved Kmeans
Centroid Selected randomly
Centroid computed by program
Libras Movement
Accuracy (in %)
59
61
Iris
66
90
TAE
58
61
Wine quality
26
30
Blood Transfusion
59
64
Figure 1 and Figure 2 depicts the performances of the standard Kmeans algorithm and the Improved K means algorithm in terms of the efficiency and accuracy for all the datasets. It can be seen from the results that the improved algorithm significantly

Conclusion
The original kmeans algorithm is widely used for clustering large sets of data. But it does not always guarantee for good results, as the accuracy of the final clusters depend on the selection of initial centroids.
Moreover, the computational complexity of the original algorithm is very high because it reassigns the data points a number of times during every iteration of the loop. Here we presents an improved kmeans algorithm which combines a systematic method for finding initial centroids and an efficient way for assigning data points to clusters. This method ensures the entire process of clustering in time without sacrificing accuracy of clusters.

References

S. Kantabutra, Parallel Kmeans Clustering Algorithm on NOWs, Department of Computer Science, Tufts University, 1999

Berkhin Pavel, A Survey of Clustering Data Mining Techniques, Springer Berlin Heidelberg, 2006.

R. Ali, U. Ghani, A. Saeed, Data Clustering and Its Applications, Rudjer Boskovic Institute, 2001

Yuan F, Meng Z. H, Zhang H. X and Dong C. R, A New Algorithm to Get the Initial Centroids, Proc. of the 3rd International Conference on Machine Learning and Cybernetics, pages 2629, August 2004.

Sun Shibao, Qin Keyun,Research on Modified kmeans Data Cluster Algorithm, Computer Engineering, vol.33, No.13, pp.200201, July 2007.

YuFang Zhang, Jiali mao and ZhongYang Xiong, AN EFFICIENT CLUSTERING ALGORITHM, Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xian, 25 November 2003

Juntao Wang and Xiaolong Su, An improved Kmeans clustering algorithm