 Open Access
 Total Downloads : 447
 Authors : Shaik Jani
 Paper ID : IJERTV1IS5263
 Volume & Issue : Volume 01, Issue 05 (July 2012)
 Published (First Online): 02082012
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
An Efficient KMeans and CMeans Clustering Algorithm for Image Segmentation
Abstract
In this paper, we present a novel algorithm for performing kmeans clustering. It organizes all the patterns in a kd tree structure such that one can find all the patterns which are closest to a given prototype efficiently. The main intuition behind our approach is as follows. All the prototypes are potential candidates for the closest prototype at the root level. However, for the children of the root node, we may be able to prune the candidate set by using simple geometrical constraints. This approach can be applied recursively until the size of the candidate set is one for each node. Our experimental results demonstrate that our scheme can improve the computational speed of the direct kmeans algorithm by an order to two orders of magnitude in the total number of distance calculations and the overall time of computation.
Keywords: Pattern recognition, machine learning, data mining, kmeans clustering, nearestneighbor searching, kd tree, computational geometry, knowledge discovery.
Clustering is the process of partitioning or grouping a given set of patterns into disjoint clusters. This is done such that patterns in the same cluster are alike and patterns belonging to
two different clusters are different. Clustering has been a widely studied problem in a variety of application domains including neural networks, AI, and statistics. Several algorithms have been proposed in the literature for clustering: ISODATA [8, 3], CLARA [8],
CLARANS [10], Focusing Techniques [5] P
CLUSTER [7]. DBSCAN [4], Ejcluster [6],
BIRCH [14] and GRIDCLUS [12].
The kmeans method has been shown to be effective in producing good clustering results for many practical applications. However, a direct algorithm of kmeans method requires time proportional to the product of number of patterns and number of clusters per iteration. This is computationally very expensive especially for large datasets. We propose a novel algorithm for implementing the kmeans method. Our algorithm produces the same or com parable (due to the roundoff errors) clustering results to the direct kmeans algorithm. It has significantly superior performance than the direct kmeans algorithm in most cases. The rest of this paper is organized as follows. We review previously proposed approaches for improving the performance of the kmeans algorithms in Section 2. We present our algorithm in Section
3. We describe the experimental results in Section 4 and we conclude with Section 5.
In this section, we briefly describe the direct kmeans algorithm [9, 8, 3]. The number of clusters _ is assumed to be fixed in kmeans clustering. Let the _k prototypes be initialized to one of the input patterns Therefore,
wj=i1,j {,K},l{the direct kmeans clustering ,n}algorithm. Cj is the nth cluster whose value is a disjoint subset of input patterns. The quality of the clustering is determined by the following error function:
The appropriate choice of _ is problem and domain dependent and generally a user tries several values of _. Assuming that there are _ patterns, each of dimension @, the computational cost of a direct kmeans algorithm per iteration (of the repeat loop) can be decomposed into three parts:

The time required for the first for loop in Figure 1 is O(nkd)

The time required for calculating the centroids (second for loop in Figure 1) is O(nd)

The time required for calculating the error function is O(nd)
The number of iterations required can vary in a wide range from a few to several thousand depending on the number of patterns, number of clusters, and the input data distribution. Thus, a direct implementation of the kmeans method can be computationally very intensive. This is especially true for typical data mining
applications with large number of pattern vectors
Figure 1. Direct kmeans clustering algorithm
There are two main approaches described in the literature which can be used to reduce the overall computational requirements of the k means clustering method especially for the distance calculations:

Use the information from the previous iteration to reduce the number of distance calculations. PCLUSTER is a kmeansbased clustering algorithm which exploits the fact that the change of the assignment of patterns to clusters is relatively few after the first little iteration [7]. It uses a heuristic which determines if the closest prototype of a pattern E has been changed or not by using a simple check. If the assignment has not changed, no further distance
calculations are required. It also uses the fact that the movement of the cluster cancroids is small for consecutive iterations (especially after a few iterations).

Organize the prototype vectors in a suitable data structure so that finding the closest prototype for a given pattern becomes more efficient [11, 13]. This problem reduces to finding the nearest neighbor problem for a given pattern in the prototype space. The number of distance calculations using this approach is proportional to iteration. For many applications such as vector quantization, the prototype vectors are fixed. This allows for construction of optimal data structures to find the closest vector for a given input test pattern [11]. However, these optimizations are not applicable to the k means algorithm as the prototype vectors will change dynamically. Further, it is not clear how these optimizations can be used to reduce the time for calculation of the error function (which becomes a substantial component after reduction in the number of distance calculations).
The main intuition behind our approach is as follows. All the prototypes are potential candidates for the closest prototype at the root level. However, for the children of the root node, we may be able to prune the candidate set by using simple geometrical constraints. Clearly, each child node will potentially have different candidate sets. Further, a given prototype may belong to the candidate set of several child nodes. This approach can be applied recursively till the size of the candidate set is one for each node. At this stage, all the patterns in the subspace represented by the subtree have the Using this approach, we expect that the number of distance calculation for the first loop (in Figure 1) will be proportional to where K d
Tree is much smaller than Hk Means This is because the distance calculation has to be performed only with internal nodes (representing many patterns)and not the patterns themselves in most cases. sole candidate as their closest prototype.
Example of pruning achieved by our algorithm. X represents the candidate set. D is the MinMax distance. All the candidates@ which are circled get pruned. The candidate with a square around it is not pruned by our algorithm
This approach can also be used to significantly reduce the time requirements for calculating the prototypes for the next iteration We also expect the time requirement for the second for loop to be proportional to The improvements obtained using our approach are crucially dependent on obtaining good pruning methods for obtaining candidate sets for the next level. We propose to use the following strategys For each candidate _ 4, find the minimum and maximum distances to any point in the subspace Find the minimum of maximum distances, call out all candidates with minimum distance greater

We have evaluated our algorithm on several datasets. We have compared our results ith direct kmeans algorithm interns of the number of performed distance calculations and the total execution time. A direct comparison with other algorithms(such as the PCluster [7] and [13] ) is
not feasible due to availability of their datasets and software. However,
we present some qualitative comparisons. All the experimental results reported are on a IBM RS/6000 runningAIX version 4. The clock speed of the processor is 66 MHz and the memory size is 128 MByte.For each dataset and the number of clusters, we compute the factors K _means and HK means of reduction in distance calculations and overall execution time over the direct algorithm respectively as well as the average number of distance calculations per pattern The number of distance calculations for the direct algorithm All time measurements are in seconds.
Our main aim in this paper is to study the computational aspects of the kmeans method. We used several datasets all of which have been generated synthetically. This was done to study the scaling properties of our algorithm for different values of k and c respectively. For most of our datasets, we found that choosing a leaf size of 64 resulted in optimal or near optimal performance. Further, the overall performance was not sensitive to the leaf
size except when the leaf size was very small.
Table present the performance of our algorithms for different number of clusters and iterations assuming a leaf size of 64. For each combination used, we present the factor reduction in overall time (FRT) and the time of the direct kmeans
algorithm. We also present the factor reduction in distance calculations (FRD) and the average number of distance calculations per pattern (ADC). These
Results show that our algorithm can improve the overall performance of kmeans clustering by an order to two orders of magnitude. The average number of distance calculations required is very small and can vary anywhere from 0.17 to11.17 depending on the dataset and the number of clustersrequired.The results presented in [7] show that their methods result
in factor of 4 to 5 improvements in overall computational time. Our improvements are substantially better. However, we note that the datasets used are different and a direct comparison may not be accurate.
The overall results for 10 iterations
In this paper, we presented a novel algorithm for performingmeans clustering. Our experimental results demonstrated that our scheme can improve the direct k means algorithm by an order to two orders of magnitude in the total
number of distance calculations and the overall time of computation. There are several improvements possible to the basic strategy presented in this paper. One approach will be to restructure the tree every few iterations to further reduce the value of The intuition here is that the earlier iterations provide some partial clustering information. This information can potentially be used to construct the tree such that the pruning is more effective. Another possibility is to add the optimizations related to incremental approaches presented in [7]. These optimizations seem to be orthogonal and can be used to further reduce the number of distance Calculations with using of CMeans Clustering

K. Alsabti, S. Ranka, and V. Singh. An Efficient KMeans Clustering Algorithm. http://www.cise.ufl.edu1997.

T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw Hill Book Company, 1990.

R. C. Dubes and A. K. Jain. Algorithms for Clustering Data.Prentice Hall, 1988.

M. Ester, H. Kriegel, J. Sander, and X. Xu. A DensityBased Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. of the 2nd Intl Conf. on Knowledge Discovery and Data Mining, August 1996.

M. Ester, H. Kriegel, and X. Xu. Knowledge Discovery inLarge Spatial Databases: Focusing Techniques for Efficient

J. Garcia, J. FdezValdivia, F. Cortijo, and R. Molina. Dynamic Approach for Clustering Data. Signal rocessing,44:(2), 1994.

D. Judd, P. McKinley, and A. Jain. Large Scale Parallel Data Clustering. Proc. Intl Conference on Pattern Recognition,
August 1996.

L. Kaufman and P. J. Rousseeuw. Finding Groups in Data:an Introduction to Cluster Analysis. John Wiley & Sons,1990.

K. Mehrotra, C. Mohan, and S. Ranka. Elements of ArtificialNeural Networks. MIT Press, 1996.[10] R. T. Ng and J. Han. Efficient and Effective Clustering Methodsfor Spatial Data Mining. Proc. of the 20th Intl Conf.
on Very Large Databases, Santiago, Chile, pages 1441551994.

V. Ramasubramanian and K. Paliwal. Fast KDimensional Tree Algorithms for Nearest Neighbor Search with Application to Vector Quantization Encoding. IEEE Transactions
on Signal Processing, 40:(3), March 1992.

E. Schikuta. Grid Clustering: An Efficient Hierarchical Clustering Method for Very Large Data Sets. Proc. 13thIntl. Conference on Pattern Recognition, 2, 1996.

J. White, V. Faber, and J. Saltzman. United States Patent No.5,467,110. Nov. 1995.

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases.Proc. of the 1996
ACM SIGMOD Intl Conf. on Management of Data, Montreal, Canada, pages 103114, June 1996.