 Authors : P Kiran Kumar Reddy
 Paper ID : IJERTV4IS070474
 Volume & Issue : Volume 04, Issue 07 (July 2015)
 DOI : http://dx.doi.org/10.17577/IJERTV4IS070474
 Published (First Online): 17-07-2015
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Evaluation of Clustering Tendency by Enhanced Visual Access Tendency Method in K-Means Clustering Approach
P Kiran Kumar Reddy
Department of CSE
Narayana Engg.College, Gudur, A.P, INDIA
Abstract – Data clustering is an emerging technique in various data mining applications, and it plays a vital role in classifying unlabeled data objects. k-means is one of the most effective data partitioning methods, but it may suffer from unknown clustering tendency. The proposed method extends the k-means clustering algorithm with the Visual Access Tendency (VAT) procedure, and is called VAT-based k-means clustering. This hybrid approach speeds up the clustering results. k-means is an unsupervised approach and can solve the clustering problem of unlabeled data. VAT is a data visualization method that determines the clustering tendency of unlabeled data. The existing system takes more run time because it needs several iterations, whereas the proposed system takes a single step with much less run time. Clustering validity is checked for every iterated k-means clustering by Dunn's Index; a higher Dunn's Index indicates more exact clustering. The key contributions of this paper are to find the prior tendency for k-means based clustering by Visual Access Tendency and to obtain the clustering results in a single step instead of several trials. Results are tested on synthetic and real data sets and show that the proposed method improves the clustering results with respect to run time.
Keywords – Clustering Analysis, Clustering Tendency, k-means, Dunn's Index, Visual Access Tendency (VAT).

INTRODUCTION
Clustering has been widely used in data analysis. Data clustering [26,27,28,29] is an emerging technique in various data mining applications [18,19], and it plays an important role in classifying unlabeled data objects. Among the existing traditional clustering algorithms, k-means [21,23] is one of the most effective data partitioning methods. The k-means algorithm [1] is a classical approach and has been greatly successful in many practical domains [10,11,14,15]. However, this clustering method may suffer from unknown clustering tendency [13]. Thus, several trials of execution (at k = 2, 3, 4, ..., where k is the number of clusters) are needed for the k-means algorithm to obtain the correct k value. Assessing the clustering tendency and determining the quality of clustering results is time-consuming in k-means. Several methods have been investigated for automatic detection of the cluster count (k); among these, Visual Access Tendency (VAT) [20] is an optimal choice for detecting the clustering tendency, i.e., the number of clusters. Therefore, the proposed work assesses the clustering tendency by VAT within the k-means clustering algorithm, and this proposed approach is known as the VAT-based-k-means clustering algorithm. This paper carries out experiments on several datasets to demonstrate the effectiveness of the proposed hybrid approach. The paper builds on extensions of k-means based clustering [22], aiming to extract tight clusters and so obtain two benefits: first, to reduce run time, and second, to improve the quality of the clustering results, since the known tendency value is used in the k-means clustering algorithm. The major objective of our proposed method is to make the best use of the tendency value in k-means based clustering algorithms to improve performance. Assessment of tendency is one of the important criteria during clustering analysis, and exact tendency values are inferred from VAT techniques.
Related work on k-means based clustering is presented in Section II. The concept of Enhanced Visual Access Tendency is discussed in Section III. The proposed method is described in Section IV. Section V presents results and discussion, and conclusions and future work are given in the last section.

K-MEANS BASED CLUSTERING
Among clustering algorithms, k-means is a simple and efficient method. The aim of k-means is to generate fast and efficient clustering results. However, it generates the data partition without knowledge of the prior clustering tendency; i.e., the value of k is unknown and is supplied only approximately by the user. The method may therefore suffer from unknown clustering tendency, and the k-means algorithm must be run several times, as trials, to find the best data partitioning, because the initial number of clusters (k) is unknown. If the value of the clustering tendency is known, then it is enough to run the algorithm a single time instead of several times.
k-means is an unsupervised approach and can solve the clustering problem of unlabeled data. The procedure of k-means [2] follows a classical approach, classifying the dataset through an assumed number of clusters. The basic idea is to create k centers by randomly selecting k initial objects, one for each cluster. Each of the remaining objects is then assigned, one at a time, to the closest cluster [3]. The centers (means) are updated after each assignment, and this continues until all objects have been assigned to their respective
clusters. The k-means approach minimizes the objective function given by Eq. 1:

J = Σ_{j=1..k} Σ_{i=1..n} ||x_i^(j) − c_j||²    (1)

where ||x_i^(j) − c_j||² is the squared distance between a data point x_i^(j) (the i-th point assigned to cluster j) and the cluster center c_j.
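The classical procedure described above can be sketched in code. The following is an illustrative sketch of Lloyd's k-means minimizing the Eq. 1 objective, not the authors' implementation; the function name, random initialization, and iteration cap are assumptions:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means: minimize the sum of squared distances to centroids (Eq. 1)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k randomly chosen initial centers
    for _ in range(max_iter):
        # Assignment step: each object joins its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids no longer move
            break
        centroids = new_centroids
    return centroids, clusters
```

On well-separated data this converges in a few iterations; the need to guess k up front is exactly the tendency problem the paper addresses.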
Three important limitations are identified in the present system: the clustering tendency is unknown; the exact number of clusters cannot be generated without knowledge of the tendency; and external interference is required to specify the termination condition. Therefore, for the purpose of assessing tendency, we propose a specific visualization method for k-means based clustering algorithms that detects the exact tendency value.

VISUAL ACCESS TENDENCY (VAT) AND ENHANCED VAT METHODS
Visual Access Tendency (VAT) [7,8,9] is a data visualization method that determines the clustering tendency of unlabeled data. The k-means algorithm suffers from unknown clustering tendency; therefore, it is merged with VAT to address the assessment of clustering tendency. The proposed hybrid approach is known as the VAT-based k-means clustering algorithm.
The k-means clustering algorithm uses VAT to determine the number of valid clusters. VAT reorders the dissimilarity matrix [25] D to D* (n × n, where n is the number of data objects) using Prim's logic, and it generates the image of D*, I(D*), known as the VAT image. VAT reveals the hidden clustering structure through the square-shaped dark blocks of I(D*): the clustering tendency is accessed by counting the number of square-shaped dark blocks along the diagonal of the VAT image. The results of VAT depend on the dissimilarity or similarity features of the objects, so the way the dissimilarities are computed is most important in the VAT algorithm. Generally, the dissimilarity matrix in VAT is computed in Euclidean space. From the statistical evidence of [4], the cosine metric is more robust for computing similarity or dissimilarity features. Therefore, the present paper proposes to use cosine space instead of Euclidean space in an Enhanced VAT (EVAT). The EVAT uses the New Dissimilarity (ND), whose construction is shown in Eq. 2:

ND(x, y) = 1 − (x · y) / (||x|| ||y||)    (2)
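The ND measure of Eq. 2 and the Prim's-style VAT reordering can be sketched as follows. This is an illustrative sketch, not the authors' code; the function names and the choice of starting object (an endpoint of the largest dissimilarity, as in the original VAT paper [9]) are assumptions:

```python
import math

def nd(x, y):
    """Cosine-space dissimilarity of Eq. 2: ND(x, y) = 1 - (x.y)/(|x||y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def vat_reorder(D):
    """Reorder dissimilarity matrix D to D* via a Prim's-style MST ordering (VAT)."""
    n = len(D)
    # Start from one endpoint of the largest dissimilarity in D.
    i0 = max(range(n), key=lambda i: max(D[i]))
    order, remaining = [i0], set(range(n)) - {i0}
    while remaining:
        # Next object: the unvisited one closest to any already-ordered object.
        j = min(remaining, key=lambda r: min(D[o][r] for o in order))
        order.append(j)
        remaining.remove(j)
    # Permute rows and columns of D by the visiting order to obtain D*.
    return [[D[a][b] for b in order] for a in order]
```

Rendering D* as a grayscale image then shows one dark diagonal block per cluster.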
Here, the clustering tendency results of VAT are used in k-means based clustering to achieve effective clustering results. Fig. 1 shows the steps of the proposed approach, which addresses the clustering tendency by visual methods and produces quality clustering results at a known clustering tendency. The advantages of the proposed method are:

- The hybrid approach generates quality clustering results because the clustering tendency is known.
- A substantial amount of execution time is saved because the approach need not check the clustering results at different k values; they are generated directly at the known k value.
- The clustering tendency problem is detected effectively by the proposed EVAT for some typical datasets.
Fig. 1 VAT-based-k-means Clustering


PROPOSED METHOD
The proposed hybrid approach consists of two major steps. The first step finds the value of the cluster count by assessing the clustering tendency of the unlabeled data through the data visualization method (VAT). The second step applies the standard k-means based clustering algorithm to the dataset using the known clustering tendency obtained from the first step, to discover fast and efficient clustering results.
The algorithm below presents the proposed hybrid approach, in which the clustering tendency is detected in Step 1 and the clustering results are discovered in Step 2. Step 2 follows the k-means procedure; hence, the hybrid approach is known as the VAT-based-k-means clustering algorithm.
Algorithm: VAT-Based-k-means Clustering (for an unlabeled dataset)
Step 1:
a. Find the reordered dissimilarity image (I) using VAT/EVAT.
b. Apply an image threshold to I.
c. Find histograms by applying the consecutive operations of 2-D FFT, inverse FFT, and correlation.
d. Extract the cluster count k either from the number of histograms or from the square-shaped dark blocks of the VAT/EVAT image.
Step 2:
a. Place k initial points (k is known from Step 1) into the space represented by the objects being clustered, i.e., the initial group centroids.
b. Assign each data object to the group that has the closest centroid.
c. Once the objects are assigned to their closest clusters, recalculate the positions of the k centroids.
d. Repeat Steps b and c until the centroids no longer move. This produces a separation of the objects into groups.
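The FFT/histogram extraction of Steps 1c-1d is not specified in full detail here. As an illustration only, a simpler diagonal-scan variant, assuming a fixed darkness threshold (an assumption not made in the paper, which uses the histogram route), shows how k can be read off the reordered image D*:

```python
def estimate_k(Dstar, threshold=0.5):
    """Simplified Step 1: count square dark blocks along the diagonal of D*.
    Entries below `threshold` are treated as 'dark'; a bright entry between
    consecutive diagonal neighbours marks the boundary of a new block."""
    n = len(Dstar)
    k = 1  # at least one cluster
    for i in range(1, n):
        if Dstar[i - 1][i] > threshold:  # bright gap -> a new block starts here
            k += 1
    return k
```

The value returned here plays the role of the cluster count k passed to Step 2, removing the need to rerun k-means at several trial k values.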


RESULTS AND DISCUSSION
The proposed study presents experimental results on the synthetic data sets S1 to S4, shown in Fig. 2, and on real data sets collected from the UC Irvine Machine Learning Repository [5], to evaluate the performance of the method with respect to quality and run time.
Fig. 2 Synthetic Datasets (S1 to S4)
In the result analysis, the existing system does not have the prior value of the clustering tendency; external interference is required to supply it, so the results obtained by the existing system may or may not have a good Dunn's Index [12]. A higher value of Dunn's Index indicates a good number of valid clusters for the given data; Dunn's Index is a metric for evaluating correct partitioning [6]. The k-means clustering algorithm is run several times until a good Dunn's Index is obtained, so the problem of the existing system is run time. Therefore, the proposed work first solves the tendency problem by extracting the square-shaped dark blocks, and second retrieves the k-means based clustering results for that tendency. The procedure outputs the VAT image for the input dataset; a 2-D FFT, inverse FFT, and correlation [16,17] are then applied to the resulting VAT image. Finally, the cluster count, referred to as the clustering tendency, is extracted.
Fig. 3: Histograms for Synthetic data for clusters count (k)
The k-means clustering algorithm cannot detect the clustering tendency by itself; external user interference is required, and it is intractable for the user to determine the suitable number of clusters. Therefore, the proposed hybrid system uses VAT to detect the number of clusters. Figures 4 and 5 show the outputs of VAT for the synthetic and real datasets, respectively. The VAT image gives a more informative assessment of the clustering tendency through its square-shaped dark blocks: each square-shaped dark block represents a single cluster in the VAT image.
Fig. 4 VAT Images for Synthetic Datasets
VAT can also detect the number of clusters using VAT histograms, which are constructed by applying a series of three steps to the VAT image: 2-D FFT, inverse FFT, and correlation.
Fig.5 VAT Images for Real datasets
The clustering tendency is unknown in the k-means clustering algorithm; hence, this section conducts experiments for this algorithm at different k values and performs post-validation of the clustering results by Dunn's Index (DI) to determine the best clustering. Tables 1 and 2 illustrate the k-means clustering results for the synthetic and real datasets, respectively, where the clustering tendency, i.e., the k value, is unknown. The best k values in the existing method are found using the Dunn's Index value; the maximum value of DI indicates the best clustering.
Dunn's Index compares the pairwise distances of data objects within clusters with the distances between clusters. We use Eq. 3:

DI = min_{i ≠ j} δ(C_i, C_j) / max_{m} Δ(C_m)    (3)

where δ(C_i, C_j) is the minimum distance between objects of clusters C_i and C_j, and Δ(C_m) is the diameter (maximum pairwise distance) of cluster C_m.
The problem with this approach is that it takes more run time to validate the clustering tendency and the clustering results, because the clustering algorithm must be run multiple times, for different k values, since the k value is unknown.
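The Dunn's Index validation described above can be sketched as follows. This is an illustrative implementation using Euclidean distance; the function name and input format (a list of clusters, each a list of points) are assumptions:

```python
def dunn_index(clusters):
    """Dunn's Index (Eq. 3): minimum inter-cluster distance divided by the
    maximum cluster diameter; larger values indicate a better partition."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # Delta: the largest within-cluster pairwise distance (cluster diameter).
    diam = max(d(p, q) for cl in clusters for p in cl for q in cl if p != q)
    # delta: the smallest distance between objects of different clusters.
    sep = min(d(p, q)
              for i, ci in enumerate(clusters)
              for cj in clusters[i + 1:]
              for p in ci for q in cj)
    return sep / diam
```

Computing this score for each candidate partition is what makes the trial-and-error k-means procedure slow, since every trial k requires a fresh clustering run plus validation.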
The proposed hybrid method obtains the value of the clustering tendency from VAT; thus, the hybrid method, namely VAT-based-k-means clustering, is faster. Tables 3 and 4 show the Dunn's Index and run-time results of the VAT-based-k-means algorithm for the synthetic and real datasets, where the clustering tendency is known.
Table 1: Dunn's Index and Run-time Results for Synthetic Datasets (k-means Based Clustering; Clustering Tendency is Unknown)

Datasets | DI (C=2) | DI (C=3) | DI (C=4) | DI (C=5) | Run time (Sec)
S1       | 0.44     | 0.03     | 0.02     | 0.02     | 0.19
S2       | 0.41     | 0.79     | 0.04     | 0.03     | 0.25
S3       | 0.19     | 0.31     | 0.37     | 0.01     | 0.26
S4       | 0.02     | 0.02     | 0.01     | 0.22     | 0.30
Table 2: Dunn's Index and Run-time Results for Real Datasets (k-means Clustering; Clustering Tendency is Unknown)

Datasets | DI (C=2) | DI (C=3) | DI (C=4) | DI (C=5) | Run time (Sec)
Iris     | 0.06     | 0.05     | 0.01     | 0.08     | 0.59
Wine     | 0.02     | 0.01     | 0.02     | 0.02     | 0.25
Vote     | 0.19     | 0.20     | 0.14     | 0.14     | 0.35
Glass    | 0.03     | 0.04     | 0.01     | 0.06     | 0.27
Table 3: Dunn's Index and Run-time Results for Synthetic Datasets (Proposed Approach; Clustering Tendency is Known)

Synthetic Datasets | Dunn's Index (Clustering Tendency C extracted by proposed approach) | Run time (Sec)
S1 | 0.44 (C=2) | 0.09
S2 | 0.79 (C=3) | 0.14
S3 | 0.37 (C=4) | 0.21
S4 | 0.22 (C=5) | 0.24
Table 4: Dunns Index and Runtime Results for Real Time Datasets (Proposed Approach Clustering Tendency is Known)
Synthetic Datasets
Dunns Index (Clustering Tendency C is extracted from proposed approach)
Runtime (Sec)
Iris
0.06(C=2)
0.19
Wine
0.02 (C=2)
0.14
Vote
0.20 (C=3)
0.18
Glass
0.04(C=3)
0.10
Fig. 6 shows the run-time comparison between the proposed VAT-based-k-means clustering and plain k-means clustering. The comparison shows that the hybrid method is faster.
Fig. 6 Run-time Comparison between k-means and VAT-based-k-means clustering

CONCLUSIONS
The proposed hybrid approach performs well with respect to both visual assessment of clustering tendency and clustering results. This work evaluates the performance of the proposed method by two measures, namely Dunn's Index and run time. The experimental results demonstrate that VAT-based-k-means is efficient and fast, although the VAT step itself requires additional run time for the assessment of clustering tendency. Experimental results are obtained on various synthetic and real datasets, and run time and Dunn's Index values are evaluated and compared for both the existing and the proposed systems. According to the results analysis, the proposed system requires less time than the existing system and also produces higher-quality clustering results, as observed from the Dunn's Index values: a higher Dunn's Index indicates good clustering results. The future scope of this work is to obtain better-indexed clustering results by applying sampling and spectral techniques within the proposed method.
REFERENCES
[1] Jain, A. K.; Murty, M. N.; Flynn, P. J.: Data Clustering: A Review. ACM Computing Surveys, 31(3), 1999, pp. 264-323.
[2] Jain, A. K.: Data Clustering: 50 Years Beyond k-means. Pattern Recognition Letters, 31(8), 2010, pp. 651-666.
[3] Qinpei, Z.; Pasi, F.: Centroid Ratio for a Pairwise Random Swap Clustering Algorithm. IEEE Transactions on Knowledge and Data Engineering, 26(5), 2014.
[4] Senoussaoui, M.; Kenny, P.; Stafylakis, T.: A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization. IEEE Transactions on Audio, Speech, and Language Processing, 22(1), 2014, pp. 217-227.
[5] http://archive.ics.uci.edu/ml/datasets.html
[6] Cai, D.; He, X.; Han, J.: Document Clustering Using Locality Preserving Indexing. IEEE Transactions on Knowledge and Data Engineering, 17(2), 2005, pp. 1624-1637.
[7] Wang, L.; Nguyen, T.; Bezdek, J.; Leckie, C.; Rammohanarao, K.: iVAT and aVAT: Enhanced Visual Analysis for Cluster Tendency Assessment. In Proc. PAKDD, India, June 2010.
[8] Havens, T. C.; Bezdek, J. C.: An Efficient Formulation of the Improved Visual Assessment of Cluster Tendency. IEEE Transactions on Knowledge and Data Engineering, Nov. 2011.
[9] Bezdek, J.; Hathaway, R.: VAT: A Tool for Visual Assessment of (Cluster) Tendency. In Proc. IJCNN, Honolulu, HI, 2002, pp. 2225-2230.
[10] Ester, M.; Kriegel, P.; Sander, J.; Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Databases with Noise. Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 226-231.
[11] Wang, W.; Yang, J.; Muntz, R.: STING: A Statistical Information Grid Approach to Spatial Data Mining. Int. Conf. on Very Large Data Bases, pp. 186-195.
[12] Havens, T. C.; Bezdek, J. C.; Keller, J. M.; Popescu, M.: Dunn's Cluster Validity Index as a Contrast Measure of VAT Images. Int. Conf. IEEE, 2008.
[13] Bezdek, J. C.; Hathaway, R. J.; Huband, J.: Visual Assessment of Clustering Tendency for Rectangular Dissimilarity Matrices. IEEE Trans., Vol. 15, No. 5, 2007, pp. 890-903.
[14] Dhillon, I.; Modha, D.; Spangler, W.: Visualizing Class Structure of Multidimensional Data. Proc. 30th Symp. Interface: Computing Science and Statistics, 1998.
[15] Tran-Luu, T.: Mathematical Concepts and Novel Heuristic Methods for Data Clustering and Visualization. PhD dissertation, Univ. of Maryland, College Park, 1996.
[16] Breitenbach, M.; Grudic, G.: Clustering through Ranking on Manifolds. Proc. 22nd Int'l Conf. Machine Learning (ICML), 2005.
[17] Cattell, R. B.: A Note on Correlation Clusters and Cluster Search Methods. Psychometrika, Vol. 9, No. 3, 1944, pp. 169-184.
[18] Sneath, P.: A Computer Approach to Numerical Taxonomy. J. General Microbiology, Vol. 17, 1957, pp. 201-226.
[19] Rousseeuw, P. J.: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Computational and Applied Math., Vol. 20, 1987, pp. 53-65.
[20] Havens, T. C.; Bezdek, J. C.; Keller, J. M.; Popescu, M.; Huband, J. M.: Is VAT Really Single Linkage in Disguise? Pattern Recognition Letters, 2008.
[21] Pena, J. M.; Lozano, J. A.; Larranaga, P.: An Empirical Comparison of Four Initialization Methods for the k-means Algorithm. Pattern Recognition Letters, Vol. 20, No. 10, 1999, pp. 1027-1040.
[22] Xu, R.; Wunsch, D.: Survey of Clustering Algorithms. IEEE Trans. Neural Networks, Vol. 16, No. 3, 2005, pp. 645-678.
[23] Xu, Junling; Xu, Baowen; Zhang, Weifeng; Hou, Jun: Stable Initialization Scheme for K-means Clustering. IEEE Trans., Vol. 6, No. 1, 2009.
[24] Savitzky, A.; Golay, M. J. E.: Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry, Vol. 36, No. 8, 1964, pp. 1627-1639.
[25] Havens, T.; Bezdek, J.; Keller, J.; Popescu, M.: Clustering in Ordered Dissimilarity Data. Technical Report, Univ. of Missouri, 2007.
[26] Maimon, O.; Rokach, L.: Decomposition Methodology for Knowledge Discovery and Data Mining. World Scientific, 2005, pp. 90-94.
[27] Bezdek, J. C.; Pal, N. R.: Some New Indices of Cluster Validity. IEEE Trans. Systems, Man and Cybernetics, Vol. 28, No. 3, 1998, pp. 301-315.
[28] Tibshirani, R.; Walther, G.; Hastie, T.: Estimating the Number of Clusters in a Data Set via the Gap Statistic. J. Royal Statistical Soc. B, Vol. 63, 2001, pp. 411-423.
[29] Calinski, R. B.; Harabasz, J.: A Dendrite Method for Cluster Analysis. Comm. in Statistics, Vol. 3, 1974, pp. 1-27.