 Open Access
 Authors : P. Navin Karthi, S. Saranya
 Paper ID : IJERTV2IS120134
 Volume & Issue : Volume 02, Issue 12 (December 2013)
 Published (First Online): 06-12-2013
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Contrast Scrutiny Sandwiched Between Y-Mean and K-Means Algorithms in Fisher’s Iris Data Sets
P. Navin Karthi*1, S. Saranya*2
*1M.E VLSI Design, Department of Electronics and Communication Engineering, K.S.Rangasamy College of Technology, Tiruchengode-637215, TamilNadu, India.
*2M.E VLSI Design, Department of Electronics and Communication Engineering, K.S.Rangasamy College of Technology, Tiruchengode-637215, TamilNadu, India.
ABSTRACT: Cluster analysis is an exploratory data analysis tool for solving classification problems. It plays a major role in many fields, where it is used to group similar data from an available database. Many clustering algorithms are available, but no single algorithm suits every task. This paper presents a comparative performance analysis of the K-means and Y-means algorithms on the Iris flower data set. The experimental results show that the partition-based Y-means algorithm yields better clustering results and a lower run time than the K-means algorithm, and does so in fewer iterations.
Keywords – K-Means Algorithm, Y-Means Algorithm, Cluster Analysis.

INTRODUCTION
The objective of clustering is that the objects within a group be similar to one another and different from the objects in other groups. The greater the similarity within a group and the greater the difference between groups, the better or more distinct the clustering. Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. Clustering is an effective technique for exploratory data analysis and has found applications in a wide variety of areas.
Most existing clustering methods fall into four categories: partitioning, hierarchical, grid-based, and model-based methods. In this paper, we mainly review two algorithms, K-means and Y-means, both of which are examples of partitional methods. The two algorithms are compared on the Iris flower data set, where they are used to cluster the three species of Iris, and the results are obtained in MATLAB.

METHODOLOGY
Clustering is one of the most widely performed analyses on gene expression data. Every clustering algorithm is based on an index of similarity or dissimilarity between data points. Each cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters. The Iris data set covers three different species, and the aim is to classify each species from the common set of measurements. The clustering process differs from one algorithm to another in how it forms groups of similar objects.

THE K-MEANS ALGORITHM
K-means is one of the simplest unsupervised learning algorithms used to partition data objects into clusters. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters). The main idea is to initialize k centroids, one for each cluster. These centroids have to be selected carefully, since their placement affects the end result; the better choice is to place them as far from one another as possible. The K-means algorithm aims at minimizing an objective function, in this case a squared error function. The objective function is

$$J = \sum_{b=1}^{k} \sum_{a=1}^{n} \left\| x_a^{(b)} - c_b \right\|^2$$

where $\left\| x_a^{(b)} - c_b \right\|^2$ is a chosen distance measure between a data point $x_a^{(b)}$ and the cluster center $c_b$, and $J$ is an indicator of the distance of the n data points from their respective cluster centers.
The algorithm is composed of the following steps:

1. Place k points into the space represented by the objects that are being clustered. These k points represent the initial group of centroids.

2. Assign each object to the group that has the closest centroid.

3. When all objects have been assigned, recalculate the positions of the k centroids.

4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
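The steps above can be sketched in a few lines of code. The paper's experiments were run in MATLAB and no source is given, so the following is only an illustrative Python sketch using the standard library. The function names, the `init` and `seed` parameters, the toy points, and the choice to keep an empty cluster's old centroid are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Step 2: index of the closest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda b: squared_distance(p, centroids[b]))
        for p in points
    ]


def update(points, labels, centroids):
    """Step 3: move each centroid to the mean of the points assigned to it."""
    moved = []
    for b, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == b]
        if not members:
            moved.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        moved.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return moved


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Steps 1 to 4. Returns (centroids, labels, value of the objective J)."""
    if init is None:
        init = random.Random(seed).sample(points, k)  # step 1: random start
    centroids = [tuple(c) for c in init]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        moved = update(points, labels, centroids)
        if moved == centroids:  # step 4: centroids no longer move
            break
        centroids = moved
    labels = assign(points, centroids)
    objective = sum(
        squared_distance(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return centroids, labels, objective


# Toy two-dimensional points (made up for illustration, not Iris data).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels, objective = kmeans(points, 2, init=[(1.0, 1.0), (5.0, 5.0)])
```

With one starting centroid in each group, the first three points end up in one cluster and the last three in the other, and the returned objective is the squared error that the algorithm minimizes.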
The algorithm is highly sensitive to the randomly selected initial cluster centers. The K-means algorithm can be run multiple times to reduce this effect.
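A small one-dimensional example may make this sensitivity concrete. The sketch below is illustrative only: the helper names `lloyd_1d` and `best_of`, the toy values, and the two starting configurations are assumptions, not taken from the paper. It runs K-means from a poor and a good initialization and then keeps the run with the lowest squared error.

```python
def lloyd_1d(values, centroids, max_iter=100):
    """Plain K-means on 1-D data from the given initial centroids.

    Returns (final centroids, sum of squared errors).
    """
    centroids = list(centroids)
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(
                range(len(centroids)), key=lambda b: (v - centroids[b]) ** 2
            )
            groups[nearest].append(v)
        # Assumption: an empty group keeps its previous centroid.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    sse = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, sse


def best_of(values, initializations):
    """Run K-means once per initialization and keep the lowest-error result."""
    return min(
        (lloyd_1d(values, init) for init in initializations), key=lambda r: r[1]
    )


values = [0, 1, 2, 10, 11, 12, 20, 21, 22]  # three obvious groups
poor = lloyd_1d(values, [0, 1, 2])          # all starts inside one group: error 154.5
good = lloyd_1d(values, [0, 10, 20])        # one start per group: error 6.0
best = best_of(values, [[0, 1, 2], [0, 10, 20]])
```

The poor start converges to a solution that merges two of the groups, while the good start recovers all three. Keeping the best of several runs is the simplest way to reduce the effect of an unlucky initialization.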
Strengths of K-means include:
Simplicity and applicability to a wide variety of data types. It is also quite efficient, even though multiple runs are often performed.
It provides the best results when the clusters are dense and the distinction between clusters is obvious.
It is efficient and scalable when the data set is large.
Weaknesses of K-means include:
It depends on the initial centroids and on the final number of clusters, which the user must supply, and it can suffer from degeneracy (empty clusters).
It is not suitable for non-convex clusters or for clusters whose sizes vary widely.
It is sensitive to noise points, marginal points, and isolated points.


THE Y-MEANS CLUSTERING
Y-means is based on the K-means algorithm. The main difference between the two is the ability of Y-means to decide the number of clusters autonomously, based on the statistical nature of the data. The number of final clusters that the algorithm produces is therefore self-defined rather than a user-defined constant, as it is in K-means. This overcomes one of the main drawbacks of K-means: a user-defined k cannot guarantee a suitable partition of a data set with an unknown distribution, and a random initial value of k usually results in poor clustering.
Y-means can find an appropriate final value of k (the number of centroids) that is independent of the initial k, by applying a sequence of splitting, deleting, and merging operations to the clusters, even without knowledge of the distribution of the data.
To eliminate the effect of dominating features caused by differences in feature range, the data set is first normalized. Next, the standard K-means algorithm is run over the training data. The final number of clusters is independent of the initial k, and the final results are likewise independent of the selection of the k initial centroids.
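The text does not say which normalization is applied. As one plausible choice, the sketch below rescales every feature to zero mean and unit standard deviation (z-score). The function name and the sample rows are hypothetical.

```python
import statistics


def zscore_normalize(rows):
    """Rescale every feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Population standard deviation; a constant column falls back to 1.0
    # so that the division below is always defined.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Two made-up features with very different ranges.
rows = [(5.0, 0.2), (6.0, 0.4), (7.0, 0.6)]
normalized = zscore_normalize(rows)
```

After rescaling, both columns span the same range, so neither feature dominates the distance calculation merely because of its units.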
The standard K-means algorithm uses Euclidean distance as its distance function. The Y-means algorithm uses the following function to identify a single outlier per iteration for each cluster. The Y-means algorithm iteratively identifies outliers and converts them into new centroids.
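The outlier function itself does not appear in the text above, so the following sketch should not be read as the rule the paper uses. It only illustrates the idea of one splitting pass under an assumed rule: the farthest member of a cluster is treated as an outlier when its distance from the centroid exceeds `threshold` times the root-mean-square distance of that cluster. The function names, the threshold value, and the toy points are all assumptions.

```python
import math


def distance(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def split_outliers(points, centroids, threshold=2.0):
    """One splitting pass: at most one outlier per cluster becomes a new centroid."""
    # Assign every point to its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda b: distance(p, centroids[b]))
        clusters[nearest].append(p)

    new_centroids = []
    for centroid, members in zip(centroids, clusters):
        if not members:
            continue  # empty clusters are deleted
        new_centroids.append(centroid)
        dists = [distance(p, centroid) for p in members]
        sigma = math.sqrt(sum(d * d for d in dists) / len(dists))
        farthest = max(range(len(members)), key=lambda i: dists[i])
        # Assumed outlier rule (not the paper's function).
        if sigma > 0 and dists[farthest] > threshold * sigma:
            new_centroids.append(members[farthest])
    return new_centroids


# A tight group near the origin plus one distant point (toy data).
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
    (0.0, -0.1), (0.1, 0.1), (3.0, 3.0),
]
first_pass = split_outliers(points, [(0.0, 0.0)])
second_pass = split_outliers(points, first_pass)
```

Starting from a single centroid, the first pass promotes the distant point to a new centroid and the second pass finds nothing further to split. A complete Y-means implementation would also re-run K-means between passes and merge overlapping clusters, which this sketch omits.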


EXPERIMENTAL RESULT
The experimental work is done in the MATLAB programming language. An important step in most clustering processes is to select a distance measure, which determines how the similarity between data objects is calculated. This choice influences the shape of the clusters, as some data objects may be close to one another according to one distance and farther away according to another. A further distinction is whether the clustering uses symmetric or asymmetric distances.
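The effect of the distance measure can be seen with two candidate points and two common metrics. This is a minimal sketch with made-up coordinates; both metrics shown here are symmetric.

```python
import math


def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest(query, candidates, metric):
    """Name of the candidate closest to `query` under the given metric."""
    return min(candidates, key=lambda name: metric(query, candidates[name]))


query = (0, 0)
candidates = {"A": (3, 3), "B": (0, 5)}

by_euclidean = nearest(query, candidates, euclidean)  # "A": about 4.24 versus 5
by_manhattan = nearest(query, candidates, manhattan)  # "B": 6 versus 5
```

The same query point is nearest to A under Euclidean distance and nearest to B under Manhattan distance, so a clustering algorithm would assign it differently depending on the metric chosen.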
The Iris genus has many species, of which three, Iris setosa, Iris virginica, and Iris versicolor, are taken for clustering from Fisher's Iris data set. The data set contains fifty samples of each species, each recording the length and width of the sepals and petals. When the data are clustered, one cluster typically contains Iris setosa while another contains both Iris virginica and Iris versicolor, which are not separable without the species information. The experiments therefore assess each algorithm by how well it clusters all three species and by its run time. The K-means algorithm classifies the species within a user-defined number of iterations, and its result depends on the initial centroids and on the final number of clusters. The output figures show the K-means and Y-means clusterings for different values of N.
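One way to check how clusters line up with species is to cross-tabulate cluster labels against the known species labels. The sketch below is an assumption about how such a check could be done, not the paper's evaluation procedure, and the labels are toy values made up to mimic the situation described above (one pure setosa cluster and one mixed cluster). They are not experimental results.

```python
from collections import Counter


def cross_tabulate(cluster_labels, species_labels):
    """Count how many samples of each species fall in each cluster."""
    return Counter(zip(cluster_labels, species_labels))


def purity(cluster_labels, species_labels):
    """Fraction of samples belonging to the majority species of their cluster."""
    table = cross_tabulate(cluster_labels, species_labels)
    best_per_cluster = {}
    for (cluster, _species), count in table.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), count)
    return sum(best_per_cluster.values()) / len(cluster_labels)


# Toy labels: cluster 0 is all setosa, cluster 1 mixes the other two species.
clusters = [0, 0, 0, 1, 1, 1, 1, 1, 1]
species = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3

table = cross_tabulate(clusters, species)
score = purity(clusters, species)
```

A purity well below 1 for a two-cluster result like this one signals that two species have been merged, which is the case a three-cluster solution is meant to resolve.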
The outputs obtained from K-means clustering are shown in Fig 1, and Fig 2 shows the iteration when N=40.
Fig 1: K-MEAN CLUSTERING
Fig 2: K-MEAN CLUSTERING WHEN N=40
Fig 3: Y-MEAN CLUSTERING
Fig 4: Y-MEAN CLUSTERING WHEN N=25
The Y-means clustering algorithm overcomes the drawbacks of K-means clustering by working on a training set of normalized input data. It then carries out splitting and merging, together with the deletion of empty clusters, to avoid degeneracy. Y-means clusters the data set in fewer iterations (N=25) than K-means, as shown in Fig 3 and Fig 4. The red, blue, and green colors indicate the three species of Iris.
TABLE 1: K-MEAN RESULTS
TABLE 2: Y-MEAN RESULTS
Fig 5: AVERAGE RUN TIME
The results of both algorithms are analyzed based on the number of data points and the computational time of each algorithm. The performance of the partitioning algorithms is analyzed through experimental results on the Iris data set. Each algorithm clusters the data points according to their distribution, which may take arbitrary shapes. Time complexity analysis is the part of computational complexity theory that describes an algorithm's use of computational resources; here, the worst and best cases are taken as the maximum and minimum run times. From Table 2, the maximum and minimum times taken by the Y-means algorithm are 172 and 156, respectively. Likewise, from Table 1, 221 and 196 are the maximum and minimum times taken by the K-means algorithm.
The performance of the algorithms has also been analyzed over several runs with different numbers of input data points (300 data points, 400 data points, etc.) and with 10 and 15 clusters; those results are not shown, but they were found to be adequate. Figure 5 plots the average execution times taken from Tables 1 and 2. The figure shows a clear difference between the run times of the two algorithms: the average execution time of the Y-means algorithm is much lower than that of the K-means algorithm.
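Minimum, maximum, and average run times of this kind can be collected with a small timing harness. The sketch below is an illustration only: the paper does not describe its timing procedure or state its units, so the function name `time_runs`, the use of milliseconds, and the placeholder workload are assumptions.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call `func` repeatedly and summarise its wall-clock run time.

    Times are reported in milliseconds (an assumption of this sketch).
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
    }


def workload():
    """Placeholder standing in for one clustering run."""
    return sum(i * i for i in range(10_000))


summary = time_runs(workload, repeats=3)
```

Passing a function that performs one complete clustering run in place of `workload` would yield the three figures needed for a per-algorithm comparison.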

CONCLUSIONS
Experimental results show that Y-means performs very well compared with K-means. This paper presents a comparative analysis of two unsupervised clustering algorithms, Y-means and K-means, for data classification. We analyzed the overall performance of the algorithms using the three species of the Iris data set as the input for clustering. In this experiment, Y-means produced a better clustering of the data set than K-means, in fewer iteration steps and with an improved run time. Our future work will concentrate on ways to improve not only the performance of the algorithm but also the accuracy and efficiency of the clustering.