 Open Access
 Authors : Yuvraj Sase, Prof. Jaypal P. C.
 Paper ID : IJERTV3IS111103
 Volume & Issue : Volume 03, Issue 11 (November 2014)
Published (First Online): 26-11-2014
ISSN (Online): 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Improved K-Means Algorithm for Categorical Dataset
Mr. Yuvraj Sase
Department of Computer Engineering, Vishwabharti Academy's College of Engineering, Ahmednagar, India
Prof. Jaypal P. C.
Department of Computer Engineering, Vishwabharti Academy's College of Engineering, Ahmednagar, India
Abstract— The K-Means algorithm is the most popular clustering algorithm. It classifies objects, based on their attributes/features, into K groups, where the number of clusters K is supplied by the user. However, there is no rule of thumb for calculating K: the user must find it by trial and error, and it is very hard and inefficient to check and analyze every possible input and pick the best among them. An algorithm independent of K is therefore needed. For this reason an improved K-Means algorithm was proposed that does not require K as input, but the improved K-Means algorithm does not work for categorical datasets, and the real world generates a lot of categorical data. A K-Means that works for categorical datasets is needed. This paper proposes an improved K-Means algorithm that works for categorical datasets.
Keywords— K-Means algorithm, clustering, categorical, numerical, distance, data points, grouping, dissimilarity, centroid, mean

INTRODUCTION
In today's world, data is generated in every sector: banking, hospitals, trade marketing, social media, and so on. These sectors produce both categorical and numerical data; that is, there are two types of datasets, numerical and categorical. A numerical dataset contains numeric data such as 5, 10, 15, while a categorical dataset contains data that can be divided into categories; for example, a color attribute takes categorical values such as red, green, and yellow.
There is always a need to categorize such data, for example selecting customers by area or grouping patients by disease. Many clustering algorithms have been proposed for this purpose. Clustering is the process of grouping objects according to their dissimilarity, i.e., the distance between objects. Among these algorithms, K-Means is the most popular. The K-Means clustering algorithm divides data into K clusters, where K is a user input, but the user has no rule of thumb for calculating this K value. The user must find it by trial and error, checking every possible value of K, analyzing the result produced by each value, and picking the best one. This is a very inefficient and time-consuming process, so an algorithm independent of K was needed. For this purpose an improved K-Means algorithm was proposed that finds clusters without a K value. However, the improved K-Means algorithm does not work for categorical data. This paper focuses mainly on improved K-Means for categorical data. The main idea is to convert the categorical dataset into a numerical dataset, so that each categorical value obtains a numeric value that the improved K-Means algorithm can use to form clusters. In this way, any type of data can be grouped by the proposed algorithm.
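The trial-and-error choice of K described above can be illustrated with a minimal sketch (a plain Lloyd's-style K-Means on a small made-up 2-D dataset; neither the implementation nor the data comes from the paper):

```python
import numpy as np

def kmeans(data, k, iters=50, seed=0):
    """Plain K-Means (Lloyd's algorithm): returns centroids, labels and the
    within-cluster sum of squared errors (SSE) for a user-supplied K."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster.
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = sum(((data[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return centroids, labels, sse

# Trial and error: the user must run the algorithm for every candidate K
# and compare the resulting SSE values by hand.
data = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9],
                 [9.0, 1.0], [9.1, 1.2]])
for k in (1, 2, 3):
    print(k, kmeans(data, k)[2])
```

Every candidate K requires a full clustering run, which is exactly the inefficiency the improved algorithm is meant to remove.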

BACKGROUND AND RELATED WORK
Dynamic clustering is achieved by improving the K-Means algorithm. The main purpose of dynamic clustering is to improve the quality of the clusters, and the algorithm also fixes the optimal number of clusters; both fixed and dynamic datasets work well with it. The algorithm uses the intra-distance and the inter-distance to evaluate clusters, where the intra-distance is the distance between a cluster's centroid and that cluster's data points, and the inter-distance is the distance between the centroids of the different clusters.
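The intra-distance and inter-distance described above can be computed as in the following sketch (a hedged illustration; the function names and sample points are my own, not taken from the cited work):

```python
import numpy as np

def intra_distance(points, centroid):
    """Mean distance between a cluster's centroid and that cluster's points."""
    return float(np.mean(np.linalg.norm(points - centroid, axis=1)))

def inter_distance(centroids):
    """Pairwise distances between the centroids of different clusters."""
    c = np.asarray(centroids)
    out = {}
    for i in range(len(c)):
        for j in range(i + 1, len(c)):
            out[(i, j)] = float(np.linalg.norm(c[i] - c[j]))
    return out

# Example: one small cluster and a second centroid far away.
cluster = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
centroid = cluster.mean(axis=0)
print(intra_distance(cluster, centroid))
print(inter_distance([centroid, np.array([5.0, 5.0])]))
```

A small intra-distance and a large inter-distance together indicate compact, well-separated clusters.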
Variance and median can also be used to initialize the cluster centers, where the variance measures how far a set of numbers is spread out, i.e., the distance between each point and the mean value, and the median is the middle value. The diagonal method can also be used to compute initial cluster centers. First, the data space is divided into K rows and K columns of cells, and the cell width and height are computed as width = (X_max − X_min)/K and height = (Y_max − Y_min)/K. The upper-left point of each diagonal cell is selected as a base point. The area for an initial centroid is formed by moving this base point left and right by a fraction of the cell width along the X-axis, and up and down by a fraction of the cell height along the Y-axis; a centroid is then selected randomly within the area formed by these diagonal points. Once the centroids are found, the K-Means algorithm is applied to them. The Y-Means algorithm is also used to find initial centroids; Y-Means uses a sequence of splitting, deleting, and merging of clusters.
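A simplified sketch of the diagonal initialization idea follows. The exact offsets around each base point are given by equations that did not survive in this copy, so the `frac` parameter below is an assumption:

```python
import numpy as np

def diagonal_init(data, k, frac=0.5, seed=0):
    """Sketch of diagonal initialization: split the bounding box of 2-D data
    into a K x K grid, take the K cells on the diagonal, and pick a random
    centroid in a neighbourhood of each diagonal cell's upper-left corner.
    `frac` (the neighbourhood size as a fraction of a cell) is assumed."""
    rng = np.random.default_rng(seed)
    x_min, y_min = data.min(axis=0)
    x_max, y_max = data.max(axis=0)
    width = (x_max - x_min) / k      # cell width
    height = (y_max - y_min) / k     # cell height
    centroids = []
    for i in range(k):
        # Upper-left (base) point of the i-th diagonal cell.
        bx = x_min + i * width
        by = y_min + i * height
        # Random point within +/- frac of a cell around the base point.
        cx = rng.uniform(bx - frac * width, bx + frac * width)
        cy = rng.uniform(by - frac * height, by + frac * height)
        centroids.append((cx, cy))
    return np.array(centroids)
```

Spreading the initial centroids along the diagonal keeps them apart, which tends to speed up convergence compared with purely random seeding.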

EXISTING SYSTEM
The K-Means algorithm takes K (the number of clusters) as input from the user, and there is no rule of thumb to calculate the K value; the user must find it by trial and error, which is very inefficient. Therefore an algorithm called the improved K-Means algorithm was proposed [1]. The improved K-Means algorithm works independently of the K value for numerical data: the number of clusters K is not required as input. Instead, the algorithm uses outliers to determine the number of clusters, i.e., K. Categorical data, however, remains a problem for this algorithm.
Input:
D: a set of n tuples with attributes A1, A2, …, Am, where m is the number of attributes; all attributes are numeric.
Output:
A suitable number of clusters with the n tuples distributed properly.
Method:

1. Compute the sum of the attribute values of each tuple (to find the points in the data set which are farthest apart).

2. Take the tuples with the minimum and maximum values of the sum as the initial centroids.

3. Create initial partitions (clusters) using the Euclidean distance between every tuple and the initial centroids.

4. Find the distance of every tuple from the centroid in both initial partitions. Take d = the minimum of all these distances (other than zero).

5. Compute new means (centroids) for the partitions created in step 3.

6. Compute the Euclidean distance of every tuple from the new means (cluster centers) and find the outliers using the following objective function: if the distance of a tuple from its cluster mean is < d, then it is not an outlier.

7. Compute the new centroids of the clusters.

8. Calculate the Euclidean distance of every outlier from the new cluster centroids and find the outliers not satisfying the objective function of step 6.

9. Let B = {Y1, Y2, …, Yp} be the set of outliers obtained in step 8 (the value of p depends on the number of outliers).

10. Repeat until B == ∅:

a. Create a new cluster for the set B, taking the mean value of its members as the centroid.

b. Find the outliers of this cluster using the objective function of step 6.

c. If the number of outliers equals p, create a new cluster with one of the outliers as its member, test every other outlier against the objective function as in step 6, and find the outliers, if any.

d. Calculate the distance of every outlier from the centroids of the existing clusters and assign to the existing clusters those outliers which satisfy the objective function of step 6.

e. Let B = {Z1, Z2, …, Zq} be the new set of outliers (the value of q depends on the number of outliers).

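The method above can be sketched as follows. This is a hedged illustration, not the authors' reference implementation: the iteration cap `max_rounds` and the simplified outlier handling (each remaining outlier set spawns one new cluster) are assumptions:

```python
import numpy as np

def assign(data, centroids):
    """Label every point with the index of its nearest centroid."""
    c = np.asarray(centroids)
    return np.linalg.norm(data[:, None, :] - c[None, :, :], axis=2).argmin(axis=1)

def improved_kmeans(data, max_rounds=20):
    """Outlier-driven K-Means sketch: no K is supplied; new clusters are
    spawned from outliers until none remain (or max_rounds is hit)."""
    data = np.asarray(data, dtype=float)
    # The tuples with the smallest and largest attribute sums serve as the
    # two initial centroids (they tend to be far apart).
    sums = data.sum(axis=1)
    centroids = [data[sums.argmin()], data[sums.argmax()]]
    # Initial partition by Euclidean distance.
    labels = assign(data, centroids)
    # Threshold d = smallest non-zero point-to-centroid distance.
    dists = np.array([np.linalg.norm(p - centroids[l])
                      for p, l in zip(data, labels)])
    d = dists[dists > 0].min()
    for _ in range(max_rounds):
        # Recompute centroids as cluster means (keep old one if cluster empty).
        centroids = [data[labels == j].mean(axis=0) if np.any(labels == j)
                     else centroids[j] for j in range(len(centroids))]
        labels = assign(data, centroids)
        # A tuple is an outlier if it lies at distance >= d from its centroid.
        out = np.array([np.linalg.norm(data[i] - centroids[labels[i]]) >= d
                        for i in range(len(data))])
        if not out.any():
            break
        # Spawn a new cluster from the current outliers.
        centroids.append(data[out].mean(axis=0))
        labels = assign(data, centroids)
    return np.array(centroids), labels
```

On well-separated data this settles on a cluster count by itself, which is the point of removing K from the input.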

PROPOSED WORK
The proposed algorithm converts a categorical dataset into a numerical one. Equations are formed using the distances between data points, and these equations are used to calculate a numeric value for each categorical value. The calculated values are then used as the input dataset for the improved K-Means algorithm, which divides the data into groups without knowing the value of K. In this way, grouping is achieved for categorical datasets. The steps of the proposed algorithm for converting categorical data into numerical data are given below.
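As a rough illustration of the conversion, the sketch below assigns each category a numeric value derived from distances between representative data points. Since the paper's equations are not reproduced in this copy, the exact scheme here is an assumption: the first category observed is anchored at 0, and every other category receives its mean numeric-attribute distance to the anchor category's points:

```python
import numpy as np

def encode_category(data_num, cats):
    """Assign a numeric value to every category of one categorical attribute.
    Assumed scheme (not the paper's exact equations): anchor the first
    category at 0; give each other category the mean Euclidean distance,
    over the numeric attributes, between its points and the anchor's points."""
    cats = np.asarray(cats)
    labels = list(dict.fromkeys(cats))   # categories in order of appearance
    anchor_pts = data_num[cats == labels[0]]
    values = {labels[0]: 0.0}
    for c in labels[1:]:
        pts = data_num[cats == c]
        # Mean pairwise numeric distance to the anchor category's points.
        d = np.linalg.norm(pts[:, None, :] - anchor_pts[None, :, :], axis=2)
        values[c] = float(d.mean())
    return np.array([values[c] for c in cats]), values
```

The resulting numeric column can be appended to the numerical attributes, after which the whole dataset can be clustered by the improved K-Means algorithm.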

Take an input dataset D containing categorical as well as numerical data.

Calculate the number of categories as cate[].

Until cate[] == ∅:

Select two categories, C1 and C2.

Search for two data points, P1 and P2, having category C1.

Search for one data point, P3, having category C2.

Calculate the distance between data points P1 and P2 as D12. Since both P1 and P2 have the same value for the categorical attribute, its contribution to D12 is zero. Therefore,


From equation (2),

Now calculate the distance between data points P2 and P3, having categories C1 and C2 respectively, as D23.

Add equation (3) and equation (4).

Subtract equation (4) from equation (3).

Add equation (5) and equation (6).

From the above equations, the value of each categorical datum can be calculated. The above steps are repeated until the values of all the categorical data have been found. These values are then used as input for the improved K-Means algorithm. The proposed improved algorithm is independent of K and also works for categorical datasets.

CONCLUSION
In the real world, almost every dataset contains categorical data, which makes clustering real-world datasets hard. The improved K-Means algorithm was proposed to remove the dependency on K, but it does not work for categorical data. In this paper, an algorithm is proposed that converts categorical data into numerical data, removing the categorical-data limitation of the improved K-Means algorithm. The proposed algorithm uses distances between data points to calculate a value for each category; once all category values have been calculated, they can be used as input to the improved K-Means algorithm. The proposed algorithm thus removes the dependency on K and works for categorical datasets. As future work, the efficiency of the algorithm can be improved further by examining different methods of calculating the initial centroids.
ACKNOWLEDGMENT
I would like to thank Prof. Kshirsagar M. (M.E. coordinator) and Prof. Jaypal P. C. (project guide) for their support and their efforts in making sure that sufficient facilities were always available. I am indebted to the authors whose works have been referred to in completing this paper. I would also like to thank Dr. Kureshi (Principal, Vishwabharti Academy's College of Engineering) for his guidance and support. I would also like to thank all my M.E. classmates for their help in understanding the topic.
REFERENCES
[1] Anupma Chadha, Suresh Kumar, "An Improved K-Means Clustering Algorithm: A Step Forward for Removal of Dependency on K", 2014 International Conference on Reliability, Optimization and Information Technology (ICROIT 2014), India, Feb. 6-8, 2014.
[2] B. M. Ahamed Shafeeq, K. S. Hareesha, "Dynamic Clustering of Data with Modified K-Means Algorithm", International Conference on Information and Computer Networks (ICICN 2012), IPCSIT, vol. 27, pages 221-225, 2012.
[3] M. Al-Daoud, "A New Algorithm for Cluster Initialization", World Academy of Science, Engineering and Technology, issue 4, 2007.
[4] Mohammed El Agha, Wesam M. Ashour, "Efficient and Fast Initialization Algorithm for K-means Clustering", I.J. Intelligent Systems and Applications, vol. 4, issue 1, pages 21-31, 2012.
[5] Kohei Arai, Ali Ridho Barakbah, "Hierarchical K-means: An Algorithm for Centroids Initialization for K-means", Reports of the Faculty of Science and Engineering, Saga University, vol. 36, issue 1, pages 25-31, 2007.
[6] V. Leela, K. Sakthipriya, R. Manikandan, "A Comparative Analysis between K-means and Y-means Algorithms in Fisher's Iris Data Sets", International Journal of Engineering and Technology, vol. 5, issue 1, pages 245-249, 2013.
[7] Stephen J. Redmond, Conor Heneghan, "A Method for Initializing the K-means Clustering Algorithm Using kd-trees", Pattern Recognition Letters, vol. 28, issue 8, pages 965-973, 2007.