 Open Access
 Authors : Mr. Yuvraj S. Sase, Mr. M. C. Kshirsagar
 Paper ID : IJERTV5IS070484
 Volume & Issue : Volume 05, Issue 07 (July 2016)
 DOI : http://dx.doi.org/10.17577/IJERTV5IS070484
 Published (First Online): 27-07-2016
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Dissimilarity Calculation Approach for Categorical Data Set
Mr. Yuvraj S. Sase 1
ME Second Year (Computer Engineering) Vishwbharati Academy College of Engineering, Ahmednagar, Pune University
Mr. M. C. Kshirsagar 2
Assistant Professor,
Vishwbharati Academy College of Engineering, Ahmednagar, Pune University
Abstract: K-Means is probably the most popular technique for solving the well-known clustering problem. It partitions a group of data points into a small number of clusters. Its main limitation is that the number of clusters, K, must be supplied as input. A user who is unfamiliar with the data set finds it very hard to choose K; there is no rule of thumb, so K is found by trial and error, which is inefficient for any kind of user. An improved K-Means algorithm [1] removes this problem: it takes no input from the user and determines K automatically during the process. However, the improved K-Means algorithm fails on categorical data sets. The proposed algorithm extends the improved K-Means algorithm to solve this problem. It uses dependent attributes to calculate the dissimilarity between categorical attribute values, which is then used together with the numerical data to produce the final clustering result.
Keywords: Clustering; Objects; Cluster Mean; Outliers; Dependent Attributes

INTRODUCTION
Today every sector is moving to the digital world and is therefore generating a huge amount of data. Knowledge discovery from this data is necessary for improvement in the respective sector. Clustering is one method of extracting useful patterns from a data set. It is a data mining technique that groups data into clusters by considering the similarity and dissimilarity of the objects. Many clustering algorithms have been proposed. The most popular is the K-Means algorithm, because of its simplicity and its effectiveness on large data sets. K-Means uses the Euclidean distance formula to measure the dissimilarity between data objects. However, K-Means has several problems, such as the difficulty of choosing the value of K and the selection of initial clusters.
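As a minimal illustration of the Euclidean distance that K-Means relies on, the following Python sketch computes the distance between two numeric objects. The function name `euclidean_distance` and the sample values are assumptions made for this example only.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical data objects with two numeric attributes each.
p = (1.0, 2.0)
q = (4.0, 6.0)
print(euclidean_distance(p, q))  # 5.0
```

The formula needs numeric inputs, so a categorical value such as a colour name cannot be passed to it directly.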
An improved K-Means algorithm [1] is intended as a solution to these problems. It removes the most basic problem of K-Means, the dependency on the value of K: the algorithm does not take K as input from the user but calculates it dynamically, using outliers. Outliers are decided by an objective function. If an object satisfies the conditions of the objective function, it is not an outlier; otherwise it is an outlier for that cluster. The algorithm finds all outliers and assigns new clusters to them. Next, all points are reassigned to the newly formed clusters according to minimum distance, and the whole process is repeated as long as the clusters keep changing. The improved K-Means algorithm thus solves the dependency problem of K-Means, but it does not work on categorical data sets.
Data that can be divided into categories is called categorical data. Such data has no numerical values; instead it has groups. There is no direct way of calculating the dissimilarity between these groups, which makes such data unsuitable for clustering as it stands.
In this paper, the proposed algorithm finds the dissimilarity between categorical attributes. It uses distance equations to derive a value for each category. The proposed algorithm takes two kinds of input data sets: the first is the data set containing the categorical attributes that need to be grouped into clusters, and the second is a data set containing the attributes that depend on the categorical attribute.

LITERATURE SURVEY AND RELATED WORK

Categorical Data
Data that can be divided into categories is called categorical data. For example,

Age Group

Colours

Gender
Categorical attributes are very common in real-life situations, and every sector generates a huge amount of categorical data. Categorical data also includes data collected in an either/or, yes/no fashion. Categorical data is divided into three types:

Dichotomous Data: A categorical attribute that has only two values is called dichotomous data. Dichotomous variables are designed to give an either/or response.

Ordinal Data: A categorical attribute that has two or more categories in a defined order is called ordinal data.

Nominal Data: A categorical attribute that has two or more categories with no order is called nominal data.
Fig. 1. Skeleton Diagram of Proposed System


Dependent and Independent Attributes
A dependent variable is simply a variable that depends on one or more independent variables. Student performance is an example: it depends on the student's study time and natural intelligence, so a change in either of these two variables changes the student's performance. In the proposed algorithm, the dependent attribute plays a crucial part in calculating the dissimilarity of a categorical attribute.


PROPOSED ALGORITHM
The proposed algorithm finds the dissimilarity between categorical attributes using distance equations. It takes two kinds of input data sets: the first is the data set containing the categorical attributes that need to be grouped into clusters, and the second is a data set containing the attributes that depend on the categorical attribute. The proposed system calculates a value for every category tuple in each categorical attribute. Fig. 1 shows the skeleton diagram of the proposed system.

Algorithm
The steps of the algorithm are as follows.
1: Let c = c1, c2, …, cn be a single categorical feature, where n is the number of category tuples.
2: Select the dependent attributes for the categorical attribute.
3: Select any two category tuples.
4: Select the 4 smallest values of the dependent attributes for the first category tuple as n1, n2, n3, n4 and the two biggest values for the second category tuple as n5.
5: Find the distances between these values as,
6: Find the distance equations as,
7: Adding eq1 and eq2 gives the final dissimilarity between the selected categories, that is C12.
8: Repeat from step 3 until the dissimilarity between all category tuples has been obtained.
9: Apply the improved K-Means algorithm to the final data set with the category values.
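Steps 3 to 8 amount to filling a table of pairwise dissimilarities between category tuples. The Python sketch below shows only that outer loop; because the distance equations of steps 5 and 6 are not reproduced in the text, the pairwise measure is passed in as a caller-supplied function. The names `pairwise_dissimilarity`, `pair_distance` and `mean_gap`, the input layout, and the mean-difference stand-in are all assumptions for illustration, not the paper's equations.

```python
from itertools import combinations

def pairwise_dissimilarity(dependents, pair_distance):
    """Build a symmetric table of dissimilarities between category tuples.

    dependents    -- dict mapping each category tuple to a list of its
                     dependent-attribute values (hypothetical input layout)
    pair_distance -- function(values_a, values_b) -> float; a placeholder
                     for the distance equations of steps 4 to 7
    """
    table = {}
    # Visit every unordered pair of category tuples (steps 3 and 8).
    for a, b in combinations(sorted(dependents), 2):
        d = pair_distance(dependents[a], dependents[b])
        table[(a, b)] = d
        table[(b, a)] = d
    return table

def mean_gap(values_a, values_b):
    """Stand-in measure (assumption): absolute difference of the means."""
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    return abs(mean_a - mean_b)

# Hypothetical dependent-attribute values for three category tuples.
deps = {"red": [1.0, 3.0], "green": [4.0, 6.0], "blue": [10.0, 10.0]}
result = pairwise_dissimilarity(deps, mean_gap)
print(result[("red", "green")])  # 3.0
```

The resulting table could then replace the categorical column with numeric dissimilarities before the clustering in step 9.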


IMPLEMENTATION ENVIRONMENT
The proposed system achieves code separation using the MVC design pattern, in which M stands for Model, V for View and C for Controller. The Model handles all business logic; in this case, all algorithm steps such as selecting categories, calculating distances, finding means and dividing data into clusters. The Model is the most essential part of the system. Views are the graphical interfaces that the user sees and uses to interact with the system. Keeping the View separate from the business logic means the graphical interface can be changed at any time without changing the business logic. Finally, the Controller is the connection between the Model and the View. The implementation environment selected for the proposed system is:

PHP

PHPDesktop

Apache Server

MySQL


MATHEMATICAL MODELLING

Dissimilarity in Dichotomous Variable
Dichotomous variables are designed to give an either/or response. As such a variable has only two values, two values are either the same or different, so every differing pair adds the same amount of dissimilarity in cluster formation. Let c1 and c2 be two dichotomous categorical tuples. Then the distance d(c1, c2) is given by the following equation:
d(c1, c2) = 0 if c1 = c2, and 1 otherwise
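A minimal Python sketch of this 0/1 measure follows; the function name `dichotomous_distance` and the sample values are assumptions for the example.

```python
def dichotomous_distance(c1, c2):
    """Return 0 when the two values are the same and 1 when they differ."""
    return 0 if c1 == c2 else 1

print(dichotomous_distance("yes", "no"))   # 1
print(dichotomous_distance("yes", "yes"))  # 0
```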

Dissimilarity in Ordinal Variable
This kind of variable has multiple values in a predefined order. Because the values are ordered, the difference between two consecutive values is always the same. The difference between two ordinal values is therefore the product of the difference between two consecutive values and the difference in their order.
Let c1 and c2 be two ordinal categorical tuples, d(s) be the difference between two consecutive tuples and d(o) be the difference between the order of c1 and c2. The distance d(c1, c2) is given by the equation
d(c1, c2) = d(s) * d(o)

Dissimilarity in Nominal Variable
This kind of variable has multiple categories that are not in any order, so it is difficult to calculate the difference between nominal values. In this case we use the variables that depend on the categorical attribute. The dependent attribute directly represents the value of the categorical data.
Let c1 and c2 be two nominal categorical tuples and tdep1 and tdep2 be the corresponding dependent variables. The distance d(c1, c2) is given by the equation
d(c1, c2) = (tdep1 - tdep2)

DISCUSSION ON PERFORMANCE
Performance testing is done using the Xdebug and Webgrind tools. Xdebug is a PHP extension that provides debugging and profiling capabilities. Webgrind is an Xdebug profiling web frontend written in PHP 5. The performance of the proposed system is shown in Fig. 2.
Fig. 2. Performance of Proposed System
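The ordinal and nominal measures from the Mathematical Modelling section can be sketched in Python as follows. The function names, the `order` list, the default step of 1.0 for d(s), and the use of an absolute value for d(o) are assumptions made for this illustration; the nominal measure is kept signed, exactly as the equation is written.

```python
def ordinal_distance(c1, c2, order, step=1.0):
    """d(c1, c2) = d(s) * d(o) for two ordinal values.

    order -- list of categories in their predefined order (hypothetical input)
    step  -- d(s), the constant gap between two consecutive categories
             (the default of 1.0 is an assumption)
    """
    # d(o): difference in order; taking the absolute value is an assumption.
    d_o = abs(order.index(c1) - order.index(c2))
    return step * d_o

def nominal_distance(tdep1, tdep2):
    """d(c1, c2) = tdep1 - tdep2, the signed difference of dependent values."""
    return tdep1 - tdep2

sizes = ["small", "medium", "large"]
print(ordinal_distance("small", "large", sizes))  # 2.0
print(nominal_distance(7.5, 5.0))                 # 2.5
```

A caller that needs a symmetric distance for clustering might wrap `nominal_distance` in `abs()`, which would be a further assumption beyond the equation given.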
CONCLUSION
This paper has reviewed the improved K-Means algorithm and its problems, and has shown the importance of categorical data, which is very common in real-life situations, and of clustering it. The proposed algorithm calculates the dissimilarity between categorical attributes, which is then used to form the final clusters. The paper has also presented the result generated by the proposed system and its performance. However, the proposed system still has several issues, such as accurate dependency calculation and performance, which could be addressed in future work.
ACKNOWLEDGMENT
I would like to sincerely thank the people who supported and helped in this work: firstly my guide, Mr. M. C. Kshirsagar, and also the other teaching staff, my friends and my family, for their great help from beginning to end.
REFERENCES

[1] Anupama Chadha and Suresh Kumar, "An Improved K-Means Clustering Algorithm: A Step Forward for Removal of Dependency on K," 2014 International Conference on Reliability, Optimization and Information Technology, Feb 6-8, 2014.
[2] Zhexue Huang and Michael K. Ng, "A Fuzzy k-Modes Algorithm for Clustering Categorical Data," IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, August
[3] I. H. Jarman, T. A. Etchells, P. J. G. Lisboa and C. M. Beynon, "Clustering Categorical Data: A Stability Analysis Framework," IEEE, 2011.
[4] Dadong Yu, Dongbo Liu, Rui Luo, Jianxin Wang, "Clustering Categorical Data Based on Maximal Frequent Itemsets," IEEE Sixth International Conference on Machine Learning and Applications, 2007.
[5] B. Suresh Kumar, H. Venkateswara Reddy, T. Ankamma Raju, Preethi Vennam, "Clustering Categorical Data Using Rough Membership Function," 2014 Sixth International Conference on Computational Intelligence and Communication Networks, 2014.
[6] Zhexue Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery 2, 283-304 (1998).
[7] Zhexue Huang, "Clustering Large Data Sets With Mixed Numeric And Categorical Values," IJCA,
[8] Madhu Yedla, Srinivasa Rao Pathakota, T M Srinivasa, "Enhancing K-means Clustering Algorithm with Improved Initial Center," International Journal of Computer Science and Information Technologies, Vol. 1 (2), 2010, 121-125.
[9] Ahamed Shafeeq B M and Hareesha K S, "Dynamic Clustering of Data with Modified K-Means Algorithm," 2012 International Conference on Information and Computer Networks (ICICN 2012).
[10] Zengyou He, Xiaofei Xu, Shengchun Deng, "Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach," High Technology Research and Development Program.
[11] Pramod Nutan Dhara and Rishi Sayal, "A Comparative Analysis Between K-Means and Y-Means Algorithms in Fisher's Iris Data Sets," International Journal of Engineering and Technology (IJET), Vol 5 No 1, Feb-Mar 2013, 245.