Feasibility of Fuzzy Clustering for Improving the Objective Function of K-Means Clustering

DOI : 10.17577/IJERTV11IS070231


Shubham Patil

P.G. Student

G. H. Raisoni Institute of Engineering & Management Jalgaon M.S. India

Hiralal Solanki

Assistant Professor

G. H. Raisoni Institute of Engineering & Management Jalgaon M.S. India

Abstract: Clustering is the division of data items into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. The k-means clustering method is one of the classical and simplest methods for data clustering, and it is one of the most widely used methods in practice because of its simplicity. However, the resulting membership values do not always correspond well to the degrees of belonging of the data; to overcome this problem in hard K-Means (KM) clustering, the Fuzzy K-Means (FKM) clustering approach is proposed. Fuzzy clustering forms clusters such that a data object can belong to more than one cluster according to its membership levels. The existing system uses a subtraction-structure objective function, which introduces a sawtooth nature in the objective function. In this paper, the feasibility of a fuzzy partition matrix in the objective function of k-means clustering is proposed; it smooths the sawtooth nature of the objective function, which is the main reason for the improvement of the objective function.

Keywords: Fuzzy clustering, fuzzy partition matrix

    1. INTRODUCTION

Data clustering is one of the important data mining methods: it groups data objects so that there is the most similarity within the same class and the most dissimilarity between different classes. The large number of data objects greatly increases the challenge of comprehending and interpreting the resulting mass of data. A first step toward addressing this challenge is the use of clustering techniques. Clustering is a common descriptive task in which one seeks to identify a finite set of categories, or clusters, to describe the data. It groups the data objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels.

The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning).

Clustering in data mining becomes very difficult because datasets are very large and have many attributes of different types. This imposes unique computational requirements on clustering algorithms. The main concern for most clustering algorithms is their need to know the number of clusters to look for. Since clustering is an unsupervised way of grouping, the user has no prior knowledge of the actual number of clusters. Dividing the dataset into fewer or more clusters will result in merging some separate clusters or breaking down some compact ones. The process of finding an optimal number of clusters is called cluster validity.

In order to understand the main aim of fuzzy k-means clustering, the drawbacks of traditional k-means clustering are studied. K-means clustering clusters the data in a crisp sense, which can result in empty clusters.

The proposed Fuzzy K-Means (FKM) clustering, in contrast, uses the membership grades of a partition matrix to express ambiguity in the assignment of data points to clusters. The proposed partition-based fuzzy k-means employs fuzzy measures as the basis for membership matrix calculation and for cluster center identification. Applying fuzzy measures to clustering helps to improve the results of fuzzy k-means clustering.

    2. RELATED WORK

Fuzzy logic, introduced by Zadeh [12], may be viewed as an attempt to formalize remarkable human capabilities, such as the capability to perform a wide variety of physical and mental tasks without any measurements and any computations. In many real-world application areas, knowledge is represented in terms of imprecise linguistic words from a natural language. A linguistic variable is a variable whose values are words or sentences in a natural or artificial language. For example, honesty is a linguistic variable; its linguistic values can be extremely honest, not honest, sometimes honest, and very honest.

Fuzzy logic is the way of representing and manipulating data that is not exact, but rather uncertain [11]. Uncertainty can be manifested in many forms: it can be fuzzy (not sharp, unclear, imprecise, approximate), it can be vague (not specific, amorphous), it can be ambiguous (too many choices, contradictory), it can be of the form of ignorance (dissonant, not knowing something), or it can be a form due to natural variability (conflicting, random, chaotic, unpredictable).

    3. LITERATURE SURVEY

Figure 1 shows the structure of the literature survey. There are two main domains of data mining: clustering and classification. Clustering is unsupervised learning, and classification is supervised learning. Clustering is divided into two categories, namely hard clustering and fuzzy clustering. Hard clustering clusters the data in a crisp sense: each data object can be a member of one and only one cluster at a time. In hard clustering there is normally at least one object in each cluster; however, empty clusters can be obtained if no object is allocated to a cluster during the assignment. Fuzzy clustering assigns a data object to more than one cluster at a time.

      Figure 1: Literature Survey

        1. Hard Clustering

Hard clustering is also known as crisp clustering. Crisp clustering allocates each data pattern (data object) of the given input to a single cluster. Thus, in hard clustering, each data pattern (data object) belongs to only one cluster. Fraley and Raftery, in [17], suggested dividing the clustering methods into two main groups: partitioning and hierarchical methods. Han and Kamber, in [1], suggested three additional main categories: density-based methods, model-based clustering, and grid-based methods.

        2. Partitioning Clustering

Partitioning clustering [18] directly divides data objects into a pre-specified number of clusters. Since checking all possible partitions is computationally impractical, certain greedy heuristics are used in the form of iterative optimization of the clusters. Researchers have suggested several partitioning clustering approaches, viz., k-means clustering, k-medoid clustering, relocation algorithms, and probabilistic clustering.

K. Tapas et al., in [19], have described K-means clustering. K-means is a method commonly used to automatically partition a data set into K clusters. It partitions the objects into K mutually exclusive clusters in such a fashion that objects within each cluster remain as close as possible to each other but as far as possible from objects in other clusters. Each cluster is characterized by its center point, i.e., its centroid. The distances used in clustering often do not represent actual spatial distances, and in general the only solution to the problem of finding the global minimum is an exhaustive choice of starting points. The K-means algorithm finds the desired number of distinct clusters and their centroids. A centroid is the point whose coordinates are obtained by computing the average of each of the coordinates of the points assigned to the cluster. The input parameters of the algorithm are the number of clusters to be found along with the initial starting values. Given the initial starting values, the distance from each sample data point to each starting value is computed, and each data point is placed in the cluster associated with the nearest starting point. After all the data points are assigned to a cluster, the new cluster centroids are calculated, and the new centroids are taken as the new starting values. This process continues until no data point changes cluster or until the centroids no longer move. In K-means, a data object can belong to precisely one cluster during the clustering process, which can be too restrictive when clustering high-dimensional data expressed in multiple conditions. A minimal sketch of this loop is given below.
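The loop just described can be written down compactly. The following is a minimal sketch in Python/NumPy; the random initialization, the toy data, and the empty-cluster guard are illustrative assumptions, not details from the paper.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: assign each point to the nearest centroid, then
    recompute centroids, until no assignment changes."""
    rng = np.random.default_rng(seed)
    # Initial starting values: k distinct points drawn from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Distance from every data point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # no data point changed cluster
        labels = new_labels
        for j in range(k):             # recompute each cluster's centroid
            if np.any(labels == j):    # skip a cluster that went empty
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy usage on two synthetic blobs (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = k_means(X, k=2)
```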

Han and Kamber, in [1], have described k-medoid clustering. In k-medoid clustering a cluster is represented by one of its points, called a medoid: the most centrally located data point. When medoids are selected, clusters are defined as subsets of points close to their respective medoids, and the objective function is defined as the averaged distance, or another dissimilarity measure, between a point and its medoid. Every time a new medoid is selected, the distance between each object and its newly selected cluster center has to be recomputed. Because there could be obstacles between two objects, the distance between two objects may have to be derived by geometric computations, and the computational cost can get very high if a large number of objects and obstacles are involved. Representation by k-medoids has two advantages [9]: first, it presents no limitations on attribute types; second, the choice of medoids is dictated by the location of a predominant fraction of points inside a cluster, and it is therefore less sensitive to the presence of outliers. A medoid-update sketch follows this paragraph. P. Berkhin, in [3], has described relocation algorithms. Relocation algorithms iteratively reallocate points between the k clusters; points are reassigned based on a local search, and an iterative relocation technique attempts to improve the partitioning by moving objects from one group to another. The three changeable elements of the general relocation algorithm are initialization, reassignment of the data points into clusters, and update of the cluster parameters. These algorithms build high-quality clusters due to the iterative approach; the process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation. I. V. Cadez et al., in [20], have proposed probabilistic clustering. In the probabilistic approach, data is considered to be a sample independently drawn from a mixture of several probability distributions, so the clusters are associated with the parameters of the corresponding distributions, such as mean and variance.
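As a companion to the k-means sketch above, here is a minimal, hypothetical sketch of the medoid-update step only: within each cluster, the new medoid is the member point with the smallest summed distance to the other members (assuming no cluster is empty). It illustrates the idea rather than reproducing the cited algorithm.

```python
import numpy as np

def update_medoids(X, labels, k):
    """For each cluster, choose as medoid the member point whose summed
    distance to all other members is smallest (clusters must be non-empty)."""
    medoids = np.empty(k, dtype=int)
    for j in range(k):
        members = np.flatnonzero(labels == j)
        P = X[members]
        # Pairwise Euclidean distances among the cluster's members.
        D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
        medoids[j] = members[D.sum(axis=1).argmin()]
    return medoids
```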

        3. Hierarchical Clustering

IndiraPriya and Ghosh, in [21], have described hierarchical clustering. Hierarchical clustering creates a hierarchical decomposition of the given set of data objects. It builds a cluster hierarchy, a tree of clusters also known as a dendrogram, which represents a sequence of nested clusters constructed top-down or bottom-up. The root of the tree represents one cluster, containing all data points, while at the leaves of the tree there are n clusters, each containing one data point. By cutting the tree at a desired level, a clustering of the data points into disjoint groups is obtained. Hierarchical clustering is used to find data at different levels of dissimilarity. A hierarchical clustering can be classified as either agglomerative or divisive, based on how the hierarchical decomposition is formed; a sketch of the agglomerative case follows.
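For concreteness, here is a minimal sketch of bottom-up (agglomerative) merging under a single-linkage rule; the merge criterion and the stopping condition (a target number of clusters) are illustrative assumptions, since the cited survey covers many variants.

```python
import numpy as np

def agglomerative(X, num_clusters):
    """Bottom-up clustering: start with one cluster per point and
    repeatedly merge the two closest clusters (single linkage)."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # point distances
    while len(clusters) > num_clusters:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of points.
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return clusters
```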

        4. Density-Based Algorithms

P. Berkhin, in [3], has described density-based algorithms. Density-based algorithms are capable of discovering clusters of arbitrary shapes. These algorithms group objects according to specific density objective functions. Density is usually defined as the number of objects in a particular neighborhood of a data object. In these approaches a given cluster continues growing as long as the number of objects in the neighborhood exceeds some parameter; the sketch below makes this neighborhood test concrete.
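A minimal sketch of the neighborhood-density test, assuming an invented radius eps and threshold min_pts in the spirit of density-based methods (this is only the core test, not a complete algorithm):

```python
import numpy as np

def dense_points(X, eps=0.5, min_pts=5):
    """Mark each point whose eps-neighborhood holds at least min_pts
    objects; density-based clusters grow from such dense points."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbor_counts = (D <= eps).sum(axis=1)  # includes the point itself
    return neighbor_counts >= min_pts
```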

        5. Model-Based Clustering

Han and Kamber, in [1], have described model-based clustering methods. Model-based clustering hypothesizes a model for each of the clusters and finds the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking noise or outliers into account and thus yielding robust clustering methods. Expectation-Maximization (EM) is an algorithm that performs expectation-maximization analysis based on statistical modeling. Cobweb is a conceptual learning algorithm that performs probability analysis and takes concepts as a model for clusters. The Self-Organizing feature Map (SOM) is a neural network-based algorithm that clusters by mapping high-dimensional data into a 2-D or 3-D feature map, which is also useful for data visualization.

        6. Grid-Based Clustering

Grid-based clustering [1] quantizes the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space. Some typical examples of the grid-based approach include STING (Statistical Information Grid), which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE (CLustering In QUEst), which represents a grid- and density-based approach for clustering in high-dimensional data space. Grid-based clustering creates a grid structure by partitioning the data space into a finite number of non-overlapping cells and then calculates the cell density for each cell. After calculating the densities it sorts the cells by density; cluster centers are then identified and all neighbor cells are traversed. The quantization step is sketched below.
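A minimal sketch of the quantize-and-count step, with the number of bins per dimension as an illustrative assumption:

```python
import numpy as np

def grid_densities(X, bins=10):
    """Quantize the data space into a grid and count points per cell;
    grid-based clustering then operates on cells, not individual points."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # Map every coordinate to a cell index in [0, bins - 1].
    cells = np.floor((X - mins) / (maxs - mins + 1e-12) * bins).astype(int)
    cells = np.clip(cells, 0, bins - 1)
    ids, counts = np.unique(cells, axis=0, return_counts=True)
    order = counts.argsort()[::-1]            # sort cells by density
    return ids[order], counts[order]
```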

N. H. Park and W. S. Lee, in [23], presented statistical grid-based clustering over data streams. A data stream is a large, unbounded sequence of data elements continuously generated at a rapid rate. Their approach is a statistical grid-based approach for clustering the data elements of data streams. First, the multidimensional data space of a data stream is partitioned into a set of mutually exclusive, equal-size initial cells. When the support of a cell becomes high enough, the cell is dynamically divided into two mutually exclusive intermediate cells based on its distribution statistics. A cluster of a data stream is a group of adjacent dense unit cells.

        7. Fuzzy Clustering

Fuzzy clustering is the synthesis of fuzzy logic and clustering, which is a requirement of modern computing [12]. The aim of fuzzy clustering is to model the ambiguity within unlabeled data objects efficiently. Every data object is assigned a membership representing its degree of belonging to a certain class. The requirement that each object be assigned to exactly one cluster is relaxed to the weaker requirement that the object can belong to all of the clusters with a certain degree of membership. Thus it assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership, as illustrated below. Soft clustering is categorized into three categories [15]: fuzzy relation based clustering, fuzzy rule based clustering, and objective function based clustering.
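To make the hard/fuzzy contrast concrete, the following tiny example (with invented numbers) shows a fuzzy partition matrix whose columns sum to one, and its conversion to a hard clustering by taking the largest membership:

```python
import numpy as np

# Fuzzy partition matrix U: rows are clusters, columns are data objects.
# Each column sums to 1, so every object belongs to all clusters by degree.
U = np.array([[0.9, 0.6, 0.2],
              [0.1, 0.4, 0.8]])

# Hardening: assign each object to the cluster of largest membership.
hard_labels = U.argmax(axis=0)   # -> array([0, 0, 1])
```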

        8. Fuzzy Relation Based Clustering

M. S. Yang, in [15], has described fuzzy relation based clustering. Fuzzy relation based clustering includes an n-step procedure using the composition of fuzzy relations, beginning with a reflexive and symmetric fuzzy relation R in X; the data set is then partitioned into clusters by the resulting equivalence relation. G. S. Liang et al., in [24], have introduced cluster analysis based on a fuzzy equivalence relation. In their approach, a distance measure between two trapezoidal fuzzy numbers is used to aggregate subjects' linguistic assessments of attribute ratings; this distance measure characterizes the inter-object similarity and yields a fuzzy compatibility relation. A fuzzy equivalence relation is then constructed from the fuzzy compatibility relation, as sketched below.
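For illustration, the composition step can be sketched as repeated max-min composition of a reflexive, symmetric fuzzy relation until it becomes transitive (i.e., a fuzzy equivalence relation); the small relation matrix and the lambda-cut threshold below are invented for the example.

```python
import numpy as np

def maxmin_closure(R):
    """Transitive closure of a reflexive, symmetric fuzzy relation R
    via repeated max-min composition R o R until a fixed point."""
    while True:
        # (R o R)[i, j] = max over k of min(R[i, k], R[k, j])
        RR = np.max(np.minimum(R[:, None, :], R.T[None, :, :]), axis=2)
        RR = np.maximum(R, RR)
        if np.allclose(RR, R):
            return RR
        R = RR

R = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.4],
              [0.0, 0.4, 1.0]])
E = maxmin_closure(R)
groups_at_05 = E >= 0.5   # lambda-cut: i, j grouped if E[i, j] >= 0.5
```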

4. FEASIBILITY OF FUZZY CLUSTERING FOR IMPROVING THE OBJECTIVE FUNCTION OF K-MEANS CLUSTERING

The proposed solution focuses on text clustering using fuzzy logic based clustering in order to facilitate and improve effectiveness over a conventional hard clustering approach. Fuzzy clustering is a partition based clustering scheme and is particularly useful when there are no apparent clear groupings in the data set [34]. Partitioning schemes provide automatic detection of cluster boundaries, and in fuzzy clustering these cluster boundaries overlap: every individual data entity belongs not to one but to all the clusters with varying degrees of membership.

The proposed system preprocesses the data; the preprocessed data is then given as input to both the conventional hard clustering algorithm and the proposed fuzzy partitioning algorithm. Finally, the clusters are formed and the results of hard clustering and fuzzy clustering are compared.

A hard partition is insufficient to represent many real situations. Therefore, a fuzzy clustering method is offered to construct clusters with uncertain boundaries; this method allows one object to belong to several overlapping clusters to some degree.


Conditions for a fuzzy context partition matrix are suggested by relaxing the usual constraint that each object's memberships sum to one. The relaxed constraints require only that each target (outlier) word is assigned to at least one fuzzy sense cluster with membership greater than zero:

$u_{ij} \in [0, 1], \qquad 1 \le i \le c, \quad 1 \le j \le N$

$\sum_{i=1}^{c} u_{ij} > 0, \qquad 1 \le j \le N$

$0 < \sum_{j=1}^{N} u_{ij} < N, \qquad 1 \le i \le c$

FCM calculates the distance between each cluster center and each data point and assigns a flexible membership to every data point: the nearer a data point is to a cluster center, the higher its membership towards that cluster center, and the summation of the memberships of each data point equals one. FCM uses a fuzzy partitioning approach, carried out through an iterative procedure that updates the cluster centers $c_j$ and the memberships $u_{ij}$ by the following two equations, respectively:

$c_j = \dfrac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}, \qquad j = 1, 2, \ldots, c$

$u_{ij} = \dfrac{1}{\sum_{k=1}^{c} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)}}$

where

n = number of data points
$c_j$ = the j-th cluster center
m = fuzziness index, $m \in [1, \infty]$
c = number of cluster centers
$u_{ij}$ = membership of data point $i$ to cluster $j$
$\lVert \cdot \rVert$ = Euclidean distance
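The alternating update implied by these two equations can be sketched as follows; this is a minimal illustration in Python/NumPy under the definitions above, with the toy data, the fuzziness index m = 2, and the convergence tolerance chosen as illustrative assumptions.

```python
import numpy as np

def fuzzy_k_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy k-means: alternate the center update c_j and the
    membership update u_ij until the memberships stabilize.
    Here U has one row per data point and one column per cluster."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)      # each point's memberships sum to 1
    for _ in range(max_iter):
        Um = U ** m
        # Center update: weighted mean of the data with weights u_ij^m.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Membership update: inverse relative distances, exponent 2/(m-1).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)        # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:  # memberships have stabilized
            return U_new, centers
        U = U_new
    return U, centers

# Toy usage: two blobs; harden the memberships for comparison with k-means.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
U, centers = fuzzy_k_means(X, c=2)
hard = U.argmax(axis=1)
```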

4.1 Architecture

The architecture of the proposed system is shown in Figure 2. Input to the proposed system is the text data set [36].

Figure 2: Architecture of Proposed System

In fuzzy clustering, a data object has an associated degree of membership for each cluster, indicating the strength of its association with that cluster. The system iteratively updates the membership values of each data object for the pre-defined number of clusters; thus, a data object can be a member of all clusters with the corresponding membership values. The process of calculating cluster centers and assigning points to these centers is repeated until the cluster centers stabilize [37].

5. CONCLUSION AND FUTURE WORK

The proposed FKM formulates the objective function in terms of improving the membership assignments of an object. It assigns fuzzy memberships to each data object and updates the cluster centers according to the assigned memberships. The assigned memberships play the role of weight values representing the degree to which a data object belongs to more than one cluster. The degree of belongingness depends on the selection of the fuzziness factor, and the results of the proposed Fuzzy K-Means differ significantly with the choice of fuzziness factor. In hard K-Means, a set of k initial cluster centers is chosen arbitrarily, each object is then assigned to the center closest to it, and the centers are recomputed; this is repeated until the process stabilizes, which takes more execution time and memory. The proposed Fuzzy K-Means approach, on the other hand, assigns each object a membership that is inversely related to the relative distance of the object to the cluster center.

6. BIBLIOGRAPHY

[1] J. Han and M. Kamber, Data mining: concepts and techniques," 2001.

[2] C. C. Aggarwal and C. K. Reddy, Data clustering: algorithms and applications.

[3] P. Berkhin, Survey of clustering data mining techniques," San Jose, CA, 2002.

[4] O. M. Jafar and R. Sivakumar, A comparative study of hard and fuzzy data clustering algorithms with cluster validity indices," in Proceedings of the Elsevier International

[5] J. Daxin, C. Tang, and A. Zhang, Cluster analysis for gene expression data: a survey," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370.

[6] A. Baraldi and P. Blonda, A survey of fuzzy clustering algorithms for pattern recognition," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29, no. 6, pp. 778-785, October 1998.

[7] J. Hu, C. Xiong, J. Shu, X. Zhou, and J. Zhu, A novel text clustering method based on tgsom and fuzzy k-means" , in Proceedings of the 2009 First International Workshop on Education Technology and Computer Science – Volume 01, ser. ETCS '09, 2009, pp. 26-30.

[8] C. Wu, C. Ouyang, L. nan Chen, and L. Lu, A new fuzzy clustering validity index with a median factor for centroid based clustering", IEEE Transactions on Fuzzy Systems, vol. 23, no. 3, June 2015.

[9] L. Rokach and O. Maimon, Clustering methods," in Data mining and knowledge discovery handbook. Springer, 2005, pp. 321-352.

[10] N. A. M. Isa, S. Salamah, U. K. Ngah et al., Adaptive fuzzy moving k-means clustering algorithm for image segmentation," IEEE Transactions on Consumer Electronics, vol. 55, no. 4, pp. 2145-2153, 2009.

[11] T. J. Ross, Fuzzy logic with engineering applications, 2nd ed. John Wiley & Sons, 2009.

[12] L. A. Zadeh, Fuzzy sets," Information and Control, vol. 8, pp. 338-353, 1965.

[13] L. A. Zadeh, Is there a need for fuzzy logic?" Information Sciences, vol. 178, no. 13, pp. 2751-2779, 2008.

[14] L. Zadeh, C. Negoita, and H. Zimmermann, Fuzzy sets as a basis for a theory of possibility," Fuzzy sets and systems, vol. 1, pp. 3- 28, 1978.

[15] M. S. Yang, A survey of fuzzy clustering," Mathematical and Computer modelling, vol. 18, no. 11, pp. 1-16, 1993.

[16] Q. Ni, Q. Pan, H. Du, C. Cao, and Y. Zhai, A novel cluster head selection algorithm based on fuzzy clustering and particle swarm optimization," IEEE/ACM Transactions on Computational Biology and Bioinformatics, no. 99, pp. 1-9, 2015.

[17] C. Fraley and A. E. Raftery, How many clusters? which clustering method? Answers via model-based cluster analysis," The computer journal, vol. 41, no. 8, pp. 578-588, 1998.

[18] S. Ayramo and T. Karkkainen, Introduction to partitioning based clustering methods with a robust example," Reports of the Department of Mathematical Information Technology Series C. Software and Computational Engineering, 2006.

[19] K. Tapas, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu, An efficient k-means clustering algorithm: analysis and implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881-892, July 2002.

[20] I. V. Cadez, S. Gaffney, and P. Smyth, A general probabilistic framework for clustering individuals and objects," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2000, pp. 140-149.

[21] P. IndiraPriya and D. Ghosh, A survey on different clustering algorithms in data mining technique," International Journal of Modern Engineering Research (IJMER), vol. 3, no. 1, pp. 267-274, 2013.

[22] M. W. Berry and M. Castellanos, Survey of Text Mining:Clustering, Classification and Retrieval, 2nd ed. Springer, 2007.

[23] N. H. Park and W. S. Lee, Statistical grid-based clustering over data streams," ACM SIGMOD Record, vol. 33, no. 1, pp. 32-37, 2004.

[24] G.-S. Liang, T.-Y. Chou, and T.-C. Han, Cluster analysis based on fuzzy equivalence relation," European Journal of Operational Research, vol. 166, no. 1, pp. 160-171, June 2004.

[25] M.-S. Yang and H.-M. Shih, Cluster analysis based on fuzzy relations," Fuzzy Sets and Systems, vol. 120, no. 2, pp. 197-212, 2001.

[26] E. G. Mansoori, Frbc: a fuzzy rule-based clustering algorithm," IEEE Transactions on Fuzzy Systems, vol. 19, no. 5, pp. 960-971, 2011.

[27] M. Delgado, A. F. Gómez-Skarmeta, and F. Martin, A fuzzy clustering-based rapid prototyping for fuzzy rule-based modeling," IEEE Transactions on Fuzzy Systems, vol. 5, no. 2, pp. 223-233, 1997.

[28] Y. Lu, T. Ma, C. Yin, X. Xie, W. Tian, and S. Zhong, Implementation of the fuzzy c-means clustering algorithm in meteorological data," International Journal of Database Theory and Application, vol. 6, no. 6, pp. 1-18, 2013.

[29] P. Lingras and G. Peters, Applying rough set concepts to clustering," in Rough Sets: Selected Methods and Applications in Management and Engineering. Springer, 2012.

[30] N. R. Pal and J. C. Bezdek, On cluster validity for the fuzzy c- means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370-379, 1995.

[31] P. Maji and S. Paul, Rough-fuzzy clustering for grouping functionally similar genes from microarray data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 2, pp. 286-299, March 2013.

[32] O. Sutton, Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction," University lectures, University of Leicester, 2012.

[33] R.-P. Li, M. Mukaidono, and I. B. Turksen, A fuzzy neural network for pattern classification and feature selection," Fuzzy Sets and Systems, vol. 130, no. 1, pp. 101-108,

[34] L. Zhu, F.-L. Chung, and S. Wang, Generalized fuzzy c-means clustering algorithm with improved fuzzy partitions," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 3, pp. 578-591, 2009.

[35] Y. Karali, D. Kodamasingh, and R. L. H. Behera, Hard and fuzzy clustering algorithms using normal distribution of data points: a comparative performance analysis," International Journal of Engineering Research and Technology, vol. 2, no. 10, October 2013.

[36] Enron email dataset," https://www.cs.cmu.edu/~enron/ [Accessed on: May 2015].

[37] W. Pedrycz and H. Izakian, Cluster-centric fuzzy modeling," IEEE Transactions on Fuzzy Systems, vol. 22, no. 6, pp. 1585- 1597, December 2014.

[38] M. F. Porter, An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.

[39] C. Silva and B. Ribeiro, The importance of stop word removal on recall values in text categorization," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 3, 2003, pp. 1661-1666.

[40] M. Ramaswami and R. Bhaskaran, A study on feature selection techniques in educational data mining," journal of computing, vol. 1, no. 1, pp. 7-11, December 2009.

[41] S. Beniwal and J. Arora, Classification and feature selection techniques in data mining," International Journal of Engineering Research & Technology (IJERT), vol. 1, no. 6, pp. 1-6, 2012.

[42] A. K. Murugesan and B. J. Zhang, A new term weighting scheme for document clustering," in 7th International Conference on Data.

[43] J. W. Reed, Y. Jiao, T. E. Potok, B. Klump, M. T. Elmore, A. R. Hurson et al., Tf icf: A new term weighting scheme for clustering dynamic data streams," in proceedings of 5th IEEE International Conference on Machine Learning and Applications, 2006.

[44] G. Salton, A. Wong, and C.-S. Yang, A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.