Comparison Between K-Means and Genetic Algorithm in Text Document Clustering

DOI : 10.17577/IJERTCONV3IS14012


Divyashree G

Assistant Professor

Dept. of Information Science & Engineering, Sapthagiri College of Engineering, Bangalore, India

Gayathri Rayar

Assistant Professor

Dept. of Information Science & Engineering, Sapthagiri College of Engineering, Bangalore, India

Abstract: In today's IT-driven business practice, a particular object is often described by many different data sources, and the resulting document collections are large. A text document clustering method based on text mining concepts can help to analyze and monitor such data sources. The prime objective in optimizing clustering algorithms is to achieve high intra-cluster similarity (documents within a cluster should be similar) and low inter-cluster similarity (documents from different clusters should be dissimilar). Clustering is a core technology in applications based on machine learning, pattern recognition, image analysis and information retrieval. In the proposed work, the existing approach of K-means clustering with the cosine similarity, Jaccard and Pearson coefficients is optimized using a genetic algorithm. Performance metrics such as purity, entropy and F-measure are evaluated for both the K-means clustering algorithm and the genetic algorithm, and the final result is expected to possess a higher purity score, a lower entropy score and a maximized F-measure value.

Keywords – Genetic Algorithm, K-Means Clustering Algorithm, Similarity Measures, Cosine Similarity, Jaccard Coefficient, Pearson Coefficient.

  1. INTRODUCTION

Data mining is also known as Knowledge Discovery in Databases (KDD). Data mining is the process of analyzing large databases to find patterns that are valid, useful, and understandable. Valid means the pattern holds on new data with some certainty; useful means it is possible to act on the pattern; and understandable means humans should be able to read and interpret the pattern. Data mining works with large volumes of data and draws on machine learning, statistics, artificial intelligence, databases and visualization.

Text mining is a branch of data mining whose aim is to extract high-quality information from text. The extraction of high-quality information is typically done through statistical pattern learning. Text mining includes information retrieval, lexical analysis, pattern recognition, information extraction, data mining techniques, association analysis, visualization, and predictive analytics.

Cluster analysis, or clustering, is the process of grouping a set of objects in such a way that similar objects fall in the same cluster and dissimilar objects fall in different clusters. It is the main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Document clustering, also known as text clustering, is a technique used to group documents automatically. It is useful for organizing documents through indexing, which helps sorting and quick searching.

    Example:

• "Apple" can be clustered into the category of fruits as well as that of mobile companies.

• An email received by a company, with a subject line containing "problem", can be parsed into a separate folder to be addressed by customer care.

Document clustering serves document organization, extraction of terms and fast information retrieval or filtering. Document clustering uses the concepts of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within a cluster. Clustering is an example of unsupervised classification [1]. Classification assigns data objects to a set of classes by some procedure; unsupervised means that clustering does not depend on predefined classes or labeled training examples while classifying the data objects. Clustering is a crucial area of research, which finds applications in many fields including bioinformatics, pattern recognition, image processing, marketing, data mining, economics, etc. An example of document clustering is web document clustering for data searching by users. Applications of document clustering can be classified into two types, online and offline; online applications are constrained by efficiency problems when compared with offline applications.

Document clustering is a challenging task; it has been studied for many decades but is still far from being a trivial or solved problem. Some of the challenges are:

    1. Selecting appropriate features of the documents that should be used for clustering.

    2. Selecting an appropriate similarity measure between the documents.

    3. Selecting an appropriate clustering method utilizing the above similarity measure.

    4. Implementing the clustering algorithm in an efficient way that makes it feasible in terms of required memory and CPU resources.

    5. Finding ways of assessing the quality of the performed clustering.

Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. However, the bag-of-words representation used by these clustering methods is often unsatisfactory, as it ignores relationships between important terms that do not co-occur literally. Moreover, in spite of a long tradition of research in similarity-based text document retrieval, there is no single clustering method that works best for this purpose. The reason stems from the fact that text document clustering must simultaneously deal with quite a number of problems [3]:

    1. Problem of efficiency: Text document clustering must be efficient because it should be able to do clustering on ad-hoc collections of documents, e.g. ones found by a search engine through keyword search.

    2. Problem of effectiveness: Text document clustering must be effective, i.e., it should relate documents that talk about the same or a similar domain.

    3. Problem of explanatory power: Text document clustering should be able to explain to the user why a particular result was constructed. Lack of understandability may pose a much bigger threat to the success of an application that employs text document clustering than a few percentage points decrease in accuracy.

4. Problem of user interaction and subjectivity: Applications that employ text document clustering must be able to involve the user. The results should focus the user's attention on particularly relevant subjects. For example, a search for "health" might turn up food-related issues that a user might want to explore in more detail, such as fruits, meat, vegetables and others.

The rest of the paper is organized as follows. The Related Work section reviews current knowledge related to text document clustering as well as theoretical and methodological contributions to clustering methods. The Overview of the Proposed System describes the methodology and architecture, including the document pre-processing module, the similarity measure module, the local document clustering module based on the K-means algorithm, the global document clustering module based on the genetic algorithm, and the evaluation metric module covering purity, entropy and F-measure. The Results and Discussions section tabulates and analyzes the obtained results, followed by the Conclusion and Future Enhancement sections.

  2. RELATED WORK

      1. Background

Cluster analysis is one of the basic data analysis tools in data mining. It can be used as a standalone tool to inspect the data distribution, or as a preprocessing step for other data mining algorithms that operate on the resulting clusters. Clustering algorithms are used to organize and categorize data, and for data compression and model construction. Clustering is a huge area of research, which finds applications in many fields including bioinformatics, pattern recognition, image processing, marketing, data mining, economics, etc.

      2. Clustering algorithms

Clustering groups a set of objects and finds the relationships between them. Clustering algorithms together with similarity measures are therefore used for text document clustering. Data clustering algorithms can be mainly classified into the following categories [2]:

        1. Partition algorithms

A partition clustering algorithm splits the data set into the desired number of subsets in a single step. A partition clustering algorithm [13] splits the data points into k partitions, where each partition represents a cluster; the partitioning is done based on a certain objective function. Partition approaches require selecting a value for the desired number of clusters to be generated. Examples of partition clustering methods are k-means, its variant bisecting k-means, and k-medoids.

        2. Hierarchical algorithms

A hierarchical clustering algorithm divides the given data set into smaller subsets in a hierarchical fashion. A hierarchical method [13] creates a hierarchical decomposition of the given set of data objects: a tree of clusters, called a dendrogram, is built. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. In hierarchical clustering we first assign each item to its own cluster, so that N items give N clusters. We then find the closest pair of clusters and merge them into a single cluster, compute the distances between the new cluster and each of the old clusters, and repeat these steps until all items are grouped into K clusters. It is of two types:
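As a quick illustration of the merge procedure just described, the following minimal Python sketch (assuming SciPy is available and that documents are already vectorized, e.g. as TF-IDF rows; the toy vectors are illustrative, not from the paper) builds the dendrogram and cuts it into K clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy document vectors (illustrative TF-IDF rows).
X = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9],
              [0.1, 0.0, 0.8]])

# Bottom-up merging: every point starts as its own cluster and the
# closest pair of clusters is merged repeatedly (average linkage).
Z = linkage(X, method='average', metric='cosine')

# Cut the dendrogram so that exactly K = 2 clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 2 2]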

          1. Agglomerative (bottom up)

Agglomerative hierarchical clustering is a bottom-up method where clusters have sub-clusters, which in turn have sub-clusters, etc. It starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all the objects are in a single cluster or a certain termination condition is satisfied. The single cluster becomes the hierarchy's root. For the merging step, it finds the two clusters that are closest to each other and combines them to form one cluster.

          2. Divisive (top down)

A top-down clustering method works like the agglomerative algorithm in reverse, and is less commonly used. It starts with a single cluster containing all the data and successively splits the resulting clusters until only clusters of individual objects remain.

The K-means algorithm has the big advantage of handling large data sets, and its performance improves as the number of clusters increases; however, its use is limited to numeric values. Agglomerative and divisive hierarchical algorithms were therefore adopted for categorical data, but due to their complexity a new approach can be used in which each categorical attribute is assigned a rank value, converting the categorical data to numeric so that K-means applies. Hence the performance of the K-means algorithm is better than that of the hierarchical clustering algorithms.

        3. Density-based methods

Density-based clustering methods can group data objects into clusters of arbitrary shape. Clustering is done based on density criteria, such as density-connected points or an explicitly constructed density function. Popular density-based clustering methods are DBSCAN and its extension OPTICS, and DENCLUE.

        4. Grid-based methods

Grid-based clustering methods use a multiresolution grid structure to cluster the data objects. The benefit of this method is its fast processing time. Examples include STING and WaveCluster.

        5. Model-based methods

Model-based methods hypothesize a model for each cluster and determine the fit of the data to the given model. They can also be used to automatically determine the number of clusters. Expectation-Maximization, COBWEB and SOM (Self-Organizing Map) are typical examples of model-based methods.

      3. Clustering applications

Clustering is a major tool in a number of applications in many fields of business and science. The following points summarize the basic directions in which clustering is used.

1. Finding Similar Documents: This feature is often used when the user has spotted one good document in a search result and wants more like it. The interesting property here is that clustering is able to discover documents that are conceptually alike, in contrast to search-based approaches that are only able to discover whether the documents share many of the same words.

        2. Organizing Large Document Collections: Document retrieval focuses on finding documents relevant to a particular query, but it fails to solve the problem of making sense of a large number of uncategorized documents. The challenge here is to organize these documents in a taxonomy identical to the one humans would create given enough time and use it as a browsing interface to the original collection of documents.

        3. Duplicate Content Detection: In many applications there is a need to find duplicates or near-duplicates in a large number of documents. Clustering is employed for plagiarism detection, grouping of related news stories and to reorder search results rankings (to assure higher diversity among the topmost documents). Note that in such applications the description of clusters is rarely needed.

4. Recommendation System: In this application a user is recommended articles based on the articles the user has already read. Clustering of the articles makes this possible in real time and improves the quality considerably.

5. Search Optimization: Clustering helps a lot in improving the quality and efficiency of search engines, as the user query can first be compared to the clusters instead of directly to the documents, and the search results can also be arranged easily.

4. Prior studies

This section summarizes a few of the related works, in terms of the problem addressed, the approach taken and the findings.

Mohit Sharma and Pranjal Singh, Text Document Clustering and Similarity Measures [16]: Clustering is an important technique that organizes a large number of objects into a small number of coherent groups, leading to efficient and effective use of these documents for information retrieval. Clustering algorithms require a similarity metric to identify how closely two different documents are related. This difference is often measured by some distance measure such as cosine similarity, Jaccard and others. In this work, five well-known distance measures are used and their performance is compared on datasets using the k-means clustering algorithm. However, the work lacks efficient feature selection and term representation.

Anna Huang, Similarity Measures for Text Document Clustering [14]: Clustering is a useful technique that groups a large quantity of unordered text documents into a small number of meaningful and coherent clusters. Partitional clustering algorithms have been identified as more suitable than hierarchical clustering schemes for clustering large datasets. Different types of similarity measures have been used for clustering data, such as the Euclidean distance, cosine similarity, and relative entropy. The paper compares different similarity measures and analyzes their effectiveness in partitional clustering of text document datasets. The work uses the standard K-means clustering algorithm and reports results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering. In the observed work there are three components that affect the final results: representation of the terms, the distance or similarity measure, and the clustering algorithm itself.

Sanjivani Tushar Deokar, Text Documents Clustering using K Means Algorithm [15]: The paper discusses an implementation of the K-means clustering algorithm for clustering unstructured text documents, from the preprocessing of unstructured text through to the resulting set of clusters. The K-means algorithm is applied to text clustering in a straightforward way: the preprocessing steps compute TF-IDF values, cosine similarity is used as the similarity measure, and the algorithm operates on a set of points in an m-dimensional vector space. It is easier and less time consuming to find documents in a large collection when the collection is ordered or classified by group or category. The author notes that the work can be further improved by using a better document similarity measure, which would ultimately provide better clusters for a given set of documents.

      5. Gaps identified

• The work has a lack of efficient feature selection and term representation [16].

• In text document clustering there are three components that affect the final results: representation of the terms, the distance or similarity measure, and the clustering algorithm itself [14].

• The document clustering work using the k-means algorithm can be further improved by using a better document similarity measure, which would ultimately provide better clusters for a given set of documents; the performance of the algorithm can also be improved by blending it with a standard optimization technique [15].

  3. OVERVIEW OF THE PROPOSED SYSTEM

    1. Methodology of the proposed system

Figure 1.3 shows the overall methodology of the proposed efficient document clustering system on a centralized system.

The different steps in the proposed methodology are the following:

• Only standard text documents are selected and uploaded to each of the systems for clustering.

• The collected text documents are pre-processed in each system using techniques such as stop-word removal and stemming. Stop words are non-descriptive words such as "a", "and", "are", "then", "what", "is", "the", "do", etc., and are removed from each document. Stemming maps words with different endings onto a single word; e.g. "production", "produce", "product" and "produces" are mapped to the stem "produc", reducing the appearance of the same word in different forms. After these pre-processing steps, the data are ready for the similarity measure calculation.

• Initially, the term frequency (TF) and inverse document frequency (IDF) are calculated for each document. Term frequency counts the occurrences of a word in the document, and inverse document frequency weights the word by how important it is across the document collection.

• The similarity measures, namely cosine similarity, the Jaccard coefficient and the Pearson correlation coefficient, are applied to the pre-processed document collection in each system.

• The K-means clustering algorithm is applied to the calculated similarity values at each local system. The K-means algorithm produces the clusters based on the similarity values and sends the clustered data to the global system.

• The clustered data are received at the global system, where the genetic algorithm is applied for document clustering. The genetic algorithm performs its step-by-step process to cluster the text documents.

• Performance factors/metrics such as purity, entropy and F-measure are used to analyze the proposed document clustering. The quality of the clusters is measured by purity and entropy, whereas the accuracy of the clusters is measured by the F-measure.

        Figure 1.3: Flow Diagram for the document clustering

    2. Architecture of the system

In the proposed system architecture, the collected text documents are first subjected to pre-processing, which usually includes removal of stop words, stemming of words, and a filtering process. The pre-processed text documents are then used to calculate the similarity measures: cosine similarity, the Jaccard coefficient and the Pearson correlation coefficient. The clustering design places the K-means algorithm at the local systems and the genetic algorithm at the global system. Finally, after document clustering, the performance parameters purity, entropy and F-measure are extracted to assess the efficiency of the proposed architecture.

    The architecture of the proposed framework for efficient document clustering on centralized system is as shown in Figure 3.2

    Figure 3.2: Architecture of the proposed system

The architecture of the document clustering consists of the following five modules:

    1. Document Pre-processing Module

    2. Similarity Measures Module

    3. Local document clustering Module

    4. Global document clustering Module

    5. Evaluation Metric Module

      1. Document Pre-processing Module

        The text document pre-processing can be done by the following process shown in the figure 4.1.

Text document collection: selecting and accessing the data from the system to perform document clustering. The collected documents should be in .txt format.

Text document preprocessing: initially the collected text documents contain a great many elements, or words. Preprocessing reduces the document contents.

        Figure 4.1: Text document pre-processing

Text document collection includes processing such as indexing and filtering: the documents that need to be clustered are collected, indexed so that they can be stored and retrieved efficiently, and filtered to remove extra data, for example stop words.

Removal of Stop Words: Stop words are words that are non-descriptive for the topic of a document, such as "a", "and", "are", "then", "what", "is", "the" and "do". They are frequently occurring words that are not searchable, and removing them improves the speed and memory consumption of the application. There are standard stop-word lists available, but in most applications these are modified depending on the quality of the dataset. A minimal sketch of stop-word removal is shown below.
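The following small Python sketch illustrates the step (the tiny stop list here is illustrative; a real application would use a fuller, dataset-specific list):

# Illustrative stop-word list; real applications use larger, tuned lists.
STOP_WORDS = {"a", "and", "are", "then", "what", "is", "the", "do"}

def remove_stop_words(text):
    # Lowercase, split on whitespace, and drop the non-descriptive words.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("What is the cluster and what does it do"))
# -> ['cluster', 'does', 'it']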

Stemming the Words: Stemming is the process of reducing words to their stem or root form, so that words with different endings are mapped onto a single word. For example, "cook", "cooking" and "cooked" are all forms of the same word used in different contexts, but for measuring similarity they should be considered the same; likewise "production", "produce", "product" and "produces" are mapped to the stem "produc".

Preprocessed Data: Overall, preprocessing consists of steps that take as input a plain text document and output a set of tokens to be included in the vector model.

After pre-processing, the pre-processed data is used to calculate TF-IDF (term frequency-inverse document frequency), a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

TF: Term frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

        TF (t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF: Inverse document frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus the frequent terms must be weighed down while the rare ones are scaled up, by computing the following:

IDF (t) = ln (Total number of documents / Number of documents with term t in it)
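Putting the two formulas together, a small self-contained Python sketch of the TF-IDF weight of a term follows (the toy corpus is assumed for illustration; in practice a library routine such as scikit-learn's TfidfVectorizer would normally be used):

import math

def tf(term, doc_tokens):
    # TF(t) = occurrences of t in the document / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # IDF(t) = ln(total documents / documents containing t)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)  # assumes term occurs somewhere

corpus = [["produc", "cluster"], ["cluster", "data"], ["data", "mine"]]
print(tf("cluster", corpus[0]) * idf("cluster", corpus))  # TF-IDF weight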

      2. Similarity Measures Module

A similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects: it gives the degree to which two objects are close to or separated from each other. A variety of similarity or distance measures have been proposed and widely applied; this module calculates the cosine similarity, the Jaccard coefficient and the Pearson correlation coefficient.

    1. Cosine Similarity

The similarity of two documents corresponds to the correlation between their vectors, where the documents are represented as term vectors. This is quantified as the cosine of the angle between the vectors, which is called the cosine similarity. An important property of the cosine similarity is its independence of document length. The result of the cosine similarity lies between 0 and 1: if the cosine similarity between two documents is 1 the documents are similar, and if it is 0 the documents are not similar. The cosine similarity is calculated as shown in the equation

Cosine (A, B) = (A . B) / (|A| |B|)

Where, A - Term count vector of document 1

B - Term count vector of document 2
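A direct NumPy rendering of the formula above, where A and B are the term (e.g. TF-IDF) vectors of the two documents (toy values assumed):

import numpy as np

def cosine_similarity(a, b):
    # cos(A, B) = (A . B) / (|A| * |B|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([0.5, 0.2, 0.0])  # term vector of document 1 (toy values)
B = np.array([0.4, 0.3, 0.1])  # term vector of document 2 (toy values)
print(cosine_similarity(A, B))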

    2. Jaccard Coefficient

The Jaccard coefficient, also known as the Tanimoto coefficient, measures similarity as the intersection divided by the union of the objects. For text documents, the Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared. The value of the Jaccard coefficient lies in the range 0 to 1: the value 1 means the documents are similar and the value 0 means they are different. The formal definition is given by the equation

Jaccard (A, B) = (A . B) / (|A|² + |B|² - A . B)

Where, A - Term count vector of document 1

B - Term count vector of document 2
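The weighted (extended) Jaccard/Tanimoto coefficient described above can be sketched in NumPy as follows (same toy term vectors as before):

import numpy as np

def jaccard_coefficient(a, b):
    # Shared weight divided by the weight present in either document
    # but not shared: (A . B) / (|A|^2 + |B|^2 - A . B)
    dot = np.dot(a, b)
    return dot / (np.dot(a, a) + np.dot(b, b) - dot)

A = np.array([0.5, 0.2, 0.0])
B = np.array([0.4, 0.3, 0.1])
print(jaccard_coefficient(A, B))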

    3. Pearson correlation coefficient

Pearson's correlation coefficient is another measure of the extent to which two vectors are related. There are different forms of the Pearson correlation coefficient formula; the value of this measure lies between -1 and +1, and it equals 1 when the term counts of the two documents are perfectly linearly related. Given the term set T = {t1, . . . , tm}, a commonly used form is the equation

r = Σ (xi - x̄)(yi - ȳ) / sqrt( Σ (xi - x̄)² · Σ (yi - ȳ)² )

Where, xi - Term count value of document 1

yi - Term count value of document 2

x̄ - Mean term count value of document 1

ȳ - Mean term count value of document 2
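The same formula written out in NumPy (equivalent to np.corrcoef for two vectors; the term-count vectors are toy values):

import numpy as np

def pearson_correlation(x, y):
    # r = sum((xi - x_mean)(yi - y_mean)) /
    #     sqrt(sum((xi - x_mean)^2) * sum((yi - y_mean)^2))
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

x = np.array([3.0, 1.0, 0.0, 2.0])  # term counts of document 1 (toy values)
y = np.array([2.0, 1.0, 1.0, 2.0])  # term counts of document 2 (toy values)
print(pearson_correlation(x, y))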

3. Local document clustering Module

The local text document clustering is done using the K-means algorithm. K-means is a simple yet very powerful algorithm for clustering data, and a great deal of research has been done around it because it provides fast and reliable solutions for most practical applications. The idea is to put each data point into the cluster whose mean is at the smallest distance from that point.

The standard K-means algorithm works as follows. For a given set of data objects D and a predefined number of clusters k, k data objects are selected randomly to initialize k clusters, each being the centroid of a cluster. The remaining objects are then assigned to the cluster represented by the nearest or most similar centroid. Next, a new centroid is recomputed for each cluster, and in turn all documents are reassigned based on the new centroids. This step iterates until a final and fixed solution is reached, where all data objects remain in the same cluster after an update of the centroid values.

Some of the properties of the K-means algorithm are:

• There should always be K clusters.

• Each cluster should contain at least one item.

• The clusters are non-hierarchical and they do not overlap each other.

• Every member of a cluster is closer to its cluster than to any other cluster (closeness does not always involve the 'center' of the clusters).

The K-means algorithm for text document data is given below. Each cluster's center is represented by the mean value of the objects in the cluster.

INPUT: K – the number of clusters

D – the set of document similarity values

OUTPUT: a set of K clusters

    Begin

1. Choose k objects from D as the initial cluster centers.

2. Assign each object to the cluster with the closest centroid.

3. Calculate the mean value of the objects for each cluster and update the centroid.

4. Repeat steps 2 and 3 until there is no change in the cluster centers.

    End.
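A compact Python sketch of the four-step loop above (plain NumPy with Euclidean distance on the document vectors; an illustrative rendering, not the paper's exact implementation):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each object to the closest centroid.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # Step 3: recompute each cluster's mean and update the centroid
        # (an empty cluster keeps its previous center).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: stop when the cluster centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers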

    4. Global document clustering Module



The global text document clustering is performed using the genetic algorithm. The main objective of the genetic algorithm is to improve the clustering of the text documents by providing high intra-cluster similarity and low inter-cluster similarity.

The genetic algorithm, developed by Goldberg, was inspired by Darwin's theory of evolution, which states that the survival of an organism is governed by the rule "the strongest species survives". Darwin also stated that the survival of an organism can be maintained through the processes of reproduction, crossover and mutation. This concept of evolution is adapted into a computational algorithm that searches, in a natural fashion, for a solution to a problem defined by an objective function. A solution generated by the genetic algorithm is called a chromosome, while a collection of chromosomes is referred to as a population. A chromosome is composed of genes, whose values can be numerical, binary, symbols or characters depending on the problem to be solved. The chromosomes are evaluated by a fitness function that measures the suitability of the solution they encode. Some chromosomes in the population mate through a process called crossover, producing new chromosomes, named offspring, whose genes are a combination of their parents'. In each generation, a few chromosomes also undergo mutation in their genes. The number of chromosomes that undergo crossover and mutation is controlled by the crossover rate and mutation rate values. Chromosomes maintained for the next generation are selected according to the Darwinian evolution rule: a chromosome with a higher fitness value has a greater probability of being selected again in the next generation. After several generations, the chromosome values converge to a certain value, which is the best solution for the problem. The genetic algorithm is used in various application areas.

The operations performed in the genetic algorithm are selection, crossover and mutation.

1. Selection: The candidate individuals are chosen from the population in the current generation based on their fitness. The individuals with higher fitness values are more likely to be selected into the population of the next generation.

2. Crossover: Crossover is a genetic operator that combines (mates) two chromosomes (parents) to produce a new chromosome (offspring). The idea behind crossover is that the new chromosome may be better than both of the parents if it takes the best characteristics from each of them. Crossover occurs during evolution according to a user-definable crossover probability. Consider the following two parents which have been selected for crossover; the | symbol indicates the randomly chosen crossover point.

Parent 1: 11001|010

Parent 2: 00100|111

    After interchanging the parent chromosomes at the crossover point, the following offspring are produced:

Offspring 1: 11001|111

Offspring 2: 00100|010
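The single-point crossover illustrated above, written out in Python (the crossover point is chosen at random, so the cut shown in the example is just one possibility):

import random

def crossover(parent1, parent2):
    # Pick a random cut point and swap the tails of the two parents.
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

print(crossover("11001010", "00100111"))
# With the cut after bit 5: ('11001111', '00100010')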

3. Mutation: The mutation operator is applied to each bit of an individual with a probability equal to the mutation rate. After mutation, a bit that was 0 changes to 1 and vice versa, so the solution encoded by the chromosome changes slightly.

Individual before mutation: 0 1 1 1 0 0 1 1 0 1 0

Individual after mutation: 0 1 1 0 0 0 1 1 0 1 0

1. Algorithm

The genetic algorithm process is as follows.

INPUT: K-means clustered data

OUTPUT: clusters of documents

Begin

1. Determine the number of chromosomes and generations, and the mutation rate and crossover rate values.

2. Generate the chromosomes and initialize them with random values.

3. Process steps 4-7 until the number of generations is met.

4. Evaluate the fitness value of the chromosomes by calculating the objective function.

5. Chromosome selection.

6. Crossover.

7. Mutation.

8. New chromosomes (offspring).

9. Solution (best chromosomes).

End
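A skeletal Python rendering of steps 1-9 (the population size, rates and fitness function are illustrative assumptions, not the paper's exact settings; the fitness function would score a chromosome's clustering by intra- and inter-cluster similarity, and must return positive values for the roulette-wheel selection used here):

import random

def genetic_algorithm(fitness, chrom_len, pop_size=20, generations=50,
                      crossover_rate=0.8, mutation_rate=0.01):
    # Steps 1-2: fix the parameters and create a random initial population.
    pop = [[random.randint(0, 1) for _ in range(chrom_len)]
           for _ in range(pop_size)]
    for _ in range(generations):                      # step 3
        scores = [fitness(c) for c in pop]            # step 4: objective function
        # Step 5: fitness-proportional (roulette-wheel) selection;
        # copy each pick so later mutation does not touch aliases.
        pop = [list(c) for c in random.choices(pop, weights=scores, k=pop_size)]
        # Step 6: single-point crossover on consecutive pairs.
        for i in range(0, pop_size - 1, 2):
            if random.random() < crossover_rate:
                p = random.randint(1, chrom_len - 1)
                pop[i][p:], pop[i + 1][p:] = pop[i + 1][p:], pop[i][p:]
        # Steps 7-8: bit-flip mutation produces the new offspring.
        for c in pop:
            for j in range(chrom_len):
                if random.random() < mutation_rate:
                    c[j] = 1 - c[j]
    # Step 9: return the best chromosome found.
    return max(pop, key=fitness)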

      5. Evaluation Metric Module

In order to check the quality and accuracy of the clustering algorithms, the proposed system uses the metrics purity, entropy and F-measure. The purity and entropy measures are used to calculate the quality of the clusters, whereas the F-measure is used to check the accuracy of the clustering operations. These evaluation metrics are explained below.

1. Purity: The purity metric evaluates the consistency of a cluster, that is, the degree to which a cluster contains documents from a single category. The purity value lies in the range 0 to 1; the higher the purity value, the better the quality of the clusters, and a purity value of one indicates an ideal cluster containing documents from a single category only. The formal definition of purity is given below in equation 8.1

P(Cj) = (1 / nj) max_h (nj^h)     (8.1)

Where, max_h (nj^h) is the number of documents from the dominant category in cluster Cj, nj^h represents the number of documents from cluster Cj assigned to category h, and nj is the size of cluster Cj.



2. Entropy: In general, entropy is a measure of the number of specific ways in which a system can be arranged. Here it evaluates the distribution of categories in a given cluster. The entropy results lie between 0 and 1; the smaller the entropy value, the better the quality of the clusters. The mathematical formula for the entropy is given below in equation 8.2

E(Ci) = - Σh (ni^h / ni) log (ni^h / ni)     (8.2)

Where, ni^h represents the number of documents in cluster Ci assigned to category h, and ni represents the size of the cluster.

3. F-Measure: The F-measure is a combined value of precision and recall. Precision and recall are computed for each class, and their weighted average gives the value of the F-measure. The value of this metric also lies between 0 and 1; the higher the F-measure, the higher the accuracy. It is calculated as shown in equation 8.3

F-Measure = 2 * Precision * Recall / (Precision + Recall)     (8.3)

Precision and recall are the basic measures used in computing the F-measure evaluation metric. Both precision and recall are applied to the collected documents, where each document is judged either relevant or irrelevant; that is, they measure the degree of relevancy.

Precision is the ratio of the number of relevant text documents retrieved to the total number of documents retrieved (relevant and irrelevant). It is usually expressed as shown in equation 8.4

Precision = A / (A + C)     (8.4)

Where, A – number of relevant documents retrieved

C – number of irrelevant documents retrieved

Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the database. It is usually expressed as shown in equation 8.5

Recall = A / (A + B)     (8.5)

Where, A – number of relevant documents retrieved

B – number of relevant documents that were not retrieved
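A Python sketch of the three metrics from equations 8.1-8.5 (purity is computed here in its size-weighted overall form; the cluster labels, category labels and A/B/C counts are toy data):

import numpy as np
from collections import Counter

def purity(clusters, classes):
    # Size-weighted purity: dominant-category counts summed over clusters.
    total = sum(max(Counter(classes[clusters == j]).values())
                for j in np.unique(clusters))
    return total / len(classes)

def entropy(clusters, classes):
    # Size-weighted average of per-cluster category entropies (lower is better).
    H, n = 0.0, len(classes)
    for j in np.unique(clusters):
        counts = np.array(list(Counter(classes[clusters == j]).values()))
        p = counts / counts.sum()
        H += (counts.sum() / n) * -(p * np.log2(p)).sum()
    return H

def f_measure(A, B, C):
    # A: relevant retrieved, B: relevant not retrieved, C: irrelevant retrieved.
    precision = A / (A + C)   # equation 8.4
    recall = A / (A + B)      # equation 8.5
    return 2 * precision * recall / (precision + recall)  # equation 8.3

clusters = np.array([0, 0, 1, 1])          # toy cluster labels
classes = np.array(["x", "x", "y", "x"])   # toy true categories
print(purity(clusters, classes), entropy(clusters, classes), f_measure(8, 2, 4))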

4. RESULTS AND DISCUSSIONS

The proposed framework for efficient document clustering on a centralized system uses two clustering algorithms. At the local systems the proposed work uses the K-means clustering algorithm, and at the central global system it uses the genetic algorithm. The K-means algorithm works with three similarity measures: cosine similarity, the Jaccard coefficient and the Pearson correlation coefficient. Three performance metrics are used: purity, entropy and F-measure. Among them, purity and entropy evaluate the overall quality of the clusters, and the F-measure measures the accuracy of the clusters.

From Table 9.1, the conclusion of the proposed system is that, among the three similarity measures, the Jaccard and Pearson correlation coefficient measures generate more coherent cluster results than the cosine similarity measure.

Between the two clustering algorithms, K-means and the genetic algorithm, the genetic algorithm produces better clusters than K-means.

    Table 9.1: Final evaluated results of K-means and genetic algorithm


Figure 9.1 shows the evaluation metrics purity, entropy and F-measure calculated for both the K-means clustering algorithm and the genetic algorithm. The graph shows that the genetic algorithm has higher purity and F-measure values and a lower entropy value, so the genetic algorithm produces better cluster results than the K-means clustering algorithm.


    Figure 9.1: Graph showing the final results of K-means and genetic algorithm

  5. CONCLUSION

The proposed framework performs efficient document clustering on text documents using two clustering algorithms, K-means and the genetic clustering algorithm. The K-means algorithm was implemented with the cosine similarity, Jaccard coefficient and Pearson correlation coefficient similarity metrics. The correctness of the algorithm was checked on 10 text documents.

The genetic algorithm is used as the clustering algorithm for the text documents, and its clustering results are analyzed to determine which is the best similarity measure.

Using 100 text documents, the effectiveness of the measures was evaluated and analyzed. The Jaccard coefficient and Pearson correlation coefficient measures generate more coherent clusters than the cosine similarity measure, and the genetic algorithm produces better cluster results than the K-means clustering algorithm.

  6. FUTURE ENHANCEMENT

With the rapid growth of the IT environment, a large number of documents has to be maintained. The clustering could therefore be performed on larger data sets as well as on different kinds of data sets.

The proposed system works only for text documents in the .txt format; the system could be extended to work for images as well.

  7. REFERENCES

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, University of Illinois at Urbana-Champaign.

2. Manjot Kaur, Navjot Kaur, "Web Document Clustering Approaches Using K-Means Algorithm", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 5, May 2013.

3. Andreas Hotho, Steffen Staab, Gerd Stumme, "Text Clustering Based on Background Knowledge", Technical Report, volume 425, University of Karlsruhe, Institute AIFB, 2003.

4. M. Eisenhardt, W. Muller, and A. Henrich, "Classifying Documents by Distributed P2P Clustering", in INFORMATIK, 2003.

5. S. Datta, C. R. Giannella, and H. Kargupta, "K-Means Clustering over a Large, Dynamic Network", Proc. SIAM Int'l Conf. Data Mining (SDM), 2006.

6. M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques", Proc. KDD Workshop on Text Mining, 2000.

7. G. Forman and B. Zhang, "Distributed Data Clustering Can Be Efficient and Exact", SIGKDD Explorations Newsletter, vol. 2, no. 2, pp. 34-38, 2000.

8. S. Datta, K. Bhaduri, C. R. Giannella, R. Wolff and H. Kargupta, "Distributed Data Mining in Peer-to-Peer Networks", IEEE Internet Computing, vol. 10, no. 4, pp. 18-26, July 2006.

9. I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications", Proc. SIGCOMM, 2001.

10. Neethi Narayanan, J. E. Judith, J. Jayakumari, "Enhanced Distributed Document Clustering Algorithm Using Different Similarity Measures", Proc. 2013 IEEE Conf. on ICT, 2013.

11. Denny Hermawanto, "Genetic Algorithm for Solving Simple Mathematical Equality Problem", Indonesian Institute of Sciences (LIPI), Indonesia.

12. A. K. Santra, C. Josephine Christy, "Algorithm and Confusion Matrix for Document Clustering", CARE School of Computer Applications, India.

13. Aastha Joshi, Rajneet Kaur, "Comparative Study of Various Clustering Techniques in Data Mining", Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India.

14. Anna Huang, "Similarity Measures for Text Document Clustering", The University of Waikato, Hamilton, New Zealand, April 2008.

15. Sanjivani Tushar Deokar, "Text Documents Clustering using K Means Algorithm", International Journal of Technology and Engineering Science [IJTES], July 2013.

16. Mohit Sharma and Pranjal Singh, "Text Document Clustering and Similarity Measures", IIT Kanpur, India, November 19, 2013.
