 Open Access
 Authors : Rachit Agarwal , Sayali Gawade , Priyanka Pirale , Deepti Dighe
 Paper ID : IJERTV7IS030184
 Volume & Issue : Volume 07, Issue 03 (March 2018)
Published (First Online): 26-03-2018
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Clustering of Web Search Result using Metaheuristic Algorithm
Rachit Agarwal
Information Technology
Shah and Anchor Kutchhi Engineering College, Mumbai, India
Sayali Gawade
Information Technology
Shah and Anchor Kutchhi Engineering College, Mumbai, India
Abstract: This paper attempts to explain clustering with k-means more concretely, using actual values. Clustering is an integral part of the data mining domain. The paper presents the use of k-means to form clusters of documents, with the AMBIENT dataset used for the experiments. The Balanced Bayesian Algorithm is used to calculate the probability that decides in which cluster a document should be placed.
The idea of our project is to create nests, or clusters, of the items returned for a user's search. For example, if the user searches for the word ORANGE, the results that appear in the web browser may refer to a fruit, a color, or even a data mining tool. It is not worthwhile for the user to spend time visiting every link to reach the required result. Hence, we apply clustering to segregate the search results into clusters for easy navigation: orange as a color is stored in one cluster, whereas orange as a fruit is stored in another. This saves the user's time and makes searching more efficient. Comparing results also becomes much simpler once the clusters are formed.
Key Words: K-means, Balanced Bayesian, Web search result clustering, Metaheuristic algorithm
Deepti Dighe
Information Technology
Shah and Anchor Kutchhi Engineering College, Mumbai, India
Priyanka Pirale
Information Technology
Shah and Anchor Kutchhi Engineering College, Mumbai, India
INTRODUCTION:
Data clustering is the process of grouping data elements so that the elements in a given group are similar to each other in some respect.
To obtain good results in web document clustering the algorithms must meet the following specific requirements [3]:

automatically define the number of clusters that are going to be created;

generate relevant clusters for the user and assign the documents to appropriate clusters;

define labels or names for the clusters that are easily understood by users;

handle overlapping clusters (documents can belong to more than one cluster);

reduce the high dimension of document collections;

keep the processing time low, i.e. less than or equal to 2 s; and

handle the noise that is frequently found in documents.

Balanced Bayesian Algorithm
In the data balancing problem the goal is to update a probability distribution, under the guiding principle that the best inference is the one which takes into account all available information and no other. This principle is operationalized by searching for a posterior distribution that is as close as possible to the prior (in an information sense) and that satisfies the accounting identities, expressed in terms of moment constraints [4].
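The principle above can be illustrated with a minimal sketch, assuming a discrete prior and a single moment constraint on the mean. Under those assumptions, the distribution closest to the prior in the relative-entropy sense has the exponentially tilted form q_i ∝ p_i·exp(λ·v_i), and λ can be found by bisection. The function names `tilt`, `expectation`, and `balance`, as well as the numbers, are hypothetical and chosen only for illustration.

```python
import math

def tilt(prior, values, lam):
    """Exponentially tilt the prior: q_i is proportional to p_i * exp(lam * v_i)."""
    weights = [p * math.exp(lam * v) for p, v in zip(prior, values)]
    total = sum(weights)
    return [w / total for w in weights]

def expectation(dist, values):
    """Mean of `values` under the discrete distribution `dist`."""
    return sum(q * v for q, v in zip(dist, values))

def balance(prior, values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Find the tilted distribution whose mean equals target_mean.

    The mean of the tilted distribution increases with lam,
    so a simple bisection on lam is sufficient.
    """
    for _ in range(200):
        mid = (lo + hi) / 2
        if expectation(tilt(prior, values, mid), values) < target_mean:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return tilt(prior, values, (lo + hi) / 2)

# Illustrative (made-up) numbers: the prior has mean 0.7,
# and the constraint requires the posterior mean to be 1.0.
values = [0.0, 1.0, 2.0]
prior = [0.5, 0.3, 0.2]
posterior = balance(prior, values, target_mean=1.0)
unchanged = balance(prior, values, target_mean=0.7)
```

When the constraint is already satisfied by the prior (here, a target mean of 0.7), the sketch returns the prior essentially unchanged, which matches the idea of using all available information and no other.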
where $n$ is the total number of records (documents), $k$ is the number of clusters, and $P_{ij}$ equals 1 when document $x_i$ belongs to cluster $c_j$ and 0 otherwise [2,5].
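The symbols defined here match the usual sum-of-squared-errors objective of k-means. Assuming that is the expression intended, it can be sketched as follows; the distance $d$ is an assumption (Euclidean distance is the common choice, and a cosine-based distance is typical for documents).

```latex
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{i=1}^{n} P_{ij} \, d(x_i, c_j)^2
```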
ALGORITHM:
Select an Initial Partition (k centers)
Repeat
    Data Assignment: Recompute Membership
    Relocation of means: Update Centers
Until (Stop Criterion)
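The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the paper: it assumes Euclidean distance, takes the initial centers as an argument, and stops when the memberships no longer change. The function names and the sample points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means: alternate assignment and center update until stable."""
    centers = [list(c) for c in initial_centers]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Data assignment: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Stop criterion (assumed): memberships did not change.
        if new_labels == labels:
            break
        labels = new_labels
        # Relocation of means: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean(members)
    return labels, centers

# Two well-separated groups of made-up 2-D points.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
labels, centers = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Passing the initial centers explicitly keeps the example deterministic; in practice the initial partition is often chosen at random or with a dedicated initialization method.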
In Bayes' rule, the product of the prior probability $\pi(\theta)$ and the likelihood $f(y \mid \theta)$ of the data given the parameter vector $\theta$ yields, after normalization, the posterior distribution, where $y$ is the data and $\theta$ are the model parameters. The denominator $m(y)$ is known as the marginal likelihood of the data. It is found by integrating the product of the likelihood and the prior density over $\theta$, an integral whose difficulty depends on the dimensionality of $\theta$.
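As a small worked illustration of Bayes' rule, the sketch below assumes a discrete parameter (a cluster label), so the integral for the marginal likelihood reduces to a sum. The cluster names, the prior, and the likelihood values are made up for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(prior, likelihood):
    """Return (posterior, marginal) for a discrete parameter.

    posterior[theta] = prior[theta] * likelihood[theta] / marginal,
    where marginal is the sum of prior * likelihood over all theta.
    """
    joint = {theta: prior[theta] * likelihood[theta] for theta in prior}
    marginal = sum(joint.values())
    return {theta: j / marginal for theta, j in joint.items()}, marginal

# Made-up prior over clusters for the query "orange".
prior = {"fruit": 0.5, "color": 0.3, "tool": 0.2}
# Made-up likelihood of seeing the word "juice" in a snippet from each cluster.
likelihood = {"fruit": 0.6, "color": 0.1, "tool": 0.05}

post, m = posterior(prior, likelihood)
best = max(post, key=post.get)  # most probable cluster after seeing the data
```

With these numbers the marginal likelihood is 0.5·0.6 + 0.3·0.1 + 0.2·0.05 = 0.34, and the "fruit" cluster receives the largest posterior probability.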

TF-IDF
TF-IDF is an abbreviation for term frequency-inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection, and it is widely used in data and text mining. The TF-IDF value increases proportionally with the number of times the word appears in the document and is offset by the number of documents in the collection that contain the word [2].
Term-by-Document Matrix (TDM):
The TDM is the most widely used structure for document representation in IR and is based on the vector space model [6,31]. In this model, documents are treated as bags of words; the document collection is represented by a matrix of D terms by n documents. Each document is represented by a vector whose entries are the normalized term frequency (tf_t) multiplied by the inverse document frequency of that term, known as the TF-IDF value (expressed by Eq. (7)), and the cosine similarity (see Eq. (3)) is used to measure the degree of similarity between two documents or between a document and a cluster centroid.
$$W_{t,i} = \frac{\mathrm{freq}_{t,i}}{\max(\mathrm{freq}_i)} \times \log\left(\frac{n}{n_t}\right)$$
where $\mathrm{freq}_{t,i}$ is the observed frequency of term $t$ in document $i$, $\max(\mathrm{freq}_i)$ is the maximum observed frequency in document $i$, $n$ is the number of documents in the collection, and $n_t$ is the number of documents in which term $t$ appears [2].
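The weighting formula and the cosine similarity mentioned above can be sketched together in Python. This is a minimal illustration under stated assumptions: documents are tokenized by whitespace, every document is non-empty, and the matrix is stored with one row per document (the transpose of the D terms by n documents layout). The function names and the three sample snippets are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build TF-IDF vectors using W = freq / max(freq) * log(n / n_t)."""
    n = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set(term for c in counts for term in c))
    # n_t: number of documents in which each term appears
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
    matrix = []
    for c in counts:
        max_freq = max(c.values())  # most frequent term in this document
        matrix.append(
            [(c[t] / max_freq) * math.log(n / doc_freq[t]) for t in vocab]
        )
    return vocab, matrix

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up snippets for the query "orange".
docs = [
    "orange fruit juice fruit",
    "orange fruit vitamin",
    "orange color paint color",
]
vocab, tdm = tfidf_matrix(docs)
sim_fruit = cosine_similarity(tdm[0], tdm[1])  # two fruit snippets
sim_mixed = cosine_similarity(tdm[0], tdm[2])  # fruit vs. color snippet
```

Because "orange" appears in all three snippets, its weight is log(3/3) = 0 in every document, so it does not contribute to the similarity. The two fruit snippets share the term "fruit" and obtain a positive similarity, while the fruit and color snippets share no weighted term and obtain a similarity of zero.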
CONCLUSION
Thus, web search result clustering can be implemented using k-means and the Balanced Bayesian algorithm. The TF-IDF matrix can be generated using the formulas discussed above, and its values indicate the documents to which a word is most relevant.
REFERENCES
[1] Anil K. Jain, Data Clustering: 50 Years Beyond K-Means. Michigan State University, Michigan.
[2] Carlos Cobos, Henry Muñoz-Collazos, Richar Urbano-Muñoz, Clustering of web search results based on the cuckoo search algorithm and Balanced Bayesian Information.
[3] C. Carpineto, S. Osiński, G. Romano, D. Weiss, A survey of Web clustering engines, ACM Comput. Surv. 41 (2009) 1-38.
[4] G. Hamerly, C. Elkan, Alternatives to the k-means algorithm that find better clusterings, Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM), 2002.
[5] M. E. Celebi, H. A. Kingravi, P. A. Vela, A comparative study of efficient initialization methods for the k-means clustering algorithm, 2013.