 Open Access
 Total Downloads : 1911
 Authors : Ms. Soniya S. Dadhania, Prof. J. S. Dhobi
 Paper ID : IJERTV1IS3186
 Volume & Issue : Volume 01, Issue 03 (May 2012)
Published (First Online): 30-05-2012
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Improved kNN Algorithm by Optimizing Cross-validation
Ms. Soniya S. Dadhania, M.E. Computer Science and Engineering
Prof. J. S. Dhobi, Head of Computer Department, GEC, Modasa
Abstract
Nowadays, web applications based on short text are increasing rapidly. The classification algorithms most commonly applied to short text data are the Support Vector Machines algorithm, the k Nearest Neighbors algorithm and the Naive Bayes algorithm. The kNN algorithm depends on the distance function and on the value of k, the number of nearest neighbors. The traditional kNN algorithm can select the best value of k using cross-validation, but there is unnecessary processing of the dataset for all possible values of k. The proposed kNN algorithm is an optimized form of the traditional kNN that reduces the time and space needed to evaluate the algorithm. Experiments are performed in the developer version of Weka 3.7.5. The proposed kNN algorithm is compared with the traditional kNN algorithm, Support Vector Machines and Naive Bayes. The proposed algorithm is more promising than the traditional kNN algorithm, as the time taken to process and the space used for cross-validation in classification are reduced.

Introduction
The increasingly important role played by short texts in modern means of Web communication and publishing, such as Twitter messages, blogs, chat messages, book and movie summaries, forums, news feeds, and customer reviews, opens new application avenues for text mining techniques, but it also raises new scientific challenges. Although text classification and clustering are well-established techniques, they are not successful in dealing with short and sparse data, because standard text similarity measures require substantial word co-occurrence or shared context.
Text classification is a learning task, where pre-defined category labels are assigned to documents based on the likelihood suggested by a training set of labeled documents. Text categorization methods proposed in the literature are difficult to compare. Datasets used in the experiments are rarely the same in different studies. Even when they are the same, different studies usually use different portions of the datasets, or they split the datasets into training and test sets differently. Here, classification is performed using Support Vector Machines, k-Nearest Neighbors and Naive Bayes. For the analysis and comparison of the different results, precision, recall and F-measure are used.
kNN is a typical example of lazy learning. Lazy learning simply stores training data at training time and delays its learning until classification time. In contrast, eager learning generates an explicit model at training time. The kNN algorithm classifies a test document based on its k nearest neighbours. The training examples can be considered as vectors in a multidimensional feature space. The space is partitioned into regions by the locations and labels of the training samples. A point in the space is assigned to the class to which most of the training points among its k nearest training samples belong. Usually Euclidean distance or cosine similarity is used. During the classification phase, the test sample (whose class needs to be identified) is also represented as a vector in the feature space. Distances or similarities from the test vector to all training vectors are computed, and the k nearest training samples are selected. There are a number of ways to classify the test vector to a specific class. The classical kNN algorithm determines the class with the majority of votes from the k nearest neighbours [2].
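The voting procedure described above can be sketched as follows. This is a minimal illustrative implementation, not the Weka code used in the experiments; the toy training vectors are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Standard Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(test_vec, train_vecs, train_labels, k):
    # Compute the distance from the test vector to every training vector,
    # take the k nearest, and return the majority class among them.
    order = sorted(range(len(train_vecs)),
                   key=lambda i: euclidean(test_vec, train_vecs[i]))
    nearest = [train_labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical two-class training set
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((1, 1), train, labels, 3))  # -> "A" (all 3 nearest are class A)
```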

Scope of improvement in kNN
Although kNN has been widely used for decades due to its simplicity, effectiveness, and robustness, it can be improved according to the application. Improvement can be done on the following parameters.

Distance Function: The distance function for measuring the difference or similarity between two instances is the standard Euclidean distance.

Selection of Value k: The neighborhood size is artificially assigned as an input parameter.

Calculating Class Probability: The class probability estimation is based on simple voting.


Proposed kNN algorithm
Distance function
The kNN algorithm depends on the distance function used for calculating the distance between the input test object and the objects in the training set. To measure the distance of data in kNN, the distance function is important. The most commonly used function is the Euclidean distance function (Euclid), which measures two input vectors (one typically being from a stored instance, and the other being an input vector to be classified). One weakness of the Euclidean distance function is that if one of the input attributes has a relatively large range, it can overpower the other attributes. Therefore, distances are often normalized by dividing the distance for each attribute by the range (i.e., maximum minus minimum) of that attribute. The cosine similarity is commonly used in text classification [15].
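The range normalization described above can be sketched as follows (an illustrative sketch; the training vectors are hypothetical):

```python
def normalized_euclidean(a, b, ranges):
    # Divide each per-attribute difference by that attribute's range
    # (max - min over the training set) so no single attribute dominates.
    return sum(((x - y) / r) ** 2 for x, y, r in zip(a, b, ranges)) ** 0.5

# Ranges computed column-wise from a (hypothetical) training set:
# attribute 1 spans 1.0..3.0, attribute 2 spans 100.0..900.0.
train = [(1.0, 100.0), (2.0, 900.0), (3.0, 500.0)]
ranges = [max(col) - min(col) for col in zip(*train)]  # [2.0, 800.0]
print(normalized_euclidean((1.0, 100.0), (3.0, 900.0), ranges))  # -> sqrt(2)
```

Without the division by `ranges`, the second attribute (spanning 800 units) would dwarf the first in every distance computation.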
In the proposed algorithm the cosine similarity function is applied instead of the Euclidean distance, but the results are found to be similar for both distance functions.
Cosine Similarity
Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitudes as
cos(θ) = (A · B) / (‖A‖ ‖B‖)
For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.
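The formula above can be computed directly from term-frequency vectors; the document vectors below are hypothetical counts, purely for illustration:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Term-frequency vectors for two short documents over a 4-term vocabulary
doc_a = [2, 1, 0, 1]
doc_b = [1, 1, 1, 0]
print(cosine_similarity(doc_a, doc_b))  # between 0 and 1 for non-negative tf vectors
```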
Selection of value k
In the traditional kNN algorithm, the value of k is fixed beforehand. If k is too large, big classes will overwhelm small ones. On the other hand, if k is too small, the advantage of the kNN algorithm, which can make use of many experts, will not be exhibited. To find the best value of k suitable to the data set, the traditional kNN algorithm uses cross-validation. The CROSSVALIDATION option specifies settings for performing V-fold cross-validation to determine the best number of neighbors.

V-fold cross-validation divides the data into V folds. Then, for a fixed k, it applies nearest neighbor analysis to make predictions on the vth fold (using the other V−1 folds as the training sample) and evaluates the error. This process is successively applied to all possible choices of v. At the end of the V folds, the computed errors are averaged. The above steps are repeated for various values of k. The value achieving the lowest average error is selected as the optimal value for k.

If multiple values of k are tied on the lowest average error, then the smallest k among those that are tied is selected [17].
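The traditional selection procedure can be sketched as follows. This is a simplified illustration (the paper's experiments use Weka's implementation); `folds` and `evaluate` are hypothetical stand-ins for the V splits and the per-fold error computation.

```python
def best_k_traditional(folds, evaluate, max_k):
    # folds: list of (train, test) splits; evaluate(train, test, k) -> error rate.
    # Tries every k from 1 to max_k, averages the error over all V folds,
    # and keeps the k with the lowest average error (smallest k wins ties).
    best_k, best_err = None, float("inf")
    for k in range(1, max_k + 1):
        avg_err = sum(evaluate(tr, te, k) for tr, te in folds) / len(folds)
        if avg_err < best_err:          # strict < keeps the smallest tied k
            best_k, best_err = k, avg_err
    return best_k

# Demo with hypothetical per-k error rates (k=3 and k=4 tie; smallest is chosen)
errors = {1: 0.3, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.4}
folds = [(None, None)] * 10            # stand-ins for 10 (train, test) splits
print(best_k_traditional(folds, lambda tr, te, k: errors[k], 5))  # -> 3
```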
The cross-validation process of the traditional kNN algorithm works efficiently when the number of instances in the data set is small, i.e. when the size of the data set is small; but as the size of the data set increases, it takes longer to cross-validate for each value of k up to the user-specified maximum value of k.
Example A, for a better understanding of cross-validation in traditional kNN:
For some data set DS1: No. of attributes: 21; No. of instances: 12000; Maximum value of k: 10; Best value of k: 5
The cross-validation process will be repeated:
(Max k − 1) * No. of instances = (10 − 1) * 12000 = 9 * 12000 = 108000 times.
This is a large value and takes a long time to process.
To overcome the problem of unnecessary iterations in cross-validation when finding the best value of k, a new algorithm is proposed.
The CROSSVALIDATION option in the proposed kNN algorithm also specifies settings for performing V-fold cross-validation, but to determine the best number of neighbors the cross-validation process is not applied to every choice of k; it stops when the best value is found. It is observed from the results of the effect of the value of k in kNN that accuracy increases up to the best value of k and decreases after it. The cross-validation process starts from the maximum value of k specified as input and runs down to k = 1.
In the proposed algorithm, at each step the cross-validated performance for the current value of k is compared with that of the previous value; if the performance decreases, the value of k used in the previous step is selected as the best value of k. The newly proposed kNN algorithm reduces the number of iterations for finding the best value of k. Due to the decrease in the number of iterations, the time and space needed to find the best k are also decreased.
Applying the proposed kNN in Example A, the cross-validation process will be repeated:
(Max k − 1 − best k) * No. of instances = (10 − 1 − 5) * 12000 = 4 * 12000 = 48000 times.
In the proposed kNN algorithm the iterations of the loop for finding the best value of k are reduced from 108000 to 48000.
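The early-stopping search can be sketched as follows. This is an illustrative sketch of the idea, not the author's Java code; `folds`, `evaluate`, and the demo error values are hypothetical.

```python
def best_k_proposed(folds, evaluate, max_k):
    # Starts at max_k and moves downward; as soon as the cross-validated
    # error gets worse than the previous k's error, the previous k is
    # returned as the best value, skipping all remaining candidates.
    prev_k, prev_err = None, float("inf")
    for k in range(max_k, 0, -1):
        avg_err = sum(evaluate(tr, te, k) for tr, te in folds) / len(folds)
        if avg_err > prev_err:          # performance dropped: previous k was best
            return prev_k
        prev_k, prev_err = k, avg_err
    return prev_k                       # error never worsened down to k = 1

# Demo with hypothetical per-k error rates peaking in accuracy at k = 5:
# the search stops at k = 4 instead of continuing down to k = 1.
errors = {10: 0.30, 9: 0.28, 8: 0.25, 7: 0.22, 6: 0.20,
          5: 0.15, 4: 0.25, 3: 0.30, 2: 0.35, 1: 0.40}
folds = [(None, None)] * 10
print(best_k_proposed(folds, lambda tr, te, k: errors[k], 10))  # -> 5
```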


Experimental results
WEKA 3.7.5 is used for performing the experiments. The proposed algorithm is coded in Java. Datasets are taken from http://kavitaganesan.com/opinosis (opinion dataset) and http://boston.lti.cs.cmu.edu/classes/95865/HW/HW2/
P = precision, R = recall, F = F-measure.
Comparing kNN, Naïve Bayes and SVM for short text classification
Data set 1: Review of notebook; No. of attributes: 7; No. of instances: 19; Cross-validation: 10 folds

Category    kNN                 Naïve Bayes         SVM
No.         P     R     F       P     R     F       P     R     F
1           1     1     1       1     1     1       1     1     1
2           0.5   1     0.66    0.55  1     0.71    0     0     0
3           1     1     1       0.83  1     0.9     0.66  0.8   0.72
4           1     1     1       1     0.75  0.85    1     0.75  0.85
5           0     0     0       0     0     0       0     0     0
Avg         0.68  0.79  0.72    0.66  0.75  0.68    0.51  0.5   0.50

Table 1: Comparison in Data set 1
Data set 2: Review of Swiss hotel; No. of attributes: 6; No. of instances: 18; Cross-validation: 10 folds

Category    kNN                 Naïve Bayes         SVM
No.         P     R     F       P     R     F       P     R     F
1           0.8   0.8   0.8     0.75  0.6   0.66    0.8   0.8   0.8
2           1     1     1       1     1     1       1     1     1
3           0.6   0.75  0.66    0.5   0.75  0.6     0.42  0.75  0.54
4           1     0.8   0.88    1     0.8   0.88    1     0.4   0.57
Avg         0.85  0.83  0.84    0.82  0.78  0.79    0.88  0.72  0.72

Table 2: Comparison in Data set 2
Data set 3: Auto; No. of attributes: 21; No. of instances: 12000; Cross-validation: 10 folds

Category    kNN                 Naïve Bayes         SVM
No.         P     R     F       P     R     F       P     R     F
1           0.99  0.99  0.99    0.99  0.99  0.99    1     0.99  0.99
2           0.99  0.99  0.99    0.99  1     0.99    0.99  1     0.99
Avg         0.99  0.99  0.98    0.99  0.99  0.99    0.99  0.99  0.99

Table 3: Comparison in Data set 3
Data set 4: Ford; No. of attributes: 11; No. of instances: 6000; Cross-validation: 10 folds

Category    kNN                 Naïve Bayes         SVM
No.         P     R     F       P     R     F       P     R     F
1           0.87  0.87  0.87    0.68  0.96  0.80    0.73  0.94  0.82
2           0.87  0.87  0.87    0.93  0.56  0.70    0.92  0.65  0.76
Avg         0.87  0.87  0.87    0.81  0.76  0.75    0.82  0.80  0.79

Table 4: Comparison in Data set 4
Average result of Data set 1

              Precision   Recall   F-measure
kNN           0.688       0.792    0.722
Naïve Bayes   0.664       0.75     0.689
SVM           0.514       0.5      0.503

Table 5: Average result of Data set 1
Average result of Data set 2

              Precision   Recall   F-measure
kNN           0.856       0.833    0.84
Naïve Bayes   0.819       0.778    0.788
SVM           0.817       0.722    0.724

Table 6: Average result of Data set 2
Average result of Data set 3

              Precision   Recall   F-measure
kNN           0.998       0.998    0.998
Naïve Bayes   0.997       0.997    0.997
SVM           0.995       0.995    0.995

Table 7: Average result of Data set 3
Average result of Data set 4

              Precision   Recall   F-measure
kNN           0.878       0.878    0.877
Naïve Bayes   0.814       0.764    0.754
SVM           0.829       0.802    0.798

Table 8: Average result of Data set 4
Examining effect of k value in kNN

Figure 1: Comparison between kNN, NB and SVM

Data set 1: Review of iPod; No. of attributes: 68; No. of instances: 19; Classifier: kNN; Cross-validation: 10 folds

Selected value of k    Correctly classified instances
1                      63.16%
3                      84.21%
5                      78.94%
8                      73.68%

Table 9: Effect of k value in Data set 1

Data set 2: No. of attributes: 21; No. of instances: 12000; Classifier: kNN; Cross-validation: 10 folds

Selected value of k    Correctly classified instances
1                      99.81%
3                      99.79%
5                      99.7%
7                      99.67%

Table 10: Effect of k value in Data set 2

Data set 3: No. of attributes: 21; No. of instances: 6000; Classifier: kNN; Cross-validation: 10 folds

Selected value of k    Correctly classified instances
1                      87.33%
3                      88.78%
5                      89.23%
7                      89.41%
9                      89.3%
11                     89.09%

Table 11: Effect of k value in Data set 3

Data set 4: No. of attributes: 21; No. of instances: 1382; Classifier: kNN; Cross-validation: 10 folds
Selected value of k    Correctly classified instances
1                      62.44%
3                      66.20%
5                      68.16%
7                      68.23%
9                      68.37%
11                     69.17%
13                     69.60%
15                     69.97%
17                     70.47%
19                     69.75%
21                     69.68%
Table 12: Effect of k value in Data set 4

From the above tables it can be concluded that if the value of k is appropriate to the data, the increase in the efficiency of kNN is noticeable. Here, in data set 1, k = 3 is the best value of k, giving 84.21% correctly classified instances. If the value of k is less than or greater than the best k value, the performance of kNN suffers.
It is also clear from the observations that accuracy increases up to the best value of k and starts to decrease after it.

Comparison of traditional kNN and proposed kNN

To find the best value of k suitable to the data set, the traditional kNN algorithm uses cross-validation.
The CROSSVALIDATION option in the proposed kNN algorithm also specifies settings for performing V-fold cross-validation, but to determine the best number of neighbors the cross-validation process is not applied to every choice of k; it stops when the best value is found. It can be observed from the results of the effect of the value of k in kNN that accuracy increases up to the best value of k and decreases after it. The newly proposed kNN algorithm thus reduces the number of iterations needed to find the best value of k. Due to the decrease in the number of iterations, the time and space needed to find the best k are also decreased.
The data sets used to examine the effect of the value of k are also used in the comparison of the traditional and proposed algorithms.

Data set     Maximum      % of correct     No. of iterations    No. of iterations   Reduced no.
             value of k   classification   in traditional kNN   in proposed kNN     of iterations
Data set 1   10           84.21            9                    7                   2
Data set 2   10           99.81            9                    9                   0
Data set 3   10           89.41            9                    4                   5
Data set 4   20           70.47            19                   4                   15

Table 13: Traditional kNN v/s Proposed kNN
Figure 2 Traditional kNN v/s Proposed kNN

Conclusion
For short text classification, the kNN, Naïve Bayes and SVM algorithms can be used. From the results in section 4 it is concluded that kNN gives better accuracy than the other two algorithms. Also, when the kNN algorithm is used with attribute selection, its classification accuracy increases. The kNN algorithm depends on the distance function and on the value of k, the number of nearest neighbors; the traditional kNN algorithm finds the best value of k using cross-validation, but the time and space it uses are larger due to unnecessary processing for each and every value of k from the maximum k down to 1. In the proposed kNN algorithm the unnecessary cross-validation processing is reduced, due to which the time and space used for classification are also reduced. Table 13 shows the reduced number of iterations in the proposed algorithm. The proposed kNN is more promising than the traditional kNN, as larger data sets can be used for classification and the time for evaluation and building the model for a large data set is reduced.

References

[1] Arzucan Özgür, Levent Özgür, and Tunga Güngör, "Text Categorization with Class-based and Corpus-based Keyword Selection", In Proceedings of the 20th International Conference on Computer and Information Sciences, Springer-Verlag, October 2005.
[2] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand and Dan Steinberg, "Top 10 Algorithms in Data Mining", Knowledge and Information Systems, Volume 14, Issue 1, December 2007.
[3] Xuan-Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, "Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections", In Proceedings of the 17th International Conference on World Wide Web, ACM, New York, NY, USA, 2008.
[4] Alfonso Marin, "Comparison of Automatic Classifiers' Performances using Word-based Feature Extraction Techniques in an E-government Setting", University essay from KTH/Skolan för informations- och kommunikationsteknik (ICT), January 2011.
[5] Victoria Bobicev, Marina Sokolova, "An Effective and Robust Method for Short Text Classification", In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, Volume 3, July 2008.
[6] Hitesh Sajnani, Sara Javanmardi, David W. McDonald, Cristina V. Lopes, "Multi-Label Classification of Short Text: A Study on Wikipedia Barnstars", Paper from the AAAI-11 Workshop on Analyzing Microtext, 2011.
[7] Bengel, J., Gauch, S., Mittur, E. and Vijayaraghavan, R., "Chat Room Topic Detection Using Classification", In Proceedings of the 2nd Symposium on Intelligence and Security Informatics, 2004.
[8] Yiming Yang, "A Study of Thresholding Strategies for Text Categorization", In Proceedings of the 24th International ACM SIGIR Conference, 2001.
[9] Yang, Y. and Liu, X., "A Re-examination of Text Categorization Methods", In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
[10] Wei Wang, Sujian Li, Chen Wang, "An Improved KNN Algorithm for Text Categorization", In Proceedings of the NTCIR-7 Workshop Meeting, 2008.
[11] Ali Mustafa Qamar, Eric Gaussier, Jean-Pierre Chevallet, Joo Hwee Lim, "Similarity Learning for Nearest Neighbor Classification", ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.
[12] Wen-tau Yih and Christopher Meek, "Improving Similarity Measures for Short Segments of Text", Proceedings of the 22nd National Conference on Artificial Intelligence – Volume
[13] Jasmine Kalathipparambil Sudhakaran, Ramaswamy Vasantha, "A Mixed Method Approach for Efficient Component Retrieval from a Component Repository", Journal of Software Engineering and Applications, 2011.
[14] Gautam Bhattacharya, Koushik Ghosh, Ananda S. Chowdhury, "An Affinity-based New Local Distance Function and Similarity Measure for kNN Algorithm", Pattern Recognition Letters, Volume 33, Issue 3, Elsevier Science Inc., 2012.
[15] Muhammed Miah, "Improved kNN Algorithm for Text Classification", Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA, 2003.
[16] Z. Xie, W. Hsu, Z. Liu, and M. Lee, "SNNB: A Selective Neighborhood Based Naive Bayes for Lazy Learning", Proc. 6th Pacific-Asia Conf. on KDD, Taipei, Taiwan, 2002.
[17] CROSSVALIDATION subcommand in KNN, http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp