 Open Access
 Total Downloads : 1911
 Authors : Ms. Soniya S. Dadhania, Prof. J. S. Dhobi
 Paper ID : IJERTV1IS3186
 Volume & Issue : Volume 01, Issue 03 (May 2012)
Published (First Online): 30-05-2012
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Improved kNN Algorithm by Optimizing Cross-validation
Ms. Soniya S. Dadhania, M.E. Computer Science and Engineering
Prof. J. S. Dhobi, Head of Computer Department, GEC, Modasa
Abstract
Nowadays, web applications based on short text are increasing rapidly. The classification algorithms most commonly applied to short text data are the Support Vector Machines algorithm, the k Nearest Neighbors algorithm and the Naive Bayes algorithm. The kNN algorithm depends on the distance function and on the value of k, the number of nearest neighbors. The traditional kNN algorithm can select the best value of k using cross-validation, but there is unnecessary processing of the dataset for all possible values of k. The proposed kNN algorithm is an optimized form of the traditional kNN that reduces the time and space needed to evaluate the algorithm. Experiments are performed in the developer version of Weka 3.7.5. The proposed kNN algorithm is compared with the traditional kNN algorithm, Support Vector Machines and Naive Bayes. The proposed algorithm is more promising than the traditional kNN algorithm, as the time taken to process and the space used for cross-validation in classification are reduced.

Introduction
The increasingly important role played by short texts in modern means of Web communication and publishing, such as Twitter messages, blogs, chat messages, book and movie summaries, forums, news feeds, and customer reviews, opens new application avenues for text mining techniques, but it also raises new scientific challenges. Although text classification and clustering are well-established techniques, they are not successful in dealing with short and sparse data, because standard text similarity measures require substantial word co-occurrence or shared context.
Text classification is a learning task, where pre-defined category labels are assigned to documents based on the likelihood suggested by a training set of labeled documents. Text categorization methods proposed in the literature are difficult to compare. Datasets used in the experiments are rarely the same in different studies. Even when they are the same, different studies usually use different portions of the datasets, or they split the datasets into training and test sets differently. Here, classification is performed using Support Vector Machines, k-Nearest Neighbors and Naive Bayes. For the analysis and comparison of the different results, precision, recall and F-measure are used.
kNN is a typical example of lazy learning. Lazy learning simply stores training data at training time and delays its learning until classification time. In contrast, eager learning generates an explicit model at training time. The kNN algorithm classifies a test document based on its k nearest neighbours. The training examples can be considered as vectors in a multidimensional feature space. The space is partitioned into regions by the locations and labels of the training samples. A point in the space is assigned to the class to which most of the training points among its k nearest training samples belong. Usually Euclidean distance or cosine similarity is used. During the classification phase, the test sample (whose class needs to be identified) is also represented as a vector in the feature space. Distances or similarities from the test vector to all training vectors are computed, and the k nearest training samples are selected. There are a number of ways to classify the test vector to a specific class. The classical kNN algorithm determines the class with the majority of votes from the k nearest neighbours [2].
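The voting procedure described above can be sketched as follows. This is a minimal illustrative implementation, not the Weka code used in the experiments; the toy training vectors are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Standard Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(test_vec, train_vecs, train_labels, k):
    # Compute the distance from the test vector to every training vector,
    # take the k nearest, and return the majority class among them.
    order = sorted(range(len(train_vecs)),
                   key=lambda i: euclidean(test_vec, train_vecs[i]))
    nearest = [train_labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical two-class training set
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((1, 1), train, labels, 3))  # -> "A" (all 3 nearest are class A)
```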

Scope of improvement in kNN
Although kNN has been widely used for decades due to its simplicity, effectiveness, and robustness, it can be improved according to the application. Improvement can be done on the following parameters.

Distance Function: The distance function for measuring the difference or similarity between two instances is the standard Euclidean distance.

Selection of Value k: The neighborhood size is artificially assigned as an input parameter.

Calculating Class Probability: The class probability estimation is based on simple voting.


Proposed kNN algorithm
Distance function
The kNN algorithm depends on the distance function used for calculating the distance between the input test object and the objects in the training set. To measure the distance of data in kNN, the distance function is important. The most commonly used function is the Euclidean distance function (Euclid), which measures two input vectors (one typically being from a stored instance, and the other being an input vector to be classified). One weakness of the Euclidean distance function is that if one of the input attributes has a relatively large range, it can overpower the other attributes. Therefore, distances are often normalized by dividing the distance for each attribute by the range (i.e., maximum minus minimum) of that attribute. The cosine similarity is commonly used in text classification [15].
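The range normalization described above can be sketched as follows (an illustrative sketch; the training vectors are hypothetical):

```python
def normalized_euclidean(a, b, ranges):
    # Divide each per-attribute difference by that attribute's range
    # (max - min over the training set) so no single attribute dominates.
    return sum(((x - y) / r) ** 2 for x, y, r in zip(a, b, ranges)) ** 0.5

# Ranges computed column-wise from a (hypothetical) training set:
# attribute 1 spans 1.0..3.0, attribute 2 spans 100.0..900.0.
train = [(1.0, 100.0), (2.0, 900.0), (3.0, 500.0)]
ranges = [max(col) - min(col) for col in zip(*train)]  # [2.0, 800.0]
print(normalized_euclidean((1.0, 100.0), (3.0, 900.0), ranges))  # -> sqrt(2)
```

Without the division by `ranges`, the second attribute (spanning 800 units) would dwarf the first in every distance computation.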
In the proposed algorithm the cosine similarity function is applied instead of the Euclidean distance, but the results are found to be similar for both distance functions.
Cosine Similarity
Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitudes as
cos(θ) = (A · B) / (‖A‖ ‖B‖)
For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.
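The formula above can be computed directly from term-frequency vectors; the document vectors below are hypothetical counts, purely for illustration:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Term-frequency vectors for two short documents over a 4-term vocabulary
doc_a = [2, 1, 0, 1]
doc_b = [1, 1, 1, 0]
print(cosine_similarity(doc_a, doc_b))  # between 0 and 1 for non-negative tf vectors
```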
Selection of value k
In the traditional kNN algorithm, the value of k is fixed beforehand. If k is too large, big classes will overwhelm small ones. On the other hand, if k is too small, the advantage of the kNN algorithm, which can make use of many experts, will not be exhibited. To find the best value of k suitable to the data set, the traditional kNN algorithm uses cross-validation. The CROSSVALIDATION option specifies settings for performing V-fold cross-validation to determine the best number of neighbors.

V-fold cross-validation divides the data into V folds. Then, for a fixed k, it applies nearest neighbor analysis to make predictions on the vth fold (using the other V−1 folds as the training sample) and evaluates the error. This process is successively applied to all possible choices of v. At the end of the V folds, the computed errors are averaged. The above steps are repeated for various values of k. The value achieving the lowest average error is selected as the optimal value for k.

If multiple values of k are tied on the lowest average error, then the smallest k among those that are tied is selected [17].
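The traditional selection procedure can be sketched as follows. This is a simplified illustration (the paper's experiments use Weka's implementation); `folds` and `evaluate` are hypothetical stand-ins for the V splits and the per-fold error computation.

```python
def best_k_traditional(folds, evaluate, max_k):
    # folds: list of (train, test) splits; evaluate(train, test, k) -> error rate.
    # Tries every k from 1 to max_k, averages the error over all V folds,
    # and keeps the k with the lowest average error (smallest k wins ties).
    best_k, best_err = None, float("inf")
    for k in range(1, max_k + 1):
        avg_err = sum(evaluate(tr, te, k) for tr, te in folds) / len(folds)
        if avg_err < best_err:          # strict < keeps the smallest tied k
            best_k, best_err = k, avg_err
    return best_k

# Demo with hypothetical per-k error rates (k=3 and k=4 tie; smallest is chosen)
errors = {1: 0.3, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.4}
folds = [(None, None)] * 10            # stand-ins for 10 (train, test) splits
print(best_k_traditional(folds, lambda tr, te, k: errors[k], 5))  # -> 3
```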
The cross-validation process of the traditional kNN algorithm works efficiently when the number of instances in the data set is small, i.e. when the size of the data set is small; but as the size of the data set increases, it takes longer to cross-validate for each value of k up to the user-specified maximum value of k.
Example A, for a better understanding of cross-validation in traditional kNN:
For some data set DS1: No. of attributes: 21; No. of instances: 12000; Maximum value of k: 10; Best value of k: 5
The cross-validation process will be repeated:
(Max k − 1) * No. of instances = (10 − 1) * 12000 = 9 * 12000 = 108000 times.
This is a large value and takes a long time to process.
To overcome the problem of unnecessary iterations in cross-validation when finding the best value of k, a new algorithm is proposed.
The CROSSVALIDATION option in the proposed kNN algorithm also specifies settings for performing V-fold cross-validation, but to determine the best number of neighbors the cross-validation process is not applied to every choice of k; it stops when the best value is found. It is observed from the results of the effect of the value of k in kNN that accuracy increases up to the best value of k and decreases after it. The cross-validation process starts from the maximum value of k specified as input and runs down to k = 1.
In the proposed algorithm, at each step the cross-validated performance for the current value of k is compared with that of the previous value; if the performance decreases, the value of k used in the previous step is selected as the best value of k. The newly proposed kNN algorithm reduces the number of iterations for finding the best value of k. Due to the decrease in the number of iterations, the time and space needed to find the best k are also decreased.
Applying the proposed kNN in Example A, the cross-validation process will be repeated:
(Max k − 1 − best k) * No. of instances = (10 − 1 − 5) * 12000 = 4 * 12000 = 48000 times.
In the proposed kNN algorithm the iterations of the loop for finding the best value of k are reduced from 108000 to 48000.
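The early-stopping search can be sketched as follows. This is an illustrative sketch of the idea, not the author's Java code; `folds`, `evaluate`, and the demo error values are hypothetical.

```python
def best_k_proposed(folds, evaluate, max_k):
    # Starts at max_k and moves downward; as soon as the cross-validated
    # error gets worse than the previous k's error, the previous k is
    # returned as the best value, skipping all remaining candidates.
    prev_k, prev_err = None, float("inf")
    for k in range(max_k, 0, -1):
        avg_err = sum(evaluate(tr, te, k) for tr, te in folds) / len(folds)
        if avg_err > prev_err:          # performance dropped: previous k was best
            return prev_k
        prev_k, prev_err = k, avg_err
    return prev_k                       # error never worsened down to k = 1

# Demo with hypothetical per-k error rates peaking in accuracy at k = 5:
# the search stops at k = 4 instead of continuing down to k = 1.
errors = {10: 0.30, 9: 0.28, 8: 0.25, 7: 0.22, 6: 0.20,
          5: 0.15, 4: 0.25, 3: 0.30, 2: 0.35, 1: 0.40}
folds = [(None, None)] * 10
print(best_k_proposed(folds, lambda tr, te, k: errors[k], 10))  # -> 5
```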


Experimental results
WEKA 3.7.5 is used for performing the experiments. The proposed algorithm is coded in Java. Datasets are taken from http://kavitaganesan.com/opinosis (opinion dataset) and http://boston.lti.cs.cmu.edu/classes/95865/HW/HW2/
P = precision, R = recall, F = F-measure.
Comparing kNN, Naïve Bayes and SVM for short text classification
Data set 1: Review of notebook; No. of attributes: 7; No. of instances: 19; Cross-validation: 10 folds

Category    kNN                 Naïve Bayes         SVM
No.         P     R     F       P     R     F       P     R     F
1           1     1     1       1     1     1       1     1     1
2           0.5   1     0.66    0.55  1     0.71    0     0     0
3           1     1     1       0.83  1     0.9     0.66  0.8   0.72
4           1     1     1       1     0.75  0.85    1     0.75  0.85
5           0     0     0       0     0     0       0     0     0
Avg         0.68  0.79  0.72    0.66  0.75  0.68    0.51  0.5   0.50

Table 1: Comparison in Data set 1
Data set 2: Review of Swiss hotel; No. of attributes: 6; No. of instances: 18; Cross-validation: 10 folds

Category    kNN                 Naïve Bayes         SVM
No.         P     R     F       P     R     F       P     R     F
1           0.8   0.8   0.8     0.75  0.6   0.66    0.8   0.8   0.8
2           1     1     1       1     1     1       1     1     1
3           0.6   0.75  0.66    0.5   0.75  0.6     0.42  0.75  0.54
4           1     0.8   0.88    1     0.8   0.88    1     0.4   0.57
Avg         0.85  0.83  0.84    0.82  0.78  0.79    0.88  0.72  0.72

Table 2: Comparison in Data set 2
Data set 3: Auto; No. of attributes: 21; No. of instances: 12000; Cross-validation: 10 folds

Category    kNN                 Naïve Bayes         SVM
No.         P     R     F       P     R     F       P     R     F
1           0.99  0.99  0.99    0.99  0.99  0.99    1     0.99  0.99
2           0.99  0.99  0.99    0.99  1     0.99    0.99  1     0.99
Avg         0.99  0.99  0.98    0.99  0.99  0.99    0.99  0.99  0.99

Table 3: Comparison in Data set 3
Data set 4: Ford; No. of attributes: 11; No. of instances: 6000; Cross-validation: 10 folds

Category    kNN                 Naïve Bayes         SVM
No.         P     R     F       P     R     F       P     R     F
1           0.87  0.87  0.87    0.68  0.96  0.80    0.73  0.94  0.82
2           0.87  0.87  0.87    0.93  0.56  0.70    0.92  0.65  0.76
Avg         0.87  0.87  0.87    0.81  0.76  0.75    0.82  0.80  0.79

Table 4: Comparison in Data set 4
Average result of Data set 1

              Precision   Recall   F-measure
kNN           0.688       0.792    0.722
Naïve Bayes   0.664       0.75     0.689
SVM           0.514       0.5      0.503

Table 5: Average result of Data set 1
Average result of Data set 2

              Precision   Recall   F-measure
kNN           0.856       0.833    0.84
Naïve Bayes   0.819       0.778    0.788
SVM           0.817       0.722    0.724

Table 6: Average result of Data set 2
Average result of Data set 3

              Precision   Recall   F-measure
kNN           0.998       0.998    0.998
Naïve Bayes   0.997       0.997    0.997
SVM           0.995       0.995    0.995

Table 7: Average result of Data set 3
Average result of Data set 4

              Precision   Recall   F-measure
kNN           0.878       0.878    0.877
Naïve Bayes   0.814       0.764    0.754
SVM           0.829       0.802    0.798

Table 8: Average result of Data set 4
Examining effect of k value in kNN

Figure 1: Comparison between kNN, NB and SVM

Data set 1: Review of iPod; No. of attributes: 68; No. of instances: 19; Classifier: kNN; Cross-validation: 10 folds

Selected value of k    Correctly classified instances
1                      63.16%
3                      84.21%
5                      78.94%
8                      73.68%

Table 9: Effect of k value in Data set 1

Data set 2: No. of attributes: 21; No. of instances: 12000; Classifier: kNN; Cross-validation: 10 folds

Selected value of k    Correctly classified instances
1                      99.81%
3                      99.79%
5                      99.7%
7                      99.67%

Table 10: Effect of k value in Data set 2

Data set 3: No. of attributes: 21; No. of instances: 6000; Classifier: kNN; Cross-validation: 10 folds

Selected value of k    Correctly classified instances
1                      87.33%
3                      88.78%
5                      89.23%
7                      89.41%
9                      89.3%
11                     89.09%

Table 11: Effect of k value in Data set 3

Data set 4: No. of attributes: 21; No. of instances: 1382; Classifier: kNN; Cross-validation: 10 folds
Selected value of k    Correctly classified instances
1                      62.44%
3                      66.20%
5                      68.16%
7                      68.23%
9                      68.37%
11                     69.17%
13                     69.60%
15                     69.97%
17                     70.47%
19                     69.75%
21                     69.68%
Table 12: Effect of k value in Data set 4

From the above tables it can be concluded that if the value of k is appropriate to the data, the increase in the efficiency of kNN is noticeable. Here, in data set 1, k = 3 is the best value of k, giving 84.21% correctly classified instances. If the value of k is less than or greater than the best k value, the performance of kNN suffers.
It is also clear from the observations that accuracy increases up to the best value of k and starts to decrease after it.

Comparison of traditional kNN and proposed kNN

To find the best value of k suitable to the data set, the traditional kNN algorithm uses cross-validation.
The CROSSVALIDATION option in the proposed kNN algorithm also specifies settings for performing V-fold cross-validation, but to determine the best number of neighbors the cross-validation process is not applied to every choice of k; it stops when the best value is found. It can be observed from the results of the effect of the value of k in kNN that accuracy increases up to the best value of k and decreases after it. The newly proposed kNN algorithm thus reduces the number of iterations needed to find the best value of k. Due to the decrease in the number of iterations, the time and space needed to find the best k are also decreased.
The data sets used to examine the effect of the value of k are also used in the comparison of the traditional and proposed algorithms.

Data set     Maximum      % of correct     No. of iterations    No. of iterations   Reduced no.
             value of k   classification   in traditional kNN   in proposed kNN     of iterations
Data set 1   10           84.21            9                    7                   2
Data set 2   10           99.81            9                    9                   0
Data set 3   10           89.41            9                    4                   5
Data set 4   20           70.47            19                   4                   15

Table 13: Traditional kNN v/s Proposed kNN
Figure 2 Traditional kNN v/s Proposed kNN

Conclusion
For short text classification, the kNN, Naïve Bayes and SVM algorithms can be used. From the results in section 4 it is concluded that kNN gives better accuracy than the other two algorithms. Also, when the kNN algorithm is used with attribute selection, its classification accuracy increases. The kNN algorithm depends on the distance function and on the value of k, the number of nearest neighbors; the traditional kNN algorithm finds the best value of k using cross-validation, but the time and space it uses are larger due to unnecessary processing for each and every value of k from the maximum k down to 1. In the proposed kNN algorithm the unnecessary cross-validation processing is reduced, due to which the time and space used for classification are also reduced. Table 13 shows the reduced number of iterations in the proposed algorithm. The proposed kNN is more promising than the traditional kNN, as larger data sets can be used for classification and the time for evaluation and building the model for a large data set is reduced.

References

[1] Arzucan Özgür, Levent Özgür, and Tunga Güngör, "Text Categorization with Class-based and Corpus-based Keyword Selection", In Proceedings of the 20th International Conference on Computer and Information Sciences, Springer-Verlag, October 2005.
[2] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand and Dan Steinberg, "Top 10 Algorithms in Data Mining", Knowledge and Information Systems, Volume 14, Issue 1, December 2007.
[3] Xuan-Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, "Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections", In Proceedings of the 17th International Conference on World Wide Web, ACM, New York, NY, USA, 2008.
[4] Alfonso Marin, "Comparison of Automatic Classifiers' Performances using Word-based Feature Extraction Techniques in an E-government Setting", University essay from KTH/Skolan för informations- och kommunikationsteknik (ICT), January 2011.
[5] Victoria Bobicev, Marina Sokolova, "An Effective and Robust Method for Short Text Classification", In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, Volume 3, July 2008.
[6] Hitesh Sajnani, Sara Javanmardi, David W. McDonald, Cristina V. Lopes, "Multi-Label Classification of Short Text: A Study on Wikipedia Barnstars", Paper from the AAAI-11 Workshop on Analyzing Microtext, 2011.
[7] Bengel, J., Gauch, S., Mittur, E. and Vijayaraghavan, R., "Chat Room Topic Detection Using Classification", In Proceedings of the 2nd Symposium on Intelligence and Security Informatics, 2004.
[8] Yiming Yang, "A Study of Thresholding Strategies for Text Categorization", In Proceedings of the 24th International ACM SIGIR Conference, 2001.
[9] Yang, Y. and Liu, X., "A Re-examination of Text Categorization Methods", In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
[10] Wei Wang, Sujian Li, Chen Wang, "An Improved KNN Algorithm for Text Categorization", In Proceedings of the NTCIR-7 Workshop Meeting, 2008.
[11] Ali Mustafa Qamar, Eric Gaussier, Jean-Pierre Chevallet, Joo Hwee Lim, "Similarity Learning for Nearest Neighbor Classification", ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.
[12] Wen-tau Yih and Christopher Meek, "Improving Similarity Measures for Short Segments of Text", Proceedings of the 22nd National Conference on Artificial Intelligence – Volume
[13] Jasmine Kalathipparambil Sudhakaran, Ramaswamy Vasantha, "A Mixed Method Approach for Efficient Component Retrieval from a Component Repository", Journal of Software Engineering and Applications, 2011.
[14] Gautam Bhattacharya, Koushik Ghosh, Ananda S. Chowdhury, "An Affinity-based New Local Distance Function and Similarity Measure for kNN Algorithm", Pattern Recognition Letters, Volume 33, Issue 3, Elsevier Science Inc., 2012.
[15] Muhammed Miah, "Improved kNN Algorithm for Text Classification", Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA, 2003.
[16] Z. Xie, W. Hsu, Z. Liu, and M. Lee, "SNNB: A Selective Neighborhood Based Naive Bayes for Lazy Learning", Proc. 6th Pacific-Asia Conf. on KDD, Taipei, Taiwan, 2002.
[17] CROSSVALIDATION subcommand in KNN, http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp