Comparison of SVM and Naive Bayes Text Classification Algorithms using WEKA

Nitin Rajvanshi; K.R.Chowdhary

doi:10.17577/IJERTV6IS090084

Volume 06, Issue 09 (September 2017)


Open Access
Paper ID : IJERTV6IS090084
Published (First Online): 14-09-2017
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Nitin Rajvanshi
Research Scholar, Dept. of Computer Sci. & Engg, M.B.M. Engineering College, Jodhpur

Dr. K. R. Chowdhary
Director, Jodhpur Institute of Engineering & Technology – SETG; Visiting faculty of IITJ; formerly Professor & Head, Dept of CSE, J.N.V. University, Jodhpur, India.

Abstract: Due to the growing amount of textual data, automatic methods for managing it are needed. Automated text classification is regarded as a vital method for managing and processing the large and continuously increasing number of documents in digital formats. In general, text classification plays an important role in information extraction, summarization and text retrieval. This paper illustrates the text classification process using SVM and Naïve Bayes techniques, which automatically assign documents to a set of classes based on their textual content. After feature selection, the machine learning algorithms Naïve Bayes and Support Vector Machine (SVM) are applied, and the algorithms are evaluated and compared. Topic-based text categorization classifies documents according to their topics. The experiments are performed with the WEKA tool.

Index Terms: Machine Learning, Feature Selection, Stop words, Naïve Bayesian, Support Vector Machine (SVM), WEKA

1. INTRODUCTION

Texts are written in many genres, for instance scientific articles, news reports, movie reviews, and advertisements. Text classification is the task of assigning a document to a predefined category. More formally, if di is a document from the set of documents D and {C1, C2, ..., Cn} is the set of all classes, then text classification assigns one class Cj to document di. The text classification process consists of the following stages: Read Document, Tokenize Text, Stemming, Stopwords Removal, Vector Representation of Text, Feature Selection or Feature Transformation, and Learning Algorithms. The difficulty of a text classification task naturally varies; as the number of distinct classes increases, so does the difficulty. The WEKA tool is used here for a comparative analysis of the SVM and Naïve Bayes classification algorithms.
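The first preprocessing stages listed above can be sketched in a few lines. This is only a minimal illustration: the stop-word list and the plural-stripping "stemmer" are hypothetical stand-ins chosen for brevity, not the components used in the experiments.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "be", "in", "is", "of", "the", "will"}


def tokenize(text):
    """Lowercase the text and split it into words, keeping internal hyphens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def crude_stem(token):
    """Strip a trailing plural 's'; a stand-in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Tokenize, stem, then drop stop words, following the stage order above."""
    stems = [crude_stem(token) for token in tokenize(text)]
    return [stem for stem in stems if stem not in STOP_WORDS]


terms = preprocess("Chess is the champion of games")
print(terms)
```

The resulting term list is what a later stage would turn into a feature vector.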

2. TEXT CLASSIFICATION

Generally, in the text classification task, a document is expressed as a vector of many dimensions, x = (x1, x2, ..., xl). Each feature of a document vector takes one of two kinds of value: a binary value indicating whether a certain word appears in the document, or a real value weighted by a suitable method, for example TF-IDF.

For example, the two documents "All-star game will be held in Jodhpur" (document 1) and "Chess is the champion of games" (document 2) are expressed as x1 and x2 (Fig. 1) using the four word features all-star, Jodhpur, chess, and game.

                  all-star  Jodhpur  chess  game
document 1 (x1)      1         1       0      1
document 2 (x2)      0         0       1      1

Figure 1: Vector representation of two documents.
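The vectors in Figure 1 can be reproduced with a short sketch. The tokenizer and the rule that a feature also matches its simple plural ("game" matches "games") are assumptions made here for illustration.

```python
import re

# The four word features from the example, in lowercase.
FEATURES = ["all-star", "jodhpur", "chess", "game"]


def tokenize(text):
    """Return the set of lowercase words in the text, keeping internal hyphens."""
    return set(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))


def binary_vector(text, features=FEATURES):
    """Build a 0/1 vector: 1 if the feature word occurs in the text, else 0."""
    tokens = tokenize(text)
    # Assumption: a feature also matches its simple plural form.
    return [1 if f in tokens or f + "s" in tokens else 0 for f in features]


x1 = binary_vector("All-star game will be held in Jodhpur")
x2 = binary_vector("Chess is the champion of games")
print(x1)
print(x2)
```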

In the example above, each document is expressed by a 4-dimensional feature vector. In practice, it is desirable to use at least 10,000 features, or as many as possible, to classify varied documents with high accuracy. However, with most machine learning techniques, having many features causes overfitting and very long computation times. To avoid these problems, several feature selection methods have been proposed that cut the number of features down to between 100 and 10,000 using evaluation criteria such as word frequency, document frequency, mutual information, and information gain.

In addition, each document is given a class label y, which indicates the class to which the document belongs. The number of classes can be two or more. The two-class case, which decides whether or not a document belongs to a class, is the easiest and is called a binary-class problem. A case with three or more classes is called a multi-class problem. The problem can also be divided into the case where a document has only one label and the case where a document has two or more labels, called multi-label. Generally, multi-class or multi-label classification problems are solved by combining many binary-class classifiers.
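Of the feature selection criteria mentioned above, document frequency is the simplest to illustrate. The sketch below keeps the k terms that occur in the most documents; the function name, the toy documents, and the alphabetical tie-break are assumptions made for this example.

```python
from collections import Counter


def select_by_document_frequency(tokenized_docs, k):
    """Return the k terms that appear in the largest number of documents."""
    document_frequency = Counter()
    for tokens in tokenized_docs:
        # Count each term once per document, however often it repeats.
        document_frequency.update(set(tokens))
    # Sort by descending document frequency, then alphabetically for stable output.
    ranked = sorted(document_frequency, key=lambda term: (-document_frequency[term], term))
    return ranked[:k]


docs = [
    ["all-star", "game", "jodhpur"],
    ["chess", "champion", "game"],
    ["chess", "game", "game"],
]
selected = select_by_document_frequency(docs, k=2)
print(selected)
```

Mutual information and information gain follow the same pattern, with a different score replacing the document count.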
3. TEXT CLASSIFICATION ALGORITHMS

Machine learning and natural language processing techniques can be used for the categorization. Existing categorization methods include decision trees, decision rules, k-nearest neighbor, Bayesian approaches, neural networks, regression-based methods, and vector-based methods. In this paper a comparative analysis of Naïve Bayes (NB) and Support Vector Machines (SVM) is carried out.
Support Vector Machine (SVM) is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels. SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables. For categorical variables, a dummy variable is created with case values of either 0 or 1. Thus, a categorical dependent variable consisting of three levels, say (A, B, C), is represented by a set of three dummy variables:

A: {1 0 0}, B: {0 1 0}, C: {0 0 1}
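The dummy-variable encoding shown above can be written as a small helper. The function name `one_hot` is our own choice for this sketch.

```python
def one_hot(value, levels):
    """Encode a categorical value as a list of 0/1 dummy variables."""
    return [1 if value == level else 0 for level in levels]


levels = ["A", "B", "C"]
encoded = {level: one_hot(level, levels) for level in levels}
for level, vector in encoded.items():
    print(level, vector)
```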

To construct an optimal hyperplane, SVM employs an iterative training algorithm, which is used to minimize an error function. According to the form of the error function, SVM models can be classified into four distinct groups:
Classification SVM Type 1 (also known as C-SVM classification)
Classification SVM Type 2 (also known as nu-SVM classification)
Regression SVM Type 1 (also known as epsilon-SVM regression)
Regression SVM Type 2 (also known as nu-SVM regression)
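For reference, the error function minimized in C-SVM classification is commonly written as below. The symbols are standard notation rather than notation from this paper: w is the weight vector, b the bias, ξ_i the slack variables, C the penalty constant, φ the kernel feature map, and N the number of training cases with labels y_i in {-1, +1}.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad
\xi_i \ge 0, \quad i = 1, \dots, N
```

A larger C penalizes training errors more heavily, at the cost of a narrower margin.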

There are a number of kernels that can be used in Support Vector Machine models. These include linear, polynomial, radial basis function (RBF) and sigmoid.

The RBF is by far the most popular choice of kernel type used in Support Vector Machines. This is mainly because of its localized and finite response across the entire range of the real x-axis. To use a radial basis function, one needs to specify the hidden unit activation function, the number of processing units, a criterion for modelling the given task, and a training algorithm for finding the parameters.
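The four kernels named above can be sketched in their usual textbook forms. The parameter names gamma, coef0 and degree and their default values are assumptions for this example, not settings reported in the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    # Depends only on the squared distance, so the response is localized around u.
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=0.5, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


# The two document vectors from Figure 1.
x1 = [1, 1, 0, 1]
x2 = [0, 0, 1, 1]
print(linear_kernel(x1, x2), polynomial_kernel(x1, x2))
print(rbf_kernel(x1, x2), sigmoid_kernel(x1, x2))
```

The RBF value is 1 for identical vectors and decays towards 0 as the vectors move apart, which is the localized, finite response described above.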

4. EXPERIMENTAL DETAILS

4.1. Data

The data used here is the Car dataset, available from the UCI repository (https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data).

4.1.1 Description of Data

The Car dataset relates the class CAR to six input attributes: buying, maint (maintenance), doors, persons, lug_boot (boot space), and safety.

Attribute Information:

Class Values: 4
unacc, acc, good, vgood

Attributes: 6
buying: vhigh, high, med, low
maint: vhigh, high, med, low
doors: 2, 3, 4, 5more
persons: 2, 4, more
lug_boot: small, med, big
safety: low, med, high

The dataset contains 1728 instances in total. Each instance has the six attributes (buying price, maintenance, number of doors, seating capacity, boot space, and safety) plus the class.
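The figure of 1728 instances matches the number of distinct combinations of the attribute values listed above (4 × 4 × 4 × 3 × 3 × 3). The sketch below enumerates those combinations; it assumes the dataset contains each combination exactly once and does not read the actual file.

```python
from itertools import product

# Attribute values as listed in the description of the data.
ATTRIBUTE_VALUES = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every possible combination of the six attribute values.
combinations = list(product(*ATTRIBUTE_VALUES.values()))
print(len(combinations))
```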

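Table 1: Result from WEKA for Car Dataset

Algorithm                          CCI (%)   ICI (%)   KS       MAE      RMSE
Naïve Bayes                        85.53     14.47     0.6665   0.1137   0.2262
Radial Basis Function (RBF-SVM)    94.21     5.79      0.8752   0.676    0.1571

CCI: correctly classified instances; ICI: incorrectly classified instances; KS: Kappa statistic; MAE: mean absolute error; RMSE: root mean squared error.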

From the results obtained for the dataset, RBF (SVM) clearly outperforms the Naïve Bayes algorithm. The Kappa statistic for RBF (0.8752) is close to perfect agreement. RBF correctly classifies 94.21% of the instances, compared with 85.53% for Naïve Bayes. Moreover, RBF has a lower mean absolute error and root mean square error.
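The measures reported in Table 1 can be illustrated on toy data. The sketch below assumes that MAE and RMSE are averaged over every class-probability entry of every instance, which is one common convention for nominal classes and may differ in detail from WEKA's own computation. The labels and probabilities are invented for illustration and are not taken from the Car dataset.

```python
import math


def evaluate(true_labels, prob_rows, classes):
    """Compute accuracy, Cohen's kappa, MAE and RMSE from class probabilities."""
    n, k = len(true_labels), len(classes)
    # Predicted label is the class with the highest probability.
    predicted = [classes[max(range(k), key=lambda j: row[j])] for row in prob_rows]
    accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / n
    # Agreement expected by chance, from the marginal class frequencies.
    expected = sum(
        (true_labels.count(c) / n) * (predicted.count(c) / n) for c in classes
    )
    kappa = (accuracy - expected) / (1 - expected)
    # Errors between predicted probabilities and 0/1 class indicators.
    abs_sum = sq_sum = 0.0
    for label, row in zip(true_labels, prob_rows):
        for j, c in enumerate(classes):
            error = row[j] - (1.0 if c == label else 0.0)
            abs_sum += abs(error)
            sq_sum += error * error
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "mae": abs_sum / (n * k),
        "rmse": math.sqrt(sq_sum / (n * k)),
    }


# Hypothetical toy predictions for a two-class problem.
classes = ["unacc", "acc"]
true_labels = ["unacc", "unacc", "acc", "acc"]
prob_rows = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
metrics = evaluate(true_labels, prob_rows, classes)
print(metrics)
```

Under this convention the MAE can never exceed the RMSE, which gives a quick sanity check on reported values.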

Figure 2: Graph of KS, MAE and RMSE for Dataset (series: Naïve Bayes and RBF (SVM); vertical axis from 0 to 1)

5. CONCLUSION

In this paper two machine learning algorithms, SVM (RBF) and Naïve Bayes, have been evaluated and compared in WEKA. It can be inferred that neither algorithm is perfect.

Because only a single dataset was considered, the comparison covers how the two algorithms differ across the evaluation parameters on that dataset. Using more datasets would yield finer results. Further, different text classification algorithms may behave differently and show different efficiency on different datasets. The accuracy of a predictive model is affected by the attributes chosen from the data.

