Text Classification using Support Vector Machine

DOI : 10.17577/IJERTV1IS3174




Karuna P. Ukey

Department of IT, PRMIT&R, Badnera, Amravati, India

Dr. A. S. Alvi

Department of IT, PRMIT&R, Badnera, Amravati, India

Abstract: Text categorization is the task of automatically sorting text documents into a set of predefined classes. Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge numbers of features. Semantic analysis will be used for feature extraction, eliminating the text representation errors caused by synonyms and polysemes and reducing the dimension of the text vectors.

Keywords: bag of words; feature extraction; support vector machine.

1. INTRODUCTION

TC is the task of assigning documents expressed in natural language into one or more categories belonging to a predefined set. As more and more information becomes available on the internet, there is an ever-growing interest in helping people manage this huge amount of information. Information routing/filtering, identification of objectionable material or junk mail, structured search/browsing, and topic identification are all hot spots in current information management. The assignment of texts to predefined categories based on their content, namely Text Categorization (TC), is an important component among these tasks.

Text representation is a necessary procedure for text categorization tasks. Currently, the bag of words (BOW) is the most widely used text representation method, but it suffers from two drawbacks: first, the number of words is huge; second, it is not feasible to calculate the relationships between words. Semantic analysis (SA) techniques help BOW overcome these two drawbacks by interpreting words and documents in a space of concepts.

One advantage that SVMs offer for TC is that dimensionality reduction is usually not needed, as SVMs tend to be fairly robust to overfitting and can scale up to considerable dimensionalities. Recent extensive experiments also indicate that feature selection tends to be detrimental to the performance of SVMs.

For application developers, the interest in TC is mainly due to the enormously increased need to handle larger and larger quantities of documents, a need emphasized by increased connectivity and the availability of document bases of all types at all levels in the information chain. But this interest is also due to the fact that TC techniques have reached accuracy levels that rival the performance of trained professionals, and these accuracy levels can be achieved with high levels of efficiency on standard hardware and software resources. This means that more and more organizations are automating all of their activities that can be cast as TC tasks.
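As a quick illustration of the bag-of-words idea described above, the following sketch (our own, not from the paper) builds a document-term matrix with scikit-learn and shows how the feature space is simply the corpus vocabulary, which grows very large on real collections.

```python
# Illustrative sketch (not the authors' code): a bag-of-words representation
# is a sparse document-term matrix with one feature per distinct word.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the stock market rallied after the earnings report",
    "the team won the championship game last night",
    "new vaccine trial shows promising results",
]

vectorizer = CountVectorizer()           # one feature per distinct word
X = vectorizer.fit_transform(documents)  # sparse document-term matrix

print(X.shape)                       # (n_documents, n_distinct_words)
print(len(vectorizer.vocabulary_))   # vocabulary size grows quickly with corpus size
```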

  2. PHASES IN LIFE CYCLE OF TEXT CLASSIFICATION

There are three phases in the life cycle of text classification: document indexing, classifier learning, and classifier evaluation.

Document indexing denotes the activity of mapping a document dj into a compact representation of its content that can be directly interpreted (i) by a classifier-building algorithm and (ii) by a classifier, once it has been built. An indexing method is characterized by (i) a definition of what a term is, and (ii) a method to compute term weights. Concerning (i), the most frequent choice is to identify terms either with the words occurring in the document or with their stems. A popular choice is to add to the set of words or stems a set of phrases, i.e. longer (and semantically more significant) language units extracted from the text by shallow parsing and/or statistical techniques. Concerning (ii), term weights may be binary-valued or real-valued, depending on whether the classifier-building algorithm and the classifiers, once they have been built, require binary input or not. When weights are binary, they simply indicate the presence or absence of the term in the document. When weights are non-binary, they are computed by either statistical or probabilistic techniques, the former being the most common option.
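The two weighting options above can be contrasted with a short sketch. Using scikit-learn here is our own illustrative choice, not something prescribed by the paper; tf-idf stands in for the "statistical" weighting it mentions.

```python
# Sketch: binary term weights versus statistical (tf-idf) term weights.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["cats chase mice", "dogs chase cats", "mice fear cats and dogs"]

# Binary weights: 1 if the term occurs in the document, 0 otherwise.
binary_bow = CountVectorizer(binary=True).fit_transform(docs)

# Statistical weights: term frequency scaled by inverse document frequency.
tfidf = TfidfVectorizer().fit_transform(docs)

print(binary_bow.toarray())
print(tfidf.toarray().round(2))
```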

A text classifier for ci is automatically generated by a general inductive process (the learner) which, by observing the characteristics of a set of documents preclassified under ci or its complement, gleans the characteristics that a new unseen document should have in order to belong to ci. In order to build classifiers for C, one thus needs a training set of documents such that the membership value for (dj, ci) is known for every pair (dj, ci) in the training set × C.

Classifier Evaluation

Training efficiency (i.e. the average time required to build a classifier Φi from a given corpus), classification efficiency (i.e. the average time required to classify a document by means of Φi), and effectiveness (i.e. the average correctness of Φi's classification behaviour) are all legitimate measures of success for a learner.

In TC research, effectiveness is usually considered the most important criterion, since it is the most reliable one when it comes to experimentally comparing different learners or different TC methodologies, given that efficiency depends on too-volatile parameters (e.g. different software/hardware platforms). In TC applications, however, all three parameters are important, and one must carefully look for a trade-off among them depending on the application constraints. For instance, in applications involving interaction with the user, a classifier with low classification efficiency is unsuitable.
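The three criteria above can all be measured directly. The sketch below is an illustrative setup of our own (20 Newsgroups data and a linear SVM are assumed stand-ins, not the paper's experiment): training time captures training efficiency, prediction time captures classification efficiency, and accuracy captures effectiveness.

```python
# Sketch: measuring training efficiency, classification efficiency, and effectiveness.
import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
test = fetch_20newsgroups(subset="test", categories=["sci.space", "rec.autos"])

vec = TfidfVectorizer()
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)

t0 = time.time()
clf = LinearSVC().fit(X_train, train.target)   # training efficiency
train_time = time.time() - t0

t0 = time.time()
pred = clf.predict(X_test)                     # classification efficiency
test_time = time.time() - t0

print(train_time, test_time, accuracy_score(test.target, pred))  # effectiveness
```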

  3. SUPPORT VECTOR MACHINE

SVM is an effective technique for classifying high-dimensional data. Unlike the nearest-neighbour classifier, SVM learns the optimal hyperplane that separates training examples from different classes by maximizing the classification margin. It is also applicable to data sets with nonlinear decision surfaces by employing a technique known as the kernel trick, which projects the input data into a higher-dimensional feature space where a linear separating hyperplane can be found. SVM avoids the costly similarity computation in the high-dimensional feature space by using a surrogate kernel function. It is known that support vector machines (SVMs) are capable of effectively processing feature vectors of some 10,000 dimensions, given that these are sparse. Several authors have shown that support vector machines provide a fast and effective means for learning text classifiers from examples, and that documents of a given topic can be identified with high accuracy.

A Support Vector Machine (SVM) is a supervised learning method for classification that finds the linear separating hyperplane which maximizes the margin between the two data sets, i.e., the optimal separating hyperplane (OSH). An optimal SVM algorithm based on multiple optimization strategies has been developed in the latest presented technique for document classification. Among all classification techniques, SVM and Naïve Bayes have been recognized as among the most effective and widely used text classification methods, and comprehensive comparisons of supervised machine learning methods for text classification are available in the literature.

One remarkable property of SVMs is that their ability to learn can be independent of the dimensionality of the feature space. SVMs measure the complexity of hypotheses based on the margin with which they separate the data, not on the number of features. This means that we can generalize even in the presence of many features, if our data is separable with a wide margin using functions from the hypothesis space. The same margin argument also suggests a heuristic for selecting good parameter settings for the learner (like the kernel width in an RBF network): the best parameter setting is the one which produces the hypothesis with the lowest VC-dimension. This allows fully automatic parameter tuning without expensive cross-validation.
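The kernel trick and the kernel-width parameter mentioned above can be illustrated with a small synthetic example (our own sketch; the data, the gamma value, and the use of scikit-learn's SVC are illustrative assumptions, not the paper's setup).

```python
# Sketch: a linear hyperplane versus an RBF kernel on a nonlinear decision surface.
# gamma controls the RBF kernel width referred to in the text.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # circular (nonlinear) boundary

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)      # kernel width = gamma

print(linear_svm.score(X, y))  # struggles on a circular boundary
print(rbf_svm.score(X, y))     # near-perfect after implicit mapping to a higher-dimensional space
```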

Why Should SVMs Work Well for Text Categorization?

To find out what methods are promising for learning text classifiers, we should find out more about the properties of text. Text categorization involves high-dimensional input spaces with very many features. Since SVMs use overfitting protection which does not necessarily depend on the number of features, they have the potential to handle these large feature spaces.

Moreover, a classifier using only the "worst" features still has a performance much better than random. Since it seems unlikely that all those features are completely redundant, this leads to the conjecture that a good classifier should combine many features (learn a "dense" concept) and that aggressive feature selection may result in a loss of information.
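The conjecture above can be probed empirically. The following sketch (our own, with an illustrative dataset and an arbitrary k of 200 features) compares a linear SVM trained on the full tf-idf feature space against one trained after aggressive chi-square feature selection; the numbers produced are not results from the paper.

```python
# Sketch: full feature space versus aggressive feature selection before a linear SVM.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
test = fetch_20newsgroups(subset="test", categories=["sci.med", "sci.space"])

full = make_pipeline(TfidfVectorizer(), LinearSVC())
pruned = make_pipeline(TfidfVectorizer(), SelectKBest(chi2, k=200), LinearSVC())

for name, model in [("all features", full), ("200 selected features", pruned)]:
    model.fit(train.data, train.target)
    print(name, model.score(test.data, test.target))
```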

4. PROPOSED SYSTEM DESIGN

The work will be carried out as follows (a rough sketch of how steps 2 through 5 might be chained is given after the list):

1. Analysis of available text classification systems.

2. Implementation of a text pre-processor.

3. Feature extraction using semantic analysis.

4. Vectorization of text.

5. Classification of text using an SVM classifier.

6. Comparison of the system with already available systems.

7. Performance evaluation and result analysis.
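The sketch below shows one way steps 2 through 5 could be chained. It is illustrative only and assumes scikit-learn components as stand-ins: TfidfVectorizer for the pre-processor and vectorizer, TruncatedSVD for LSA-style semantic analysis, and SVC as the SVM classifier; these are not the authors' implementation choices.

```python
# Illustrative end-to-end pipeline for steps 2-5 of the proposed design.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

text_classifier = Pipeline([
    ("preprocess_and_vectorize", TfidfVectorizer(stop_words="english")),  # steps 2 and 4
    ("semantic_analysis", TruncatedSVD(n_components=100)),                # step 3 (LSA)
    ("svm", SVC(kernel="linear")),                                        # step 5
])

# Usage: text_classifier.fit(train_texts, train_labels)
#        predictions = text_classifier.predict(test_texts)
```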


Texts are unstructured and use natural human language, which makes their semantics difficult for a computer to deal with, so they require pre-processing. Text pre-processing mainly segments texts into words.

LSA is used in this module for feature extraction and for dimensionality reduction of the word-document matrix of the training set. The k largest singular values and their corresponding singular vectors are extracted by singular value decomposition of the word-document matrix to constitute a new matrix that approximately represents the original word-document matrix. Compared with VSM, it can reflect the semantic links between words and the impact of context on word meanings, eliminate the discrepancy in text representation caused by synonyms and polysemes, and reduce the dimension of the text vectors.

In this model, each row vector of the word-document matrix represents a text; this is the vectorization of the text. During testing, after each test sample is segmented into words, the initial text vectors are mapped into a latent semantic space by the LSA vector space model in this module, to generate new text vectors.
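A small sketch of this LSA module follows, assuming scikit-learn's TruncatedSVD as the SVD implementation (an illustrative choice, with a toy corpus and k = 2): the decomposition is fitted on the training word-document matrix and test texts are then mapped into the same latent semantic space.

```python
# Sketch: fit LSA on the training word-document matrix, then map test texts
# into the k-dimensional latent semantic space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

train_texts = ["car engine oil change", "engine repair and oil",
               "stars and planets orbit", "planets orbit the sun"]
test_texts = ["oil change for the car", "the sun and other stars"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)      # word-document matrix (training set)

lsa = TruncatedSVD(n_components=2, random_state=0)   # keep the k = 2 largest singular values
Z_train = lsa.fit_transform(X_train)                 # training texts in latent semantic space

X_test = vectorizer.transform(test_texts)
Z_test = lsa.transform(X_test)                       # new text vectors for the test samples
print(Z_test.shape)                                  # (n_test_texts, k)
```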

Finally, the new text vectors are classified in the IHS-SVM classification module. IHS-SVM is an improvement on HS-SVM, both of which use a minimum enclosing ball (hyper-sphere) to describe each class of texts. When determining categories, HS-SVM finds the hyper-sphere closest to the test sample, and the category that sphere stands for is the one the sample is assigned to. However, texts in overlapping regions cannot be classified correctly in this way. IHS-SVM divides samples into three types: those not in any hyper-sphere, those contained in only one, and those included in multiple hyper-spheres. The classification of the first two types is the same as in HS-SVM; for the third type, it compares the concentration of the test sample with respect to each hyper-sphere and assigns the sample to the one with the highest concentration.
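To convey only the decision logic described above, here is a heavily simplified sketch of our own. It is not HS-SVM or IHS-SVM: the true methods solve a minimum enclosing ball optimization, whereas this illustration merely approximates each class ball by its centroid and covering radius and then applies the inside-one / inside-many / outside-all decision rule.

```python
# Simplified illustration of the hyper-sphere-per-class decision logic (not real HS-SVM).
import numpy as np

def fit_spheres(X, y):
    spheres = {}
    for label in np.unique(y):
        pts = X[y == label]
        center = pts.mean(axis=0)
        radius = np.linalg.norm(pts - center, axis=1).max()  # ball covering the class samples
        spheres[label] = (center, radius)
    return spheres

def classify(x, spheres):
    # distance from x to each sphere's centre, relative to that sphere's radius
    scores = {label: np.linalg.norm(x - c) / r for label, (c, r) in spheres.items()}
    inside = [label for label, s in scores.items() if s <= 1.0]
    if len(inside) == 1:                        # contained in exactly one hyper-sphere
        return inside[0]
    candidates = inside if inside else scores   # overlap region, or outside all spheres
    return min(candidates, key=lambda label: scores[label])  # closest / most "concentrated"
```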

The process of feature extraction is to make clear the borders of each language structure and to eliminate, as much as possible, the language-dependent factors: tokenization, stop-word removal, and stemming. Feature extraction is the first step of pre-processing and is used to present the text documents in a clean word format; removing stop words and stemming are its main tasks. The documents in text classification are represented by a great number of features, and most of them may be irrelevant or noisy. Dimension reduction is the exclusion of a large number of keywords, based preferably on a statistical criterion, to create a low-dimensional vector. Dimension reduction techniques have attracted much attention recently, since effective dimension reduction makes learning tasks such as classification more efficient and saves storage space. Commonly, the steps taken for feature extraction are:

Tokenization: a document is treated as a string and then partitioned into a list of tokens.

Removing stop words: stop words such as "the", "a", "and", etc. occur frequently, so these insignificant words are removed.

Stemming: a stemming algorithm converts different word forms into a similar canonical form; this step conflates tokens to their root form, e.g. "connection" to "connect", "computing" to "compute".
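A minimal sketch of these three steps is given below. The tiny stop-word list is an illustrative subset, and NLTK's PorterStemmer is one common stemmer choice assumed here rather than anything mandated by the paper.

```python
# Sketch: tokenization, stop-word removal, and stemming.
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}   # illustrative subset only
stemmer = PorterStemmer()

def preprocess(document: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", document.lower())       # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]     # stop-word removal
    return [stemmer.stem(t) for t in tokens]                # stemming to root forms

print(preprocess("The connection of computing devices is expanding"))
# e.g. ['connect', 'comput', 'devic', 'expand']
```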

VSM based on text keywords quantizes the document vector with the weights of the words; it is efficient and easy to use. However, it only counts the frequency of the words, while ignoring the semantic links among them and the impact of context on their meanings. Text similarity thus depends only on the number of words the texts share, which reduces classification accuracy in the presence of polysemes and synonyms. In addition, the text matrices constructed by VSM are generally high-dimensional sparse matrices, inefficient in training and classification and not suitable for handling large-scale text sets. LSA, however, can effectively overcome these limitations. It assumes that there is a latent semantic structure between the words of a text, hidden in their patterns of contextual usage. The k largest singular values and their corresponding singular vectors are therefore extracted by singular value decomposition of the word-document matrix, to constitute a new matrix that approximately represents the word-document matrix of the original document set. Text represented in the high-dimensional VSM is thus mapped into a low-dimensional latent semantic space, and the latent semantic structure can be extracted without the distorting impact of correlation between words, yielding higher text representation accuracy. LSA is based on singular value decomposition: it maps texts and words from a high-dimensional vector space to a low-dimensional one, reducing text dimensions and improving text representation accuracy.

Step 1: Construct a word-document matrix A. In the LSA model, a text set can be expressed as a word-document matrix of size m × n (m is the number of entries contained in a text, n is the number of texts).

Step 2: Perform singular value decomposition. A is decomposed into the product of three matrices, A = U'S'V'^T, where U' and V' are orthogonal matrices and S' is a diagonal matrix of singular values. Retain the rows and columns of S' containing the k largest singular values to get a new diagonal matrix S, and retain the corresponding parts of U' and V' to get U and V. A new word-document matrix R = USV^T is thus constructed. For a text d, words are screened by singular value decomposition to form new vectors that replace the original text feature vectors, ignoring factors of smaller influence and lesser importance. Keywords that do not appear in a text will still be represented in the new word-document matrix if they are associated with the text's semantics. Therefore, the new matrix reflects the potential semantic relations among keywords from a numerical point of view, and it is the closest matrix to the original term-frequency matrix in the least-squares sense. The meaning of each dimension of the vector space is greatly changed in this process: it reflects a strengthened semantic relationship instead of the simple appearance frequency and distribution relationship of entries. The dimension reduction of the vector space can also effectively improve the classification speed on text sets.
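A numeric sketch of Steps 1-2 follows, assuming a toy m × n word-document matrix A (the values are made up for illustration) and numpy's SVD in the role of the decomposition above.

```python
# Sketch: truncated SVD giving the rank-k approximation R = U S V^T of A.
import numpy as np

A = np.array([[2, 0, 1, 0],        # each row: a word; each column: a document
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2],
              [1, 0, 0, 1]], dtype=float)

U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)

k = 2                              # keep the k largest singular values
U, S, Vt = U_full[:, :k], np.diag(s_full[:k]), Vt_full[:k, :]

R = U @ S @ Vt                     # new word-document matrix R = U S V^T
print(np.round(R, 2))              # closest rank-k matrix to A in the least-squares sense
```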

CONCLUSION

It is possible to develop a text classification system for documents using a support vector machine together with semantic analysis. By combining the support vector machine with semantic analysis, the system can give more accurate results.

Automated text classification is attractive because it frees organizations from the need to manually organize document bases, which can be too expensive, or simply not feasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals.

ACKNOWLEDGMENT

First and foremost, I would like to express my sincere gratitude towards my project guide, Prof. A. S. Alvi, for his valuable guidance, encouragement, and optimism. I feel proud of his eminence and vast knowledge, which will guide me throughout my life.

REFERENCES

1. Fabrizio Sebastiani, "Text Categorization," in Alessandro Zanasi (ed.), Text Mining and Its Applications, Southampton: WIT Press, pp. 109-129, 2005.

2. Evgeniy Gabrilovich and Shaul Markovitch, "Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5," Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.

3. C. H. Li, "An Efficient Document Categorization Model Based on LSA and BPNN," Sixth International Conference on ALPIT, pp. 9-14, 2007.

4. Aurangzeb Khan, Baharum B. Bahurdin, and Khairullah Khan, "An Overview of E-Documents Classification," Department of Computer & Information Science, Universiti Teknologi PETRONAS, 2009 International Conference on Machine Learning and Computing, IPCSIT vol. 3, IACSIT Press, Singapore, 2011.

5. Thorsten Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Universität Dortmund, Informatik LS8, Baroper Str. 301, 44221 Dortmund, Germany.

6. Yu-feng Zhang, "Research of Text Classification Model Based on Latent Semantic Analysis and Improved HS-SVM," Center for Studies of Information Resources, Wuhan University, Wuhan, China.

7. Haibin Cheng, Pang-Ning Tan, and Rong Jin, "Efficient Algorithm for Localized Support Vector Machine," IEEE.

8. Sharma Chakravarthy, "A Graph-Based Approach for Multi-Folder Email Classification," Department of Computer Science & Engineering, University of Texas at Arlington, Arlington, TX, USA, sharmac@uta.edu.
