Multilingual Text Classification

Sonam Mittal; Prof.  Praveen Dhyani

doi:10.17577/IJERTV4IS030032

Volume 04, Issue 03 (March 2015)

Published (First Online): 11-03-2015
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Sonam Mittal

Computer Science Dept.

B K Birla Institute of Engineering & Technology, Pilani, Rajasthan, India

Prof. Praveen Dhyani

Executive Director

Banasthali Vidyapith, Jaipur, India

Abstract: Identifying the language of a document is typically the first step in most Natural Language Processing tasks. Among the wide variety of language identification methods discussed in the literature, those that follow the Cavnar and Trenkle (1994) approach to text categorization, which is based on character n-gram frequencies, have been particularly successful. Multilingual text classification using N-gram techniques appears to produce very interesting results in text categorization, not only for languages such as English and French but equally for languages that are more difficult to classify, such as Spanish, Italian, German and Russian.

Keywords: Multilingual Text, N-gram, tf-idf, frequency, similarity, classification, prediction, classifier

INTRODUCTION

Automated text categorization is a supervised learning task in which category labels are assigned to new documents based on the likelihood suggested by a set of labeled documents. Classifying the language of a document requires several essential steps: preprocessing the text to obtain terms, identifying the important terms, and applying a classifier (here, a Naïve Bayes classifier) [2].

Language classification is an important task for today's World Wide Web, where an increasing number of documents are in languages other than English. It is used in areas such as search engine indexing, text mining, spam filtering and other applications that apply language-specific algorithms. Such classification is a key step in processing large document streams and is a data-intensive task [3].

We first describe each phase of the system in terms of the functions it performs and then present how the code is organized and executed.
METHODOLOGY

The proposed system predicts the language of an incoming document. The languages covered are English, Spanish and Italian, with documents taken from [4]. The system is first trained on 20% of the corpus and then tested on 80% of the corpus. The efficiency of the system comes out to be 99%. The documents are in XML format. Spanish and Italian were chosen because their vocabularies are very close, which makes the classification task more challenging. The classification task consists of four phases:
- Document Preprocessing
- TF-IDF Analysis
- Training the Model
- Testing the Model
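The inverse document frequency (idf) measures whether a term is common or rare across all the documents. In its standard form it is obtained by dividing the total number of documents N by the number of documents n_t that contain the term t, and taking the logarithm of that quotient:

idf(t) = log(N / n_t)

Finally, the tf-idf is calculated by:

TF-IDF = tf × idf

where tf is the term frequency, i.e. how often the term occurs in a document.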
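The computation can be illustrated with a minimal Python sketch. It assumes that tf is the raw count of a term in a document and that idf is the natural logarithm of N / n_t; the paper does not state the logarithm base or any normalization, and the function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf-idf} dict per document.

    Assumptions (not taken from the paper): tf is the raw count of a term
    in a document, and idf(t) = ln(N / n_t).
    """
    n_docs = len(documents)
    # n_t: number of documents that contain each term at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({t: c * math.log(n_docs / doc_freq[t])
                       for t, c in counts.items()})
    return scores

# Toy documents, already split into terms
docs = [["the", "cat", "the"], ["the", "dog"], ["il", "gatto"]]
scores = tf_idf(docs)
```

With this definition, a term that occurs in every document gets a score of zero, so only terms that help separate the documents keep a positive weight.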

The results of this phase are stored in the sub-folder TFIDF Analyser results under the Profiles folder. The TF-IDF values are the basis for classification.

Based on the TF-IDF values of the terms, we select for each category a set of, say, 25 terms with the highest frequency (or TF-IDF value). These terms are called the keywords of the category, and the collection of keywords together with their tf-idf values is called the dictionary of the category. The dictionary is used in the testing phase, where the category of a test document is predicted.
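Selecting the keywords of a category amounts to a top-k selection over its scored terms. The following sketch shows one way to do it; the function name `build_dictionary` and the example scores are hypothetical and not taken from the paper's code.

```python
import heapq

def build_dictionary(category_scores, size=25):
    """Keep the `size` highest-scoring terms of one category.

    category_scores maps each term to its TF-IDF value.
    """
    top = heapq.nlargest(size, category_scores.items(), key=lambda kv: kv[1])
    return dict(top)  # insertion order follows descending score

# Hypothetical TF-IDF values for a Spanish category profile
es_scores = {"que": 0.9, "los": 0.7, "the": 0.1, "del": 0.8}
es_dictionary = build_dictionary(es_scores, size=3)
```

Here `es_dictionary` keeps "que", "del" and "los" and drops the low-scoring "the".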
- Main.py: set the values of N, size and choice.
- Preprocessing.py: remove noise; generate document-level profiles and category profiles.
- TFIDF_Analyzer.py: calculate TF-IDF values; generate the keywords and the dictionary.
- Training.py: calculate the probability map; train the model by reading the training files.
- Testing.py: use the trained model to predict the language category of test documents with the Naïve Bayes classifier.

Fig 1: Architecture of Language Classification System
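The training and testing steps can be sketched as a small multinomial Naïve Bayes classifier. This is an illustration under stated assumptions, not the paper's implementation: the "probability map" is taken to be per-category term probabilities, add-one smoothing is used, priors come from document counts, and the names `train` and `predict` are hypothetical.

```python
import math
from collections import Counter

def train(training_docs):
    """training_docs: {category: [term lists]} -> per-category log-probabilities."""
    vocab = {t for docs in training_docs.values() for d in docs for t in d}
    total_docs = sum(len(docs) for docs in training_docs.values())
    model = {}
    for cat, docs in training_docs.items():
        counts = Counter(t for d in docs for t in d)
        total = sum(counts.values())
        model[cat] = {
            "log_prior": math.log(len(docs) / total_docs),
            # Add-one smoothing (an assumption) keeps unseen terms from
            # driving a category's probability to zero.
            "log_prob": {t: math.log((counts[t] + 1) / (total + len(vocab)))
                         for t in vocab},
        }
    return model

def predict(model, terms):
    """Return the category with the highest posterior log-probability."""
    def score(cat):
        entry = model[cat]
        # Terms outside the training vocabulary are ignored.
        return entry["log_prior"] + sum(
            entry["log_prob"][t] for t in terms if t in entry["log_prob"])
    return max(model, key=score)

# Toy training data; the real system would use terms from the corpus profiles
training = {
    "en": [["the", "house", "is", "red"], ["the", "dog"]],
    "es": [["la", "casa", "es", "roja"], ["el", "perro"]],
    "it": [["la", "casa", "è", "rossa"], ["il", "cane"]],
}
model = train(training)
```

Working in log space avoids numerical underflow when a document contains many terms.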
ORGANIZATION AND EXECUTION OF THE PROGRAM

Organization

The code for this multilingual classifier is organized into four files, matching the phases described in the previous section. The file Preprocessing.py is responsible for the preprocessing part of the program. It has functions for removing noise, calculating n-grams (if the value of N is greater than 0) and generating the document-level profiles. The file TFIDF_Analyzer.py calculates the TF-IDF values for the preprocessed files; it has functions to count the terms and their respective frequencies. Training.py reads the files from each category sub-folder (en, es, it) under the parent folder called Training set. Testing.py uses the trained model to predict the language of the documents in the folder Testing set. The required format of test-file names is described under Execution.
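The preprocessing step can be sketched as follows. The sketch assumes that "noise" means digits, punctuation and letter case, that n-grams are taken at character level, and that N = 0 falls back to whole words; the function names are hypothetical, and the standard library is used in place of the pyngram package.

```python
import re
from collections import Counter

def remove_noise(text):
    """Lowercase the text and replace digits and punctuation with single spaces."""
    return re.sub(r"[\W\d_]+", " ", text.lower()).strip()

def char_ngrams(text, n):
    """All overlapping character n-grams of `text`; whole words if n <= 0."""
    if n <= 0:
        return text.split()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def document_profile(text, n):
    """Map each n-gram (or word) of the cleaned text to its frequency."""
    return Counter(char_ngrams(remove_noise(text), n))

profile = document_profile("La casa, la CASA!", 3)
```

A category profile could then be built by summing the document profiles of all training files in that category, since `Counter` objects can be added together.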
Execution

The program can be run in two modes:
- Normal Mode: A single fixed value is used for the N-gram length (the variable N) and for the size of the keyword dictionary (the variable size); both are initialized in Main.py. This mode is selected by setting the variable choice to 0.
- Mega Run Mode: When the variable choice is set to 1, the code runs over a list of N-gram values and a list of dictionary sizes, i.e. it executes with many different configurations and is therefore very time consuming (with the given values, it takes 10 hours to complete).

Requirements and input conventions:
- System Requirements: Linux (Ubuntu 13.04 is used)
- Packages: numpy, pyngram
- The documents to be tested/predicted must be placed in the folder Testing Set, must be XML files, and must have the two-character code of their original language category immediately before the .xml extension of the file name. Examples of test file names are: hello es.xml, irtest it.xml.
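The two run modes and the file-naming convention can be sketched as below. The variable names N, size and choice come from the text; the function names and the default lists of N-gram values and dictionary sizes are hypothetical, since the paper does not list the values used in mega run mode.

```python
from itertools import product
from pathlib import Path

def configurations(choice, n=3, size=25,
                   n_values=(0, 2, 3, 4), sizes=(25, 50, 100)):
    """Return the (N, size) pairs to run.

    choice == 0: normal mode, one fixed configuration.
    choice == 1: mega run mode, every combination of the two lists.
    The default lists are placeholders, not the paper's values.
    """
    if choice == 0:
        return [(n, size)]
    return list(product(n_values, sizes))

def expected_language(filename):
    """Read the two-character language code just before the file extension."""
    return Path(filename).stem[-2:].lower()

runs = configurations(choice=1)
label = expected_language("hello es.xml")
```

The label read from the file name serves as the ground truth against which the predicted language of each test document is compared.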
Observations

The efficiency of the system is 65% when no n-grams are applied and the keyword list has 25 entries. With 4-grams the efficiency increases to 96% for the same keyword list size. 10-fold cross-validation may give better results. As expected, the larger the keyword list, the better the efficiency. The value of N can significantly change the performance of the system; in our observations, 3 and 4 are the best values.
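A sketch of how such figures could be computed is given below. It assumes that "efficiency" means the fraction of test documents whose language is predicted correctly, which the paper does not define explicitly, and it shows one simple way to form the folds for 10-fold cross-validation. Both function names are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of documents whose predicted label equals the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def k_fold_splits(items, k=10):
    """Yield (training, test) lists; each item is in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]

score = accuracy(["en", "es", "it", "es"], ["en", "es", "it", "it"])
splits = list(k_fold_splits(list(range(20)), k=10))
```

Averaging the accuracy over the ten splits gives an estimate that depends less on one particular division of the corpus into training and test sets.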

REFERENCES

[1] N-Gram-Based Text Categorization, William B. Cavnar and John M. Trenkle, 1994.
[2] Is Naïve Bayes a Good Classifier for Document Classification?, S.L. Ting, W.H. Ip and Albert H.C. Tsang, International Journal of Software Engineering & Its Applications, Vol. 5, No. 3, July 2011.
[3] Multilingual Text Categorization using Character N-gram, Suzuki M., Yamagishi N., Yi Ching Tsai and Hirasawa S., Soft Computing in Industrial Applications, 2008.
[4] http://optima.jrc.it/Acquis/JRC-Acquis 3.0/corpus/
