 Open Access
 Total Downloads : 372
 Authors : Nikhil R. Vyawahare, S. G. Lade
 Paper ID : IJERTV3IS21275
 Volume & Issue : Volume 03, Issue 02 (February 2014)
 Published (First Online): 10032014
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Document Classification using Parallel Processing
Nikhil R. Vyawahare
MTech CSEIT
Vishwakarma Institute of technology,
666, Upper Indira Nagar, Bibvewadi, Pune, Maharashtra 411037.
S. G. Lade
Assistant Professor Vishwakarma Institute of technology,
666, Upper Indira Nagar, Bibvewadi, Pune, Maharashtra 411037.
Abstract The wide availability of huge number of documents from different domains, realtime and archrivals data documents increase as fast as or faster than computing power. Many classical algorithms were developed to solve classification problem but many of them usually have large time complexity and with increasing number of documents it is necessary to find algorithm which are able to find solution in reasonable time. Because sequential processor can perform a process at a time. The parallel processing is a novel technique for scaling up the algorithms. Parallel processing for document classification can take the advantage of increasing availability of multiprocessors or multicores processors. The Graphics processing unit is consists of multicore processor and they can perform parallel process in small amount of time.

INTRODUCTION
A large portion of the available information is stored in document database which consist of large collections of documents from various areas like news articles, research papers, books, ebooks, emails, and Web pages and these can be known as document database. Document databases are rapidly growing due to the increasing amount of information available in electronic form. Nowadays most of the information in government, small or large industry, business, and other institutions are store in the form of documents. Document classification is process of assigning predefine label or categories to the documents which include unstructured and semi structured information. It is important because, with existence of a tremendous number of documents, it is tedious and essential to be able to automatically organize such documents into classes to facilitate document retrieval and subsequent analysis. Document classification is the primary process of retrieving, filtering, clustering and extracting documents. A general procedure for document classification is as follows: First, a set of preclassified documents is taken as the training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process. The so derived classification scheme can be used for classification of other new documents.
The implementation of document classification over the large set of documents on sequential processor (CPU) takes large amount of time for learning and classification. Many classical algorithms were developed to solve classification problem but many of them usually have large time complexity and with increasing number of documents it is necessary to find algorithm which are able to find solution in reasonable
time, Because CPU can perform a calculation one at a time. The parallel processing is a novel technique for scaling up the algorithms. Parallel processing for document classification can take the advantage of increasing availability of multi processors or multicores processors. The Graphics processing unit is consists of multiprocessor and they can perform parallel process in small amount of time. The GPU is not only the used by graphical application but also nongraphics application (like classification).
In this paper, we introduce the parallel process for document classification. For Implementation of such system we are using the NVIDIA Compute Unified Device Architecture, through a new API, an easy way to take advantage of the high performance of GPUs for parallel processing [2].
The rest of the paper is organized as follows. The section 2 describes process of document classification, section 3 introduce details about Graphic Processing Unit. Section 4 describes the algorithm for suggested in this paper for parallel processing. Section 5 contains description of experimental results. And remaining sections contains conclusion and references.

DOCUMENT CLASSIFICATION
Document classification is process of assigning the documents to predefined categories. Let, if a document di D belongs to the category ci C according to the knowledge of the correct categories known for subset DT D of training documents, where D is the collection of all documents and C is collection of all categories. Generally, each document may belong to more than one category and each category may contain more than one document [1].
The document classification task can be solved by automatic classifier. The documents need to be in the form of input that classifier accept. Automatic document classifier needs several preprocessing steps, which must convert document into classification capable form. Preprocessing of documents follows several steps, the first step in pre processing is preprocessing of words, a creation of vector representation of the documents. Each document is parsed out and list of used word with their frequency is extracted. Each word is compared with list of stopwords, which are useless and not take part in classification, because they are presents in most of the documents and it increase the length of document. Each word has to converted into it canonical form using
normalisation algorithm called as stemming, such as Porter Stemming Algorithm[23], it propose five rules to convert a word into its canonical form called as stem. A word is form by two components stem and its prefixes and postfixes, it will be more useful for classification if these prefixes and postfixes are removed. When these process is finished, the document collection is represented as document term frequency matrix ( Doci * TFi,j), where Doci refer to the each document in the collection and TFi,j refer to the frequency of j term in the document i. In this representation, only relation of the term to the individual document is concerned, but the classification across the document collection must be computed, therefore we need to evaluate term importance in individual document according the importance of the term in collection and it will called as weights. One of the way we may define a weights to the terms is according to TFIDF (term frequency inverse frequency). The weights wi,j of the term j in document i is computed as :
wi, j ti, j log F
fi
Where ti,j is the number of times that the j appears in document i,fi is the number of times that the term tj appears in the entire document database and F is the number of unique terms in document collection.
The previous paragraph described the process for conversion of document collection into (document term weight) matrix, but the number terms with nontrivial weight for each document is still large (tenths of thousands). This may cause problem with automatic classifiers because of the large number of terms to process. Therefore, feature/term selection were may be applied. Several approach to feature selection were developed in [23] and [24]. One of the popular approach is entropy weight scheme. The entropy weighting scheme is computed to the each term as a multiplication of the local weighting scheme Lij and the global weighting scheme Gj of the document i and term j. The definition of the scheme is following,
ij
L 1 log TFij,TFij 0
0, otherwise
which CPUs are designed to perform well therefore, one should expect that most applications will use both CPUs and GPUs, executing te sequential parts on the CPU and numerically intensive parts on the GPUs. This is why the [10] CUDA (Compute Unified Device Architecture) programming model, introduced by NVIDIA in 2007, is designed to support joint CPU/GPU execution of non graphics application.
IV. KNEAREST NEIGHBOUR (KNN)
KNearest Neighbour algorithm(kNN) is a classification technique in data mining. This algorithm is based on a majority vote of the k closest training samples in the feature space. If k=1, then the object is simply assigned to the class of its nearest neighbour. The ks value can be anything it is depending on what dataset is used for classification. Various measures (e.g. Euclidean distance, cosine measurement, KL) can be used to compute the distance between two data sample points in feature space, the most desirable distance metrics may differ in different applications.
K nearest neighbour or KNN[5] classification determines the decision boundary locally. For only one nearest neighbour we assign each document to the class of its closest neighbour. For KNN we assign each document to the majority class of its k closest neighbours where k is a parameter. The work of kNN classification is based on the contiguity hypothesis that we expect a test document d to have the same label as the training documents located in the local region surrounding d. The Voronoi tessellation of a set of objects space into Voronoi cells, each objects cell will consists of all points that are closer to the object than to other objects. In our case, the objects are documents. Then Voronoi tessellation partitions the plane into D set it seems like convex type of polygons, each containing its corresponding document , where a convex polygon is a convex region in 2dimensional space bounded by lines. For general k N in KNN, consider the region in the space for which the set of k nearest neighbours is the same. This again is a convex polygon and the space is
N
partitioned into convex polygons, within each of which the set of k nearest neighbours is invariant.
1
TFij log TFij
Gk
k 1
Fj Fk
log N

Parallel KNN
Where N is number of document in collection, TFij refers to the frequency of the j term in the document i, Fh is a frequency of term k in the entire document collection.
III. GPU
In GPU[12], the GPU takes advantage of a large number of execution threads to find work to do when some of them are waiting for longlatency memory accesses, thus minimizing the control logic required for each execution thread. Small cache memories are provided to help control the bandwidth requirements of these applications so multiple threads that access the same memory data do not need to all go to the DRAM. The GPUs are designed as numeric computing engines and they will not perform well on some tasks on
The parallelization of kNN is not applied on training period but on the prediction of the unknown instances, which is given as follows:

Let D be the dataset, Partition the dataset D into P blocks like D1,D2, .. ,DP , each processor will handles roughly D/P

Processor Pr calculates the k nearest neighbours Nr
with the local training samples Dr.

A global reduction computes the overall k nearest neighbours Nglobal from N1,…., NP, and then assign the object to the class which most common amongst Nglobal.


Algorithm
As describe in above, Parallel kNN, it partition data set D into P processors or cores and each core calculates k nearest neighbour for their local documents. The results from all these cores, gives the total k nearest neighbour for the test document, and this operation is perform by the CPU.
Training:

Document Collection into D dataset.

Tokenization of documents, it convert documents into set of tokens.

Remove the stop words from documents.

Partition D dataset into T threads, each thread. handles D/T number of documents.

Perform stemming for all threads.

Entropy feature selection.

Testing:
Following are the steps perform by each core to calculate K nearest neighbour.

Calculate term frequency TF.

Calculate inverse term frequency IDF, IDFi = log( N/dfi )

Calculate TFxIDF document vector, it also called as a weight,
Wi,j = tfij X IDFi

Calculate document vector length
m
Table 1 Technical Specification (NVIDIA GeForce 520M)
Technical specifications
Compute capability (version) 2.X
Maximum x, y, or zdimension of a grid of
thread blocks
65535
Maximum dimensionality of thread block
3
Maximum number of threads per block
1024
Warp size
32
Maximum number of resident blocks per
multiprocessor
8
Number of 32bit registers per multiprocessor
32 K
Maximum number of 32bit registers per
thread
64
Maximum amount of shared memory per
multiprocessor
48 KB
Number of shared memory banks
32
Amount of local memory per thread
512
Constant memory size
64 KB
VI. DOCUMENT COLLECTION
In our implementation, we selected five categories with high number of documents. There are 2000 text documents. These text documents are collected from the internet. All documents are belongs to the only that five categories. The categories are research domains/areas in computer science and engineering stream. These documents are related to those categories. The classes or categories are Artificial Intelligence, Text mining, Networking, Data mining, Image Processing. There are twelve hundred documents used for training purpose and other eight hundred documents are used for testing.
We experimented, this data set for several different values of k. The accuracy of the classifier is depends on the
 dj 
w2i. j i1
value of k. we experimented for k values with two to twenty and results are improved with greater value of k. Accuracy can
m

Calculate document cosine similarities, using
wij.wik
be calculated by using one of the most popular metrics in
document classification is precision and recall. Precision (Pr) is defined as a probability that selected document is classified
sim(dj, dk )
dj.dk
 dj . dk 
i0
 dj . dk 
correctly and recall (Re) is defined as a probability that randomly selected document is assigned to the correct category. Mathematical definitions are as follows:

Collect k samples nearer to the test sample.
Pr
TP TP FP
V. E
XPERIMENT AND
RESULTS
Re TP
TP FN
Where TP (true positive) is count of correctly classified
This section contains description of performed experiments on several data sets.
I. Hardware configuration
In our implementation we used a system with Intel core i5 processor @2.30 GHz with windows 7 operation system, nVIDIA GeForce 520M which has 96 CUDA cores and 2 GB RAM.
documents, FP (false positives) is count of documents incorrectly not assigned into category. And combination of the precision and recall is F1 measure, it can be calculated as,
F1 2 Pr Re
Pr Re
Following table shows the efficiency of the algorithm in terms of precision, recall and F1 for each categories or class.
Table 2 Efficiency of document classification for collected dataset
Class 
Pr 
Re 
F1 
AI 
0.271 
0.323 
0.294 
TM 
0.033 
0.064 
0.435 
DM 
0.703 
0.391 
0.502 
IP 
0.865 
0.937 
0.899 
NE 
1.000 
0.869 
0.929 
VII. CONCLUSION
This paper described a document classification algorithm using KNN and implemented on GPU. The parallel processing is a novel technique for scaling up the algorithms. Parallel classification algorithm is developing to take advantage of the increasing availability of multiprocessor. GPU is good for parallel processing operations and faster than CPU. Implementation of document classification reduces time complexity for learning and testing a huge set of documents.
ACKNOWLEDGEMENT
We are thankful to the computer department of Vishwakarma Institute of Technology, Pune for their valuable support.
REFERENCES

Vandana Korde,C Namrata Mahender,"Text Classification And Classifiers: A Survey", International Journal Of Artificial Intelligence & Applications (IJAIA), Vol.3, No.2, March 2012.

David B. Kirk and Wenmei W. Hwu,"Programming Massively Parallel Processors A Handson Approach"

Jan Platos, Vaclav snasel, Tomas Jezowicz, Pavel, Ajith Abraham "A PSOBased Document Classification Algorithm accelerated by the CUDA Plateform" in IEEE International Conference on Systems, Man and Cybernetics, October 1417, 2012, COEX, Seoul, Korea.

Han Xiao "Towards Parallel and Distributed Computing in LargeScale Data Mining: A Survey" April 8, 2010.

M. Connor and P. Kumar. Parallel construction of knearest neighbour graphs for point clouds. In Eurographics Symposium on PointBased Graphics, 2008.

Jesse St. Charles, Robert M. Patton, Thomas E. Potok, And Xiaohui Cui "FlockingBased Document Clustering On The Graphics Processing Unit" published in U.S. Department of Energy Journal of Undergraduate Research, 2009.

ThanhNghi Do, VanHoa Nguyen, and FranÃ§ois Poulet "Speed Up SVM Algorithm for Massive Classification Tasks", ADMA 2008, LNAI 5139, pp. 147157, 2008.

Joseph M. Cavanagh, Thomas E. Potok, Xiaohui Cui "Parallel Latent Semantic Analysis using a Graphics Processing Unit", GECCO09, July 812, 2009, MontrÃ©al QuÃ©bec, Canada.

Yongpeng Zhang, Frank Mueller, Xiaohui Cui and Thomas Potok "GPUAccelerated Text Mining" in EPHAM09, March 2225, 2009, Seattle, Washington.

Bryan Christopher Catanzaro, Narayanan Sundaram and Kurt Keutzer "Fast Support Vector Machine Training and Classification on Graphics Processors" UCB/EECS200811.

T.L. Griffiths, M. Steyvers, D.M. Blei, and J.B. Tenenbaum. Integrating topics and syntax. Advances in neural information processing systems, 17:537544, 2005.

S. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for SmithWaterman sequence alignment. BMC bioinformatics, 9(Suppl 2):S10, 2008.

D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent dirichlet allocation. Advances in Neural Information Processing Systems, 20:10811088, 2007.

Y.Wang, H. Bai, M. Stanton,W.Y. Chen, and E.Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for LargeScale Applications. AAIM, June, 2009.

Reza Farivar, Daniel Rebolledo, Ellick Chan, Roy Campbell "A Parallel Implementation of KMeans Clustering on GPUs", 2009.

Erik Lindholm, John Nickolls, Stuart Oberman Nvidia Tesla:Aunified Graphics And Computing Architecture Published by the IEEE Computer Society.02721732/2008 IEEE.

J. Montrym and H. Moreton, The GeForce 6800, IEEE Micro, vol. 25, no. 2, Mar./ Apr. 2005, pp. 4151.

J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proceedings of the International Conference on Very Large Data Bases, pages 544555. Citeseer, 1996.

B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th international conference on Machine learning, pages 104111. ACM, 2008.

E.Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. Psvm: Parallelizing support vector machines on distributed computers. Advances in Neural Information Processing Systems, 20, 2007.

R. Jin and G. Agrawal. A middleware for developing parallel data mining implementations. In Proceedings of the first SIAM conference on Data Mining. Citeseer, 2001.

Wei Zhanga, Feng Gao,"An Improvement to Naive Bayes for Text Classification", doi : 10.1016 / j.proeng . 2011 . 08 . 404.

Ms. Anjali Ganesh Jivani,"A Comparative Study of Stemming Algorithms",IJCTA  NOVDEC 2011.

Y. Saeys, I. Inza, and P. Larraaga, A review of feature selection techniques in bioinformatics, Bioinformatics, vol. 23, no, 19, pp. 2507 2517, 2007.

Y. Yang and J. Pedersen, Feature selection in statistical learning of text categorization, Morgan Kaufmann, 1997, pp, 412420.