Feature Clustering algorithms for text classification-Novel Techniques and Reviews

Bhagyasri. A; P. Praneetha; I.Kali Pradeep; G. Ravindra Bharathi

doi:10.17577/IJERTV2IS100122

Volume 02, Issue 10 (October 2013)

Feature Clustering algorithms for text classification-Novel Techniques and Reviews

DOI : 10.17577/IJERTV2IS100122

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 76
Total Downloads : 451
Authors : Bhagyasri. A, P. Praneetha, I.Kali Pradeep, G. Ravindra Bharathi
Paper ID : IJERTV2IS100122
Volume & Issue : Volume 02, Issue 10 (October 2013)
Published (First Online): 03-10-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Feature Clustering algorithms for text classification-Novel Techniques and Reviews

Bhagyasri. A	P. Praneetha	I.Kali Pradeep
M.Tech Scholar	Assistant Professor	Assistant Professor
Department of CSE	Department of CSE	Department of CSE
SCET, Narasapur, AP, India	SCET, Narasapur, AP, India	VITB, Bhimavaram

G. Ravindra Bharathi Assistant Professor Department of CSE VITB, Bhimavaram

Abstract

Classifying text is a challenging technique. The dimensionality of feature vectors is very huge and number of researchers has found many techniques to reduce the number of dimensions of feature vectors in text classification. In this paper, some of the important techniques for text classification have been reviewed and novel parameters using fuzzy set approach have been discussed in detail. Here, the words are grouped into clusters based on degree of similarity. Mean and standard deviation are considered as membership function for each cluster. At last features are extracted from each cluster. The extracted feature equals to the weighted combination of words in each cluster. Moreover, there is no need to specify the number of cluster in advance by trial and error method.

Introduction

Normally, the number of dimensions in feature vector is very large. So, Feature reduction techniques are used for reducing the amount of time and complexity for text classification. The number of features can be reduced by either feature selection or feature extraction. In feature selection, the important dimensions are considered, whereas in feature extraction new dimensions are identified which are more effective than the original dimensions. In practice it has been found that feature extraction is more effective than the feature selection. The feature extraction techniques are basically classified into linear and non linear transformations. Linear transformation includes techniques like, Maximum Margin Criterion, Principal Component Analysis, and

Linear Discriminant Analysis etc. and non-linear transformations include Laplacian Eigen maps, Local Linear Embedding etc

The main applications of text classification are:
- News: Electronic News articles are generated very frequently. Manual classification of these articles is a very difficult. So, automated methods are useful in this case. This application is known as text filtering.
- Digital libraries: A variety of supervised
  
  methods may be used for document organization in domains like digital libraries, web collections and scientific literature. .
- Feedback Mining: Customer Feedback of many products are given on many review
  
  sites, E-commerce sites etc. Text classification algorithms are used by companies to review their products automatically.
- E mail classification and spam filtering: In these days E mail service providers can
  
  identify a mail as spam by certain words and patterns. And mails are also classified as promotion, social networking and important based on certain word clustering techniques
Feature clustering

Feature clustering is one of the famous techniques of feature extraction, where similar features are grouped into a cluster and every cluster is considered as a new feature. Many Techniques were introduced in history. Some of the techniques for text classification through feature clustering are discussed below.

The first feature extraction method based on

feature clustering was proposed by Baker and McCallum[5] which were derived from the distributional clustering. In this approach words are clustered into groups based on the distribution of class labels associated with each word. This method compresses feature space well and maintains high document classification accuracy. But the main disadvantage of this approach is hard clustering.

The Agglomerative Information Bottleneck approach was proposed by Tishby et al [6]. In this approach, text categorization is done by combining distributional clustering of words and a Support Vector Machine. This word-cluster representation is computed using Information Bottleneck method, which generates a compact and efficient representation of documents. . When combined with the classification power of the Support Vector Machine, this method yields high performance in text categorization.

The divisive information-theoretic feature clustering algorithm was proposed by Dhillon et al [9]. In comparison to agglomerative strategies this approach was much faster and achieves comparable or higher classification accuracy. But, this approach uses hard clustering, where a word is assigned strictly to a single subset. Mean and variance were not computed when similarities between the clusters are considered. The main disadvantage of this method is to consider number of new feature which are specified by the user in advance.
Co-Clustering

In co-clustering, both feature clustering and document clustering is done in a two stage procedure. Slonim and tishby[8] proposed a double clustering algorithm in year 2000. In this approach, at the first stage feature clustering generates coarser pseudo features, which reduce noise and sparseness that might be exhibited in the original feature space. Then, in the second stage, documents are clustered as distributions over the distilled pseudo features, and therefore generate more accurate document clusters.

The iterative double clustering algorithm was proposed by (Yaniv & Souroujon). This approach is an extension of simple double clustering algorithm. But, Number of iterations is made to improve clustering quality. This high data is achieved due to the generation of progressively less noisy data representations

On the other hand Dhillon[7], proposed an information-theoretic co clustering algorithm that intertwines both row (feature) and column (document)

clustering. . Here, the row-clustering incorporates column-clustering information and vice versa. The algorithm iterates until it almost accurately reconstructs the original distribution discovers the natural row and column partitions and recovers the ideal compressed distribution. This method is especially useful when the data is sparse.
Fuzzy Approach

In the above approaches we need to specify the number of clusters in advance. And these methods are based on hard clustering concept(sharp cutoffs). In fuzzy approach, Rather than having a precise cutoff between categories or sets, fuzzy logic uses truth values between certain ranges to represent the degree of membership that a certain value has in a given category. Fuzzy clustering is a class of algorith- ms for cluster analysis in which the allocation of data points to clusters is not "hard" but "fuzzy" .I.e. data elements can belong to more than one cluster, and associated with each element is a set of membership levels. These indicate the strength of the association between that data element and a particular cluster. Fuzzy clustering is a process of assigning these membership levels, and then using them to assign data elements to one or more clusters.

4.1Text classification by self constructing fuzzy approach

As, discussed above there are three disadvantages in the above method

1) The number of features to be extracted should be specified by the user in prior to the clustering. 2) ariance measure is not considered in clustering similarity analysis. 3) Hard clustering.

The detailed process is explained below.

Document set

Read Each Document

Remove stop words and streaming Get Feature Vector

Construct Word Pattern

Figure1. The Initial Processing
Conclusion

Unlike other types of data, Text data have may features. In this paper we have discussed about different types of feature extraction techniques using clustering method. At first, text documents like agglomerative technique, divisive techniques and distributive clustering were discussed. Then, we have reviewed about co-clustering techniques which combine both text and document classification. In the next section we have highlighted fuzzy based clustering approach for classifying the text data. This fuzzy approach considers mean and standard deviation as membership function of each cluster. Moreover, there is no need to specify initial number clusters in advance.
References

Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification IEEE transactions on knowledge and data engineering, vol. 23, no. 3, march 2011
Sainani Arpitha, Dr.P.Raja Prakash Rao Clustering Algorithm for Text Classification Using Fuzzy Logic

International Journal of Advanced Research in Computer Science and Software Engineering.

Volume 2, Issue 8, August 2012
Dinesh Kavuri , Pallikonda Anil Kumar , Doddapaneni Venkata Subba Rao Text and Image Classification using Fuzzy Similarity Based self constructing algorithm international journal of engineering science & advanced technology Volume- 2, Issue-6, pg:1572 1576
A.Kavitha, Y.Sowjanya Kumari, Dr.P.Harini A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification IOSR Journal of Engineering Volume 2, Issue 9 (September 2012), Pg 36-44.
L.D. Baker and A. McCallum, Distributional Clustering of Words for Text Classification, Proc. ACM SIGIR, pp. 96-103, 1998.
R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, Distributional Word Clusters versus Words for Text Categorization, J. Machine Learning Research, vol. 3, pp. 1183-1208, 2003.
Inderjit S. Dhillon Co-clustering documents and words using Bipartite Spectral Graph Partitioning KDD 2001 San Francisco, California.
Noam Slonim, Naftali Tishby Document clustering using word clusters via the information

bottleneck method Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval Pages 208- 215
I.S. Dhillon, S. Mallela, and R. Kumar, A Divisive Infomation-Theoretic Feature Clustering Algorithm for Text Classification,J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
Ran El-yaniv ad Oren Souroujon Iterative double clustering for unsupervised and semi-supervised learning in In Advances in Neural Information Processing Systems conference.
A. Krishna Mohan, V.V.Narasimha Rao, MHM Krishna Prasad A Novel Fuzzy Based Clustering Algorithm for Text Classification IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.5.

[12]T. Joachims, Text Categorization with Support Vector Machine: Learning with Many Relevant Features, Technical Report LS-8- 23, Univ. of Dortmund, 1998.

Volume 02, Issue 10 (October 2013)

Feature Clustering algorithms for text classification-Novel Techniques and Reviews

Feature Clustering algorithms for text classification-Novel Techniques and Reviews

Word Pattern construction: This is mainly the starting step of the process. Consider,

Automatic cluster formation: The two main advantages of this approach is 1) No initial number of clusters need to be specified in advance. And 2) Self Constructing I.e., Word patterns are considered one after another.

Hard clustering: In hard clustering each word pattern is exactly in one cluster.

Soft Clustering: In soft clustering every

Mixed Clustering: In mixed clustering, is a combination of hard and soft clustering, where user threshold is used to control its degree of hardness or softness.

Leave a Reply