Feature Selection Techniques for High Dimensional Data in Machine Learning

DOI : 10.17577/IJERTV2IS90912


M. Sujatha 1, Dr. G. Lavanya Devi 2

Department of Computer Science and Systems Engineering, AU College of Engineering, Andhra University, Visakhapatnam, Andhra Pradesh, India

Abstract

In machine learning, the classification task is normally known as supervised learning. In supervised learning, a specified set of classes is given and the training objects are labeled with the correct class. The goal is to generalize from the training objects so that novel objects can be identified as belonging to one of the classes. Evaluating the performance of learning algorithms is a fundamental aspect of machine learning. The main objective of this paper is to study classification accuracy when feature selection is combined with machine learning algorithms. By applying feature selection, the dimensionality of the data is reduced and the accuracy of a learning algorithm is improved; the learner then operates on the filtered feature space, that is, the condensed feature space provided by a filter. The advantages of feature selection for learning can include a reduction in the amount of data required to achieve learning, improved predictive accuracy, learned knowledge that is more compact and easily comprehended, and reduced execution time.

  1. Introduction

Data mining uses data analysis tools to discover previously unknown, valid patterns and relationships in large amounts of data stored in databases, data warehouses, or other information repositories. Data mining has two methods. The first attempts to produce an overall summary of a set of data in order to identify and describe its relevant features. The second, pattern detection, tries to find small, unusual patterns of behaviour. Data mining analysis tasks are generally grouped into data summarization, classification, prediction, segmentation and dependency analysis.

Machine learning provides tools by which large quantities of data can be analyzed automatically. An important concept in machine learning is feature selection. Feature selection for clustering or classification tasks can be carried out on the basis of correlation between relevant features, and the feature selection process can be helpful to a variety of common machine learning algorithms. The feature selector is simple and fast to execute. It eliminates irrelevant and redundant data and, in several cases, improves the performance of learning algorithms.

  2. Feature Selection

Feature selection is common in machine learning. It is also termed feature subset selection, variable selection or attribute reduction. Feature selection is the process of selecting the input attributes of a data set that most closely define a particular outcome. The three basic steps of this system are:

• In the first step, the irrelevant features are removed.

• After that, the redundant features are removed.

• Finally, a feature selection algorithm is applied to the remaining features.

In this approach each step works as a filter that reduces the number of candidate features, until finally only a small subset remains.

The first filter removes irrelevant features using the Relief algorithm, which assigns relevance values to features from training samples in the feature space. There are several modifications to Relief that generalize it to continuous features and make it more robust in the presence of noise. This system adopts Kononenko's modifications, and modifies Relief again to remove a bias against non-monotonic features, as described in [1].

Figure 1. Filter feature selection.

Within this feature selection system, Relief is used as a relevance filter. The relevance values are therefore thresholded to divide the feature set into relevant and irrelevant features. This can be done either by thresholding the relevance value directly, or by selecting the highest n values and discarding the remaining features. In either case, Relief does not detect redundancy, so the remaining feature set still contains redundant features. The second step is a redundancy filter that uses the K-means algorithm [2] to cluster features according to how well they correlate with each other. When feature clusters are discovered, only the feature with the highest Relief score is kept; the other features in the cluster are removed from the feature set. This is an unusual application of K-means clustering, in that features are clustered (instead of samples), and correlation is used as the distance measure. The third and final filter is a combinatorial feature selection algorithm. The following algorithms are used to perform the operations mentioned above; a brief sketch of the overall filter pipeline is given first.
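The following is a minimal sketch of the three-stage filter pipeline, not the authors' exact implementation. It assumes a numeric feature matrix X and class vector y; the Relief relevance scores are approximated here by mutual information purely for illustration, and the function and parameter names (filter_pipeline, relevance_threshold, n_clusters) are hypothetical. On z-scored feature columns, Euclidean distance is a monotone function of (1 - correlation), so K-means on the transposed, standardised matrix approximates the correlation-based feature clustering described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import mutual_info_classif

def filter_pipeline(X, y, relevance_threshold=0.01, n_clusters=5):
    """Relevance filter -> redundancy filter -> reduced feature set (sketch)."""
    # Stage 1: relevance filter -- keep features whose score clears a threshold.
    # (Mutual information stands in for Relief scores here.)
    scores = mutual_info_classif(X, y)
    relevant = np.where(scores > relevance_threshold)[0]
    if len(relevant) == 0:
        return []

    # Stage 2: redundancy filter -- cluster the surviving feature columns.
    Z = X[:, relevant]
    Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)   # z-score each feature
    k = min(n_clusters, len(relevant))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z.T)

    # Keep only the highest-scoring feature from each cluster.
    selected = []
    for c in range(k):
        members = relevant[labels == c]
        if members.size:
            selected.append(int(members[np.argmax(scores[members])]))

    # Stage 3 (a combinatorial search over the survivors) would follow here.
    return sorted(selected)
```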

2.1. ReliefF Algorithm

ReliefF (RFF) is an extension of the Relief algorithm, which was proposed by Kira and Rendell. Relief is an easy-to-use, fast and accurate algorithm, even with dependent features and noisy data [3]. The Relief algorithm evaluates the ability of an attribute to distinguish between instances that are near each other. The process of ranking the features in Relief follows three basic steps:

          • Compute the nearest miss and nearest hit.

          • Compute the weight of a feature.

          • Compute a ranked list of features or the top k features according to a given threshold.

The basic idea of ReliefF is to sample instances at random, compute their nearest neighbors, and adjust a feature weighting vector to give more weight to features that differentiate the instance from neighbors of different classes [4]. It has also been improved to deal with noisy data and can be used for regression problems.
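The following is a minimal numpy sketch of the basic two-class Relief weight update outlined above; it is not the full ReliefF, which averages over k nearest neighbors and handles multiple classes and missing values. The function name and parameters are illustrative.

```python
import numpy as np

def relief_scores(X, y, n_iter=100, seed=None):
    # Simplified two-class Relief: for a randomly drawn instance, find its
    # nearest hit (same class) and nearest miss (other class) and update each
    # feature weight by w += |x - miss| - |x - hit|, using range-normalised
    # per-feature differences.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        diff = np.abs(X - X[i]) / span          # normalised feature differences
        dist = diff.sum(axis=1)
        dist[i] = np.inf                        # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        w += diff[miss] - diff[hit]
    return w / n_iter
```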

2.2. OneR Attribute Evaluation (OneR)

One approach to generating classification rules is to use decision trees; the drawback of a decision tree is that it can be complex and incomprehensible [5]. A classification rule can be defined as r = (p, q), where p is a precondition that performs a sequence of tests that can be evaluated as true or false, and q is the class that is assigned to instances covered by rule r. A rule based algorithm generally tries to cover all instances belonging to a class q at a given time. OneR is the simplest approach to finding a classification rule, as it creates a one-level decision tree. OneR constructs rules by testing a single attribute at a time and branching on every value of that attribute. For every branch, the chosen class is the one occurring most often in the training data.
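A minimal sketch of the OneR procedure described above, assuming discretised attributes; the function name is illustrative.

```python
import numpy as np
from collections import Counter

def one_r(X_discrete, y):
    """Pick the attribute whose one-level rule set (one branch per attribute
    value, majority class per branch) makes the fewest training errors."""
    best_attr, best_rules, best_errors = None, None, np.inf
    n_samples, n_attrs = X_discrete.shape
    for a in range(n_attrs):
        rules, errors = {}, 0
        for v in np.unique(X_discrete[:, a]):
            classes = y[X_discrete[:, a] == v]
            majority, count = Counter(classes).most_common(1)[0]
            rules[v] = majority                  # branch value -> majority class
            errors += len(classes) - count       # instances the rule misclassifies
        if errors < best_errors:
            best_attr, best_rules, best_errors = a, rules, errors
    return best_attr, best_rules, best_errors / n_samples
```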

2.3. Information Gain (IG)

Information Gain is based on the concept of entropy. The expected value of information gain is the mutual information of the target variable and the independent variable; it is the reduction in entropy of the target variable achieved by learning the state of the independent variable [6]. A major problem with information gain is that it tends to choose attributes with large numbers of discrete values over attributes with fewer values, even though the latter may be more informative. To estimate information gain, consider an attribute X and a class attribute Y. The information gain of a given attribute X with respect to the class attribute Y is the reduction in uncertainty about the value of Y when the value of X is known. The uncertainty about the value of Y is measured by its entropy, H(Y) [6]. The uncertainty about Y given the value of X is the conditional entropy of Y given X, H(Y|X).

The information gain is given by:

I(Y;X) = H(Y) - H(Y|X)    (2.1)

where Y and X are discrete variables that take values in {y_1, ..., y_k} and {x_1, ..., x_l}. The entropy of Y is given by:

H(Y) = -\sum_{i=1}^{k} P(Y = y_i) \log_2 P(Y = y_i)    (2.2)

The conditional entropy of Y given X is:

H(Y|X) = \sum_{j=1}^{l} P(X = x_j) H(Y | X = x_j)    (2.3)

Alternatively, the information gain is given by:

I(Y;X) = H(X) + H(Y) - H(X,Y)    (2.4)

where H(X,Y) is the joint entropy of X and Y:

H(X,Y) = -\sum_{i=1}^{k} \sum_{j=1}^{l} P(X = x_i, Y = y_j) \log_2 P(X = x_i, Y = y_j)    (2.5)

When the predictive variable X is not discrete but continuous, the information gain of X with class attribute Y is computed by considering all possible binary attributes X_\theta that arise from X when we choose a threshold \theta on X [6]. \theta takes values from all the values of X. The information gain is then simply:

I(Y;X) = \max_{X_\theta} I(Y; X_\theta)    (2.6)

2.4. Gain Ratio (GR)

Information gain selects attributes having a large number of possible values over attributes with fewer values, even though the latter may be more informative [7]. For example, consider an attribute that acts as a unique identifier, such as a customer-id in a bank database. A split on customer-id generates a large number of partitions, since each record in the database has a unique value for customer-id, so the information required to classify the database with this partitioning would be Info_{customer-id}(D) = 0. Clearly, such a partition is useless for classification. C4.5, a successor of ID3 [8], uses an extension of information gain known as gain ratio (GR), which attempts to overcome this bias.

Let D be a set consisting of d data samples with n distinct classes. The expected information needed to classify a given sample is given by

I(D) = -\sum_{i=1}^{n} p_i \log_2 p_i    (2.7)

where p_i is the probability that an arbitrary sample belongs to class C_i. Let attribute A have v distinct values, and let d_{ij} be the number of samples of class C_i in the subset D_j, where D_j contains those samples in D that have value a_j of A. The entropy based on partitioning into subsets by A is given by

E(A) = \sum_{j=1}^{v} \frac{d_{1j} + d_{2j} + \dots + d_{nj}}{d} \, I(D_j)    (2.8)

The encoding information that would be gained by branching on A is

Gain(A) = I(D) - E(A)    (2.9)

C4.5 applies a kind of normalization to information gain, using a split information value defined analogously with Info(D) as

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}    (2.10)

This value represents the information obtained by splitting the dataset D into v partitions, corresponding to the v outcomes of a test on attribute A [7]. For each possible outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. The gain ratio is defined as

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}    (2.11)

The attribute with the maximum gain ratio is selected as the splitting attribute.

2.5. Symmetrical Uncertainty

Symmetrical uncertainty evaluates the degree of association between discrete features. It is derived from entropy [17]. Correlation based feature selection is the basis for symmetrical uncertainty (SU). Correlation based feature selection estimates the merit of a feature subset using the hypothesis that good feature subsets contain features highly correlated with the class, yet uncorrelated with each other [18]. SU is a symmetric measure and can also be used to compute feature-feature correlation:

SU = 2.0 \times \frac{H(X) + H(Y) - H(X,Y)}{H(Y) + H(X)}    (2.12)

Symmetrical uncertainty is computed by the above equation, where H(X) and H(Y) represent the entropy of features X and Y. The value of symmetrical uncertainty lies between 0 and 1. A value of 1 indicates that one variable (either X or Y) completely predicts the other variable [18]; a value of 0 indicates that the two variables are totally independent.
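The following is a minimal numpy sketch of equations (2.2)-(2.12) for discrete variables; function names are illustrative, and split information is computed as the entropy of the attribute's value distribution, which is equivalent to (2.10).

```python
import numpy as np

def entropy(values):
    # H(Y) = -sum_i P(Y = y_i) log2 P(Y = y_i), eqs. (2.2) and (2.7)
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(x, y):
    # H(X, Y) over the joint distribution of two discrete variables, eq. (2.5)
    pairs = np.array([f"{a}|{b}" for a, b in zip(x, y)])
    return entropy(pairs)

def information_gain(x, y):
    # I(Y; X) = H(X) + H(Y) - H(X, Y), eq. (2.4)
    return entropy(x) + entropy(y) - joint_entropy(x, y)

def gain_ratio(x, y):
    # GainRatio(A) = Gain(A) / SplitInfo_A(D), eq. (2.11)
    si = entropy(x)                       # SplitInfo, eq. (2.10)
    return information_gain(x, y) / si if si > 0 else 0.0

def symmetrical_uncertainty(x, y):
    # SU = 2 * [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)], eq. (2.12)
    hx, hy = entropy(x), entropy(y)
    denom = hx + hy
    return 2.0 * (hx + hy - joint_entropy(x, y)) / denom if denom > 0 else 0.0

# Toy example: a discrete attribute and a binary class label
x = np.array(["a", "a", "b", "b", "b", "c"])
y = np.array([0, 0, 1, 1, 0, 1])
print(information_gain(x, y), gain_ratio(x, y), symmetrical_uncertainty(x, y))
```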

  3. Classification

Classification is a data mining technique used to predict group membership for data instances. It is one of the important techniques in data mining and is used in various applications such as pattern recognition, customer relationship management, disease diagnosis and targeted marketing [9]. The classifiers considered in this study are described below.

    3.1. K Nearest Neighbor

In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a non-parametric technique for classifying objects based on the closest training instances in the feature space. k-NN is a kind of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, usually small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. A data sample in KNN is thus classified on the basis of a selected number of k nearest neighbors [10]. The KNN assumptions are:

• The data lie in a feature space, so they have a notion of distance; Euclidean distance can be used to compute the distance between vectors.

• Each training vector is associated with a feature vector and a class label.

• k decides how many neighbors influence the classification.
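A minimal sketch of the majority-vote classification step under Euclidean distance, as described above; the function name is illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Classify one query point by majority vote among its k nearest
    # training instances under Euclidean distance.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    neighbors = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in neighbors)
    return votes.most_common(1)[0][0]
```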

3.2. Naïve Bayes

A Naïve Bayes (NB) classifier is a simple probabilistic classifier based on Bayes' theorem, where every feature is assumed to be class-conditionally independent [11]. In naïve Bayes learning, each instance is described by a set of features and takes a class value from a predefined set of values. Classification of instances becomes complex when the dataset contains a large number of features and classes, because it takes huge numbers of observations to estimate the probabilities [11]. When a feature is class-conditionally independent, the value of that variable for a given class is independent of the values of the other variables.
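A minimal sketch of a naïve Bayes classifier for discrete features, assuming Laplace smoothing; the function names and the smoothing parameter alpha are illustrative and not taken from the paper.

```python
import numpy as np

def train_naive_bayes(X_discrete, y, alpha=1.0):
    # Estimate P(class) and P(feature value | class) with Laplace smoothing.
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    likelihoods = {}                      # (feature index, class) -> {value: prob}
    for j in range(X_discrete.shape[1]):
        values = np.unique(X_discrete[:, j])
        for c in classes:
            col = X_discrete[y == c, j]
            likelihoods[(j, c)] = {
                v: (np.sum(col == v) + alpha) / (len(col) + alpha * len(values))
                for v in values
            }
    return classes, priors, likelihoods

def predict_naive_bayes(model, x):
    # Choose the class maximising log P(c) + sum_j log P(x_j | c),
    # which follows from the class-conditional independence assumption.
    classes, priors, likelihoods = model
    def log_posterior(c):
        return np.log(priors[c]) + sum(
            np.log(likelihoods[(j, c)].get(x[j], 1e-9)) for j in range(len(x))
        )
    return max(classes, key=log_posterior)
```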

3.3. Support Vector Machine

A support vector machine (SVM) finds a hyperplane that separates two different sets of samples with the maximum distance from the hyperplane to the nearest samples of both sets [12]. The formula for the output of a linear SVM is

u = \vec{w} \cdot \vec{x} - b    (3.1)

where w is the normal vector to the hyperplane and x is the input vector. The nearest points lie on the planes u = \pm 1, and the distance from the hyperplane to these points is

d = \frac{1}{\|\vec{w}\|_2}    (3.2)

The maximum-margin hyperplane can be found by solving the optimization problem

\min_{\vec{w}, b} \; \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1 \;\; \forall i    (3.3)

where x_i is the i-th training sample and y_i is the correct output of the SVM for the i-th training sample. The value y_i is +1 for the positive samples and -1 for the negative samples.

Figure 2. Support Vector Machine.

Sequential Minimal Optimization (SMO) is an algorithm that solves the quadratic programming (QP) problem arising in support vector machine training without requiring extra matrix storage [12]. SMO divides the complete QP problem into the smallest possible QP sub-problems at every step using Osuna's theorem. At each step, SMO finds the optimum values of two Lagrange multipliers and updates the SVM to reflect the new optimum values [12].
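The following is a brief usage sketch rather than an implementation of SMO itself; it assumes scikit-learn, whose SVC estimator is trained with an SMO-style decomposition solver. The toy data and variable names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two Gaussian blobs labelled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# For a linear kernel the decision function is w.x + intercept_, which matches
# the form of eq. (3.1) with b = -intercept_.
w, b = clf.coef_[0], -clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w))
print(clf.predict([[0.0, 3.0], [-1.0, -1.0]]))
```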

3.4. Random Forest

Random forests (RF) are a recursive partitioning method that merges a collection of trees, called an ensemble. Random forests [13] are a group of identically distributed trees whose class value is determined by a variant of majority voting. The classifier consists of a collection of tree-like classifiers built from a large number of decision trees, all of which are trained on the same problem.

3.5. C4.5

C4.5 is a variant and extension of the ID3 decision tree algorithm [14]. It is based on the concept of a decision tree. A decision tree is a hierarchical collection of rules that describe how to divide a large collection of data into groups based on the regularities of the data [15]. It is a graphical structure used for regression, classification, prediction and clustering. The objective of a decision tree is to construct an accurate classifier and to build up understandable patterns that can be interpreted as interesting knowledge. C4.5 contains three types of tests: a standard test, a complex test and a binary test. All tests are based on a discrete attribute and are evaluated using gain ratio.

4. Evaluation Criteria

A classification algorithm is a function that, given a set of training samples and their classes, constructs a classifier. A classifier is a function that, given an instance, assigns it to one of the predefined classes. A variety of classifiers have been developed, and the main question that arises in their development and application concerns the accuracy of the classifiers they produce. We use AUC as the evaluation criterion in this work. AUC is an acronym for Area Under the Receiver Operating Characteristic Curve [16]. An ROC graph is a technique for visualizing, organizing and selecting classifiers based on their performance. Given a classifier and an instance, there are four possible outcomes for the instance. If the instance is positive and it is classified as positive, it is counted as a true positive (TP); if it is classified as negative, it is counted as a false negative (FN). If the instance is negative and it is classified as negative, it is counted as a true negative (TN); if it is classified as positive, it is counted as a false positive (FP). Considering a whole training set, we can build a confusion matrix from this methodology [16]:

\text{confusion matrix} = \begin{pmatrix} TP & FP \\ FN & TN \end{pmatrix}    (4.1)

The main diagonal (upper left to lower right) of the confusion matrix represents the correct decisions made, and the off-diagonal elements (upper right to lower left) represent the errors. The true positive rate of a classifier can be estimated as

TP\ rate = \frac{\text{Positives correctly classified}}{\text{Total positives}}    (4.2)

The false positive rate can be defined as

FP\ rate = \frac{\text{Negatives incorrectly classified}}{\text{Total negatives}}    (4.3)

An ROC graph depicts relative trade-offs between true positives and false positives. To find a clear dominating relation between two ROC curves we use AUC, which provides a single-number summary of the performance of learning algorithms.
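A minimal sketch of equations (4.1)-(4.3) and the AUC summary, assuming scikit-learn; the toy labels and scores are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred  = np.array([1, 0, 0, 1, 0, 1, 1, 0])                 # hard class decisions
y_score = np.array([0.9, 0.4, 0.2, 0.8, 0.3, 0.6, 0.7, 0.1])  # ranking scores

# scikit-learn returns the matrix as [[TN, FP], [FN, TP]] for labels (0, 1);
# the entries correspond to the (TP, FP / FN, TN) layout of eq. (4.1).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tp_rate = tp / (tp + fn)               # eq. (4.2)
fp_rate = fp / (fp + tn)               # eq. (4.3)
auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
print(tp_rate, fp_rate, auc)
```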

We applied the feature ranking techniques discussed above (GR, RFF, SU, OneR and IG), together with an ensemble ranker, to the Lung Cancer dataset. We selected the top k (k = 5) feature subsets for the experiments. After feature selection, we used five learners, KNN, C4.5, NB, RF and SVM, to build classification models on the datasets with the various selected subsets of features. The classification models are evaluated in terms of the AUC performance metric. The results of the experiments are displayed in Tables 1, 2 and 3. Each value in a table is determined by the row (ranker) and the column (learner) in which it is located, and also depends on the value of k used for that table. The AUC value for a table entry is computed in three steps:

1. Identify the row and column for which the AUC needs to be calculated. This selects a ranker and a learner.

2. Apply the ranker to the dataset to obtain the ranking list, and select the top k features from that list. The value of k is the one used for the table in question.

3. Build a classification model using the dataset with the features selected in the previous step and record its AUC; a sketch of this procedure is given below.
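The following sketch illustrates steps 2 and 3 under stated assumptions: it uses scikit-learn, approximates information gain by mutual information, uses an entropy-based decision tree as a stand-in for C4.5, and estimates AUC by cross-validation on a binary class label, since the paper's exact evaluation setup and dataset preprocessing are not reproduced here.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def rank_and_evaluate(X, y, k=5):
    # Step 2: rank features and keep the top-k columns.
    scores = mutual_info_classif(X, y)
    top_k = np.argsort(scores)[::-1][:k]
    X_sel = X[:, top_k]

    # Step 3: build each learner on the reduced dataset and estimate AUC.
    learners = {
        "KNN":  KNeighborsClassifier(n_neighbors=3),
        "C4.5": DecisionTreeClassifier(criterion="entropy"),  # stand-in for C4.5
        "NB":   GaussianNB(),
        "SVM":  SVC(kernel="linear", probability=True),
        "RF":   RandomForestClassifier(n_estimators=100, random_state=0),
    }
    return {
        name: cross_val_score(model, X_sel, y, cv=3, scoring="roc_auc").mean()
        for name, model in learners.items()
    }
```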

Table 1. AUC values for rankers with the top five features for the lung cancer dataset.

Rankers   KNN     C4.5    NB      SVM     RF      AVG
RFF       0.844   0.769   0.831   0.835   0.848   0.825
OneR      0.758   0.757   0.781   0.773   0.764   0.766
IG        0.877   0.743   0.827   0.821   0.835   0.820
SU        0.877   0.744   0.827   0.821   0.835   0.820
GR        0.735   0.749   0.784   0.773   0.739   0.756

Table 2. Average model performances for classifiers using the lung cancer dataset.

Classifier   AUC
KNN          0.7762
C4.5         0.7823
NB           0.8423
RF           0.7860
SVM          0.7633

Table 3. Average model performances for rankers using the lung cancer dataset.

Ranker   AUC
SU       0.7871
OneR     0.7397
GR       0.7598
RFF      0.7887
IG       0.7834

Tables 1, 2 and 3 summarize the classification performance in terms of AUC for the five selected rankers and the ensemble method with the top k features; the tables also reflect model performance on the base dataset. The experiments can be summarized in terms of the size of the feature subset, the classifiers and the rankers. Table 1 shows that selecting the top 5 feature subset generates the highest classification accuracy; NB has the highest classification accuracy among the classifiers, while LR performed worst. Tables 1, 2 and 3 show that the ensemble ranker performed best among the rankers in terms of the AUC performance metric, while OneR performed worst. We also compared the results obtained with the subset of features against the results obtained with the complete set of features (base dataset). We found that classification performance improved even after a significant number of features were removed from the original dataset. This demonstrates that feature selection was successfully applied to the Lung Cancer dataset.

  5. Conclusion

In this paper, we have reviewed feature selection and explained the basic concepts of the different feature selection models: filter, wrapper and hybrid. We reviewed four filter based feature ranking techniques and one wrapper based feature ranking technique: information gain, gain ratio, symmetrical uncertainty, ReliefF and OneR attribute evaluation. We examined classification models built using various classification techniques such as naïve Bayes, k-nearest neighbor, random forest, support vector machine, regression and decision trees. We also gave a brief review of the evaluation criteria used to assess the classification models.

Feature selection chooses a subset of the input variables by eliminating features that are irrelevant or carry no predictive information. Relief is an optimization algorithm that is used to improve the quality of feature selection.

References

[1] S. R. Jadhav and P. Kumbargoudar, "Multimedia Data Mining in Digital Libraries: Standards and Features," Proceedings of the Conference on Recent Advances in Information Science and Technology (READIT 2007), organized by the Madras Library Association, Kalpakkam Chapter & Scientific Information Resource Division, Indira Gandhi Centre for Atomic Research, Department of Atomic Energy, Kalpakkam, Tamil Nadu, India, 12-13 July 2007, pages 54-59.

[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann, San Francisco, 2005.

[3] H. Wang, T. M. Khoshgoftaar and K. Gao, "Ensemble feature selection technique for software quality classification," Proceedings of the 22nd International Conference on Software Engineering & Knowledge Engineering, Redwood City, San Francisco Bay, CA, USA, 1-3 July 2010, pages 215-220.

[4] Y. Wang and F. Makedon, "Application of ReliefF feature filtering algorithm to selecting informative genes for cancer classification using microarray data," IEEE Computational Systems Bioinformatics Conference, 2004, pages 497-498.

[5] S. Pulatova, "Covering (rule-based) algorithms," Lecture Notes in Data Mining, World Scientific Publishing Co., 2006, pages 87-97.

[6] W. Altidor, T. M. Khoshgoftaar, J. Van Hulse and A. Napolitano, "Ensemble feature ranking methods for data intensive computing applications," Handbook of Data Intensive Computing, Springer Science+Business Media, 2011, pages 349-376.

[7] Asha G. K., A. S. Manjunath and M. A. Jayaram, "A Comparative Study of Attribute Selection Using Gain Ratio and Correlation Based Feature Selection," International Journal of Information Technology and Knowledge Management, Volume 2, July-December 2012, pages 271-277.

[8] J. R. Quinlan, "Induction of decision trees," Machine Learning, Kluwer Academic Publishers, 1986, pages 81-106.

[9] M. W. Kim and J. W. Ryu, "Optimized Fuzzy Classification for Data Mining," 9th International Conference on Database Systems for Advanced Applications (DASFAA 2004), Y. Lee et al. (Eds.), LNCS 2973, pages 582-593.

[10] H. He, W. Graco and X. Yao, "Application of genetic algorithm and k-nearest neighbour method in medical fraud detection," Second Asia-Pacific Conference on Simulated Evolution and Learning (SEAL 98), Canberra, Australia, Springer, November 1998, pages 74-81.

[11] M. Narasimha Murty and V. Susheela Devi, Pattern Recognition: An Algorithmic Approach, Springer, Chapter 4, pages 86-97.

[12] J. C. Platt, "A Fast Algorithm for Training Support Vector Machines," Technical Report MSR-TR-98-14, Microsoft Research, 21 April 1998.

[13] T. D. Lemmond, B. Y. Chen, A. O. Hatch and W. G. Hanley, "An Extended Study of the Discriminant Random Forest," Data Mining: A Special Issue in Annals of Information Systems, Chapter 6, Springer Science, 2010.

[14] J. R. Quinlan, "Induction of decision trees," Machine Learning, Kluwer Academic Publishers, 1986, pages 81-106.

[15] O. A. Omitaomu, Lecture Notes in Data Mining, Chapter 4, World Scientific Publishing Co., 2006, pages 39-51.

[16] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers," HP Laboratories, 16 March 2004, Kluwer Academic Publishers, pages 1-38.

[17] Y. Chen, Y. Li, X. Cheng and L. Guo, "Survey and Taxonomy of Feature Selection Algorithms in Intrusion Detection System," Inscrypt 2006 (H. Lipmaa, M. Yung and D. Lin, Eds.), LNCS 4318, Springer-Verlag, Berlin, 2006, pages 153-167.

[18] D. Ienco, R. G. Pensa and R. Meo, "Context-based Distance Learning for Categorical Data Clustering," IDA 2009, LNCS 5772, Springer, Berlin, 2009, pages 83-94.

[19] L. Rokach, B. Chizi and O. Maimon, "Feature selection by combining multiple methods," Advances in Web Intelligence and Data Mining, 2006, pages 295-304.

[20] Y. Saeys, T. Abeel and Y. Van de Peer, "Robust feature selection using ensemble feature selection techniques," ECML PKDD 2008 (W. Daelemans et al., Eds.), Part II, LNAI 5212, pages 313-325.

[21] J. T. de Souza, N. Japkowicz and S. Matwin, "StochFS: A framework for combining feature selection outcomes through a stochastic process," Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2005, pages 667-674.

[22] J. O. S. Olsson and D. W. Oard, "Combining feature selectors for text classification," CIKM '06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, New York, NY, USA, 2006, pages 798-799.
