 Open Access
 Total Downloads : 19
 Authors : Anamika Chaudhary, Rajeev K R Singh
 Paper ID : IJERTCONV4IS12015
 Volume & Issue : ETRASECT – 2016 (Volume 4 – Issue 12)
 Published (First Online): 24042018
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Algorithms for Classification and Clustering
Anamika Chaudhary1 Rajeev K R Singp
Department of CSE Department of CSE
JIET Group of Institutions JIET Group of Institutions
Jodhpur, Rajasthan, India Jodhpur, Rajasthan, India
AbstractFeature selection is an important topic in data mining, especially for high dimensional dataset. Feature selection is a process commonly used in machine learning, wherein subsets of the features available from the data are selected for application of learning algorithm. The best subset contains the least number of dimensions that most contribute to accuracy. Feature selection methods can be decomposed into three main classes, one is filter method, another one is wrapper method and third one is embedded method. This paper presents an empirical comparison of feature selection methods and its algorithm. In view of the substantial number of existing feature selection algorithms, the need arises to count on criteria that enable to adequately decide which algorithm to use in certain situation. This paper reviews several fundamental algorithms found in the literature and assess their performance in a controlled scenario.
Keywords Feature selection, Filter method, Wrapper method, Information gain, Feature Selection Algorithms.

INTRODUCTION
The feature selection problem is inescapable in inductive machine learning or data mining setting and its significance is beyond doubt. The main benefit of a correct selection is the terms of learning speed, speculation capacity or simplicity of the induced model. On the other hand there are the straight benefits related with a smaller number of features: a reduced measurement cost and hopefully a better understanding of the domain. A feature selection algorithm (FSA) is a computational solution that should be guided by a certain definition of subset relevance although in many cases this definition is implicit or followed in a loose sense. This is so because, from the inductive learning perspective, the relevance of a feature may have several definitions depending on precise objective (Caruana and Freitag, 1994). Thus the need arises to count on common sense criteria that enable to adequately decide which algorithm to use or not to use in certain situation (Belanche and GonzÃ¡lez, 2011). The feature selection algorithm can be classified according to the kind of output one are giving a (weighed) linear order of features and second are giving a subset of the original features. In this research, several fundamental algorithms found in the literature are studied to assess their performance in a controlled scenario. This measure computes the degree of matching between the output given by a FSA and the known optimal solution. Sample size effect also studied. The result illustrates the strong dependence on the particular conditions of the FSA used and on the amount of irrelevance and redundancies in
the data set description, relative to the total number of feature. This should prevent the use of single algorithm even when there is poor knowledge available about the structure of the solution. The basic idea in feature selection is to detect irrelevant and/or redundant features as they harm the learning algorithm performance (Lee and Moore, 2014). There is no unique definition of relevance, however it has to do with the discriminating ability of a feature or a subset to distinguish the different class labels (Dash and Liu, 1997). However, as pointed out in the paper (Guymon and Elisseeff, 2003a), an irrelevant variable may be useful when taken with others and even two irrelevant variables that are useless by themselves can be useful when taken together.
Figure 1. Feature Selection Criteria

THE FEATURE SELECTION PROBLEM
Let X be the original set of features which cardinality X
= n. The continuous feature selection problem (also called feature weighing) refers to the assignment of weights wi to each feature xi X in such a way that the order corresponding to its theoretical relevance is preserved. The binary feature selection problem (also called feature subset selection) refers to the choice of a subset of feature that jointly maximizes a certain measure related to subset relevance. This can be carried out directly as many FSA (Almuallim and Dietterich, 1991: Caruana and Freitag, 1994 ) or setting a cut point in the output of this continuous
problem solution. Although both types can be seen in a unified way (the latter case corresponds to the assignment of weights in {0, 1}), these are quite dierent problems that reflect dierent design objectives. In the continuous case, one is interested in keeping all the features but in using them dierentially in the learning process. On the contrary in the binary case one is interested in keeping just a subset of the features and (most likely) using them equally in the learning process.
A common instance of the feature selection problem can be formally stated as follows. Let J be a performance evaluation measure to be optimized (say to maximize) defined as J: P(X) R+ {0}. This function accounts for a general evaluation measure that may or may not be inspired in a precise and previous definition of relevance. Let C(x) 0 represent the cost of variable x and call C(X)
= for p(X). Let Cx= C(X) be the cost of the whole feature set. It is assumed here that c is additive, that is, C(X X ) = C(X) + C(X) (Belanche and GonzÃ¡lez
,2011).

Relevance of a Feature
The purpose of a FSA is to identify relevant feature according to a definition of relevance. However, the notion of relevance in machine learning has not yet been rigorously defined on a common agreement (Bell and Wang, 2000). Let Ei, with 1 i n, be domains of feature X= {x1.., xn }: an instance space is defined as E= E1
Ã—.Ã—En . where an instance is a point in this space. Consider P a probability distribution on E and T a space of labels (classes). It is desired to model or identify an objective function c: E> T according to its relevant feature. A data set S composed by S instances can be seen as the result of sampling E under p a total of S times and labeling its element using c.
A Primary definition of relevance(Blum and Langley, 1997) is the notion of being relevant with respect to an objective. It is assumed here to be classification objective.
Definition 1 (Relevance with respect to an objective)
A feature xi X is relevant to an objective c if there exist two examples A, B in the instance space E such that A and B differ only in their assignment to xi and c(A) c(B). In other words, if there exist two instances that can only be classified thanks to xi . This definition has the inconvenience that the learning algorithm can not necessarily determine if a feature xi is relevant or not, using only a sample S of E. Moreover, if a problem representation is redundant (e.g. some features are replicated), it will never be the case that two instance differ only in one feature. A proposal oriented to solve this problem (John et al., 1994) include two notions of relevance, one with respect to a sample and another with respect to distribution.
Definition 2 (Strong relevance with respect to S)
A feature xi X is strongly relevant to the sample S if there exist two examples A, B S that only differ in their
assignment to xi and c(A) c(B). That is to say, it is the same definition 1, but now A, B S and the definition is respect to S.
Definition 3 (Strong relevance with respect to P)
A feature xi X is strongly relevant to an objective c in the dstribution p if their exist two examples A,B E with p(A) 0 and p(B) 0 that only differ in their assignment to xi and c(A) c(B).
This definition is natural extension of definition 2 and contrary to it, the distribution p is assumed to be known.
Definition 4 (Weak relevance respect to S)
A feature xi X is weakly relevant to the sample S if there exist a proper X X ( xi X) where xi is strongly relevant with respect to S. A weakly relevant feature can appear when a subset containing at least one strongly relevant feature is removed.
Definition 5 (Relevance as a complexity measure) (John et al., 1994)
Given a data sample S and an objective c, define r(S,c) as the smallest number of relevant feature to c using Definition 1 only in S, and such that the error in S is the least possible for the inducer.In other words, it refers to the smallest number of features required by a specific inducer to reach optimum performance in the task of modeling c using S.
Definition 6 (Incremental Usefulness) (Caruana and Freitag, 1994)
Given a data sample S , a learning algorithm L, and subset of feature X, the feature xi is incrementally useful to L with respect to X if the accuracy of the hypothesis that L produces using the group of features { xi X is better than the accuracy reached using only the subset of features X. This definition is especially natural n FSAs that search in the feature subset space in an incremental way, adding or removing features to a current solution. It is also related to a traditional understanding of relevance in the philosophy literature.
Definition 7 (Entropic relevance) (Wang, Bell, and Murtagh, 1998)
Denoting the Shannon entropy by H(x) and the mutual information by I(x;y) = H(x) H(xy) (the difference of entropy in x generated by the knowledge of y), the entropic relevance of x to y is defined as r(x:y) = I(x:y)H(y). Let X be the original set of feature and let C be the objective seen as a feature, a set X X is sufficient if I (X:C) = I(X,C). For a sufficient set X it turns out that r(X;C) = r (X;C). The most favorable set is that sufficient set X X for which H(X) is smaller. (Molina, Belanche, and Nebot, 2002)

Feature Selection
The main objective of feature selection are that it reduces the dimensionality of feature space, speedup and
reduce the cost of learning algorithms, improve the predictive accuracy of classification algorithm, and also improve the visualization and the comprehensibility of the induced concepts. The feature selection algorithm may be based on three major criterions such as based on some evaluation measure, based on search organization and based on the generation of successors (Guyon and Elisseeff, 2003a).
TABLE1.
FEATURE SELECTION METHODS CHARACTERIZATION BASED ON DIFFERENT CRITERION AND THEIR TYPES
Characterizatio
Types
Evaluation measure
Distance
Divergence
Information
Dependenc
Accuracy
Search Organization
Exponential
Sequential
Random
Generation of successors
Forward
Backward
Compound
Random
Weight

General Methods for Feature Selection
The relationship between a FSA and the inducer chosen to evaluate the usefulness of the feature selection process can take three main forms such as Filter, Wrapper and Embedded.

Filter Methods
These methods select features based on discriminating criteria that are relatively independent of classification. Several methods use simple correlation coefficients similar to Fishers discriminant criterion. Others adopt mutual information or statistical tests (ttest, Ftest). Earlier filter based methods evaluated features in isolation and did not consider correlation between features. Recently, methods have been proposed to select features with minimum redundancy. The methods proposed use a minimum redundancymaximum relevance (MRMR) feature selection framework. They supplement the maximum relevance criteria along with minimum redundancy criteria to choose additional features that are maximally dissimilar to already identified ones. By doing this, MRMR expands the representative power of the feature set and improves their generalization properties (Guyon and Elisseeff,2003a).
Figure 2. Filter Methods

Wrapper Methods
Wrapper methods utilize the classifier as a black box to score the subsets of features based on their predictive power. Wrapper methods based on SVM have been widely studied in machinelearning community. SVMRFE (Support Vector Machine Recursive Feature Elimination), in each recursive step, it ranks the features based on the amount of reduction in the objective function. It then eliminates the bottom ranked feature from the results. A number of variants also use the same backward feature elimination scheme and linear kernel (Kohavi and John, 1997a).
Figure 3. Wrapper Methods

Embedded Method
The inducer has its own FSA (either explicit or implicit). The methods to induce logical conjunctions provide an example of this embedding. Other traditional machine learning tools like decision trees or artificial neural networks are included in this scheme (Guyon and Elisseeff, 2003a)
Figure 4. Embedded Methods


Filter Based Feature Selection Method
Filter based feature selection methods may be broadly categorized into two categories:

Supervised
In supervised learning, the data is assigned to be known before computation and are used in order to learn the parameters that are really significant for clusters. Each object in supervised learning comes with a pre assigned class label.

Unsupervised
In unsupervised learning the datasets are assigned to segments without the cluster being known. Supervised and Unsupervised learning approaches further classified in univariate and Multivariate data. In univariate data analysis it is assumed that the response variable is influenced only by one other factor whereas in multivariate data analysis it is assumed that the response variable is influenced by multiple factors and even combination of factors (Guyon and Elisseeff, 2003a). Classification of filter based feature selection methods on the basis of supervised and unsupervised learning is shown below in Table2, evaluation function used by filter based feature selection method is shown in Table3 and brief description of filter based feature selection methods is shown in Table4.
TABLE 2
CLASSIFICATION OF FILTER BASED FEATURE SELECTION METHODS ON THE BASIS OF SUPERVISED AND UNSUPERVISED LEARNING
TABLE 3
EVALUATION FUNCTION USED BY FILTER BASED FEATURE SELECTION METHOD
Basic Criterion/ Evaluation function used
Examples
Distance based measures
Euclidean distance
Information theory based measure
Entropy, Information gain, mutual information
Data dependency measure
Correlation coefficient
Consistency based measure
Minimum feature bias
TABLE 4
Filter based feature selection method
Basic criterion
Supervised feature selection method
Fisher Score (Duda, Hart, and Stork, 2001)
Distance based, univariate filter method evaluating each feature individually
Relief F (Robnik ikonja and Kononenko, 2003)
mRmR (Peng, Long, and Ding, 2005)
Information theory based uses mutaual information
FCBF (Yu and Liu, 2003)
Based on information gain, Fast correlation based filter
SVMRFE
(Furey et al., 2000)
Ranks features based on their coefficients in the SVM classifier.
Unsupervised rank based feature selection methods
ttest score (Duda, Hart, and Stork, 2001)
Statistical, rank based feature selection approach
Bhattacharya distance
Entropy rank feature
Principal component analysis (PCA)
PCA finds a linear projection of high dimensional data into lower dimensional subspace
Laplacian score based
It is unsupervised feature selection algorithm. It is based on Laplacian Eigen maps.
Filter based feature selection method
Basic criterion
Supervised feature selection method
Fisher Score (Duda, Hart, and Stork, 2001)
Distance based, univariate filter method evaluating each feature individually
Relief F (Robnik ikonja and Kononenko, 2003)
A multivariate filter taking into account dependencies between features
mRmR (Peng, Long, and Ding, 2005)
Information theory based uses mutaual information
FCBF (Yu and Liu, 2003)
Based on information gain, Fast correlation based filter
SVMRFE
(Furey et al., 2000)
Ranks features based on their coefficients in the SVM classifier.
Unsupervised rank based feature selection methods
ttest score (Duda, Hart, and Stork, 2001)
Statistical, rank based feature selection approach
Bhattacharya distance
Entropy rank feature
Principal component analysis (PCA)
PCA finds a linear projection of high dimensional data into lower dimensional subspace
Laplacian score based
It is unsupervised feature selection algorithm. It is based on Laplacian Eigen maps.
BRIEF DESCRIPTION OF FILTER BASED FEATURE SELECTION METHODS
Filter based feature selection methods
Supervised
Unsupervised
Univariat e
Multivariat e
Univariat e
Multivariat e
Relief F(Robnik ikonja and Kononenko, 2003)
No
Yes
No
No
mRmR(Peng
, Long, and Ding, 2005)
No
Yes
No
No
FCBF(Yu
and Liu, 2003)
No
Yes
No
No
Fisher score(Duda, Hart, and Stork, 2001)
Yes
No
No
No
SVM
RFE(Furey et al., 2000)
No
Yes
No
No
ttest(Duda, Hart, and Stork, 2001)
No
No
Yes
No
Entropy based(Duda, Hart, and Stork, 2001)
No
No
Yes
No
Laplacian Score(He, Cai, and Niyogi, 2005)
No
No
Yes
No
PCA(Duda, Hart, and Stork, 2001)
No
No
No
Yes


WRAPPER BASED FEATURE SELECTION METHOD
Wrapper methods (Hall, 1999) are feedback methods which merge the machine learning algorithm in the feature selection process. Wrapper method search through the space of feature subset using a learning algorithm to guide the search. A search algorithm wrapped around the classification model. In search procedure the space of possible feature subset is defined and generated various subsets of features. Wrapper method can be divided in two groups these are deterministic and wrapper methods.
Deterministic wrapper method
This method search through the space of available feature either forward or backward. In forward selection single attribute are added to initially an empty set of attributes.
Randomized wrapper method
Randomized wrapper algorithms search the next feature subset partly at random. Single feature or several features at once can be added, removed or replaced from various feature set. The brief descriptions of wrapper based feature selection methods is shown in Table5.
TABLE 5
BRIEF DESCRIPTION OF WRAPPER BASED FEATURE SELECTION METHODS
Wrapp
Determini
Simple,
Risk of over
Sequential
er
stic
Interact with
fitting, More
forward
the
prone than
selection(SFS
classifier,
Randomized
), Sequential
Models
algorithm to
backward
feature
getting stuck
elimination(S
dependencie
in a local
BE), Plus L
s, Less
optimum(Gre
Minus R,
computation
edy Search),
Beam Search
ally
Classifier
intensive
dependent
than
Selection
randomized
method
Less prone
Computation
Simulated
to local
ally
annealing,
Randomiz
optima,
Intensive,
Randomized
ed
Interact with
classifier
hill climbing,
the
dependent
Genetic
classifier,
selection,
algorithms
Model
High risk of
feature
over fitting
dependencie
than
s
deterministic
algorithms

EMBEDDED FEATURE SELECTION METHOD
Embedded method (Saeys, Inza, and LarraÃ±aga, 2007) sometime also referred as nested subset method. It acts as an integral part of machine learning algorithm itself. During the operation of classification process, the algorithm itself decides which attribute to use and which to ignore. Embedded approach depends on a specific learning algorithm. Embedded methods are faster than wrapper
methods. Decision trees are the best example of embedded method.
Feature Selection Algorithms

CHI (X2Statistics)
This method measure the lack of independence between a term and category. CHISquared is the common statistical test that measures divergence from the distribution expected if one assumes the feature occurrence is actually independent of class value. The X2 test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) =P(A)P(B) or equivalently P(A B) = P(A) and P( BA) = P(B).(Liu and Setiono 1995) Feature selection using the X2 statistics is analogous to performing a hypothesis test on the distribution of the class as it relates to the values of the feature in question. Under the null hypothesis, if p of the instance have a given value and q of the instances are in a specific class, (p.q)/n instances have a given value and are in a specific class(n is the total number of instances in the data set)(Liu and Motoda, 2007). This is because p/n instances have the value and q/n instances are in the class, and if the probabilities are independent their joint probability is their product. Given the null hypothesis, the X2 static measure how far away the actual value is from the expected value.

Euclidian Distance
Euclidian Distance is the most common use of distance. In most cases when we talk about distance refer to Euclidian Distance. It examines the root of square differences between coordinates of a pair of object. For each feature Xi calculate Euclidian distance from it to all other features in sample. Euclidian distance d(Xi ; Yi) between features Xi and Yi is calculated using the formula distance(x,y) = {i (xi – yi)2 }Â½ (Dash and Liu, 1997).
This distance generally computed from raw data and not from standardized data.

tTest

The ttest assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups, and especially appropriate as the analysis for the posttestonly twogroup randomized experimental design. The formula for the ttest is a ratio. The top part of the ratio is just the difference between the two means or averages. The bottom part is a measure of the variability or dispersion of the scores(Guyon and Elisseeff, 2003b).
Information Gain
Information gain, of a term measures the number of bits of information obtained for category prediction by the presence or absence of the term in a document. Information Gain measures the decrease in entropy when the feature is given vs absent. This is the application of a more general technique, the measurement of informational entropy, to the problem of deciding how important a given feature is(Kira and Rendell, 1992). Informational entropy, when
measured using Shannon entropy, is notionally the number of bits of data it would take to encode a given piece of information. The more space a piece of information takes to encode, the more entropy it has. Intuitively, this makes sense because a random string has maximum entropy and cannot be compressed, while a highly ordered string can be written with a brief description of the strings information. In the context of classification, the distribution of instances among classes is the information in question. If the instances are randomly assigned among the classes, the number of bits necessary to encode this class distribution is high, because each instance would need to be enumerated. On the other hand, if all the instances are in a single class, the entropy would be lower, because the bitstring would simply say All instances save for these few are in the first class.(Joachims 1998) Therefore function measuring entropy must increase when the class distribution gets more spread out and be able to be applied recursively to permit finding the entropy of subsets of the data. The following formula satisfies both of these requirements:
H(D) = – (ni/n) log(ni/n) i =1,l
where dataset D has n = D  instances and ni members in class ci, i = 1, . . . , l.
The entropy of any subset is calculated as:
H(DX) = – (Xj/n)H(DXXj)
where H(DX = Xj) is the entropy calculated relative to the subset of instances that have a value of Xj for attribute

If X is a good description of the class, each value of that feature will have little entropy in its class distribution; for each value most of the instances should be primarily in one class. The information gain of an attribute is measured by the reduction in entropy (Kira and Rendell, 1992) defined as
IG(X) = H(D) H(DX)
The greater the decrease in entropy when considering attribute X individually, the more significant feature X is for prediction.
Correlation based feature selection (CFS)
Correlation based feature selection (CFS) searches feature subsets according to the degree of redundancy among the features. The evaluator aims to find the subsets of features that are individually highly correlated with the class but have low intercorrelation(Hall, 1999). The subset evaluators use a numeric measure, such as conditional entropy, to guide the search iteratively and add features that have the highest correlation with the class(Saeys, Inza, and LarraÃ±aga, 2007). The downside of univariate filters for example information gain is, it does not account for interactions between features, which is overcome by multivariate filters for example CFS. CFS evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. Correlation coefficients are used to estimate correlation between subset of attributes and class, as well as intercorrelations between the features. Relevance of a group of features grows with the correlation between features and classes, and decreases with growing intercorrelation. CFS is used to determine the best feature subset and is usually combined with search strategies such
as forward selection, backward elimination, bidirectional search, bestfirst search and genetic search(Yu and Liu, 2004).
Equation for CFS is given:
Where rzc is the correlation between the summed feature subsets and the class variable, k is the number of subset features, rzi is the average of the correlations between the subset features and the class variable, and rii is the average intercorrelation between subset features.

Fast Correlation based Feature Selection (FCBF)
Fast Correlation based Feature Selection (FCBF) (Yu and Liu, 2003) uses also the symmetrical uncertainty measure. But the search algorithm is very different. It is based on the predominance idea. The correlation between an attribute X* and the target Y is predominant if and only if y,x*et for all X(XX*), X,X* < Y,x*
Concretely, a predictor is interesting if its correlation with the target attribute is significant (delta is the parameter which allows to assess this one); there is no other predictor which is more strongly correlated to it.
Algorithm for FCBF

S is the set of candidate predictors, M = is the set of selected predictors

Searching X* (among S) which maximizes its correlation with Yy,x*

If y,x* add X* into M and remove X* from S

Remove also from S all the variables X such x,x* y,x*

If S then GOTO (2), else END of the algorithm
This approach is very useful when we deal with a dataset containing a very large number of candidate predictors. About the ability to detect the "best" subset of predictors and it is similar to CFS.


Sequential forward selection (SFS)
Sequential Forward Selection(Jain and Zongker, 1997) is the simplest greedy search algorithm. Starting from the empty set, sequentially add the feature x+ that results in the highest objective function J(Yk+x+) when combined with the features Yk that have already been selected.
Algorithm

Start with the empty set 0={}

Select the next best feature X+ = argmax [J(YkX)];
xÂ¢Yk

Update Yk+1= Yk+X+; K = K+1

Goto 2
SFS performs best when the optimal subset has a small number of features. When the search is near the empty set, a large number of states can be potentially evaluated. Towards the full set, the region examined by SFS is narrower since most of the features have already been selected. The search space is drawn like an ellipse to
emphasize the fact that there are fewer states towards the full or empty sets. As an example, the state space for 4 features is shown. Notice that the number of states is larger in the middle of the search tree. The main disadvantage of SFS is that it is unable to remove features that become obsolete after the addition of other features.


Sequential Backward Elimination (SBE)

Sequential Backward Elimination(Mao, 2004) works in the opposite direction of SFS. Also referred to as SBS (Sequential Backward Selection). Starting from the full set, sequentially remove the feature x that results in the smallest decrease in the value of the objective function J(Y x). Notice that removal of a feature may actually lead to an increase in the objective function J(Ykx)>J(Yk). Such functions are said to be nonmonotonic.
Algorithm

Start with the full set Y0 = X

Remove the worst feature X = argmax [J(YkX)]; x Yk

Update Yk+1=Yk X ; k=k+1

Goto 2
SBS works best when the optimal feature subset has a large number of features, since SBS spends most of its time visiting large subsets. The main limitation of SBS is its inability to reevaluate the usefulness of a feature after it has been discarded.
III. CONCLUSIONS
This paper presented an empirical comparison of feature selection methods and its algorithm. In view of the substantial number of existing feature selection algorithms, the need arises to count on criteria that enable to adequately decide which algorithm to use in certain situation. This paper also reviewed several fundamental algorithms found in the literature and assess their performance in a controlled scenario.
REFERENCES

Almuallim, H., & Dietterich, T. G. (1991, July). Learning with Many Irrelevant Features. In AAAI (Vol. 91, pp. 547552).

Athanasakis, D., ShaweTaylor, J., & FernandezReyes, D. (2013). Principled NonLinear Feature Selection. arXiv preprint arXiv:1312.5869.

Belanche, L. A., & GonzÃ¡lez, F. F. (2011). Review and evaluation of feature selection algorithms in synthetic problems. arXiv preprint arXiv:1101.2320.

Bell, D. A., & Wang, H. (2000). A formalism for relevance and its application in feature subset selection. Machine learning, 41(2), 175 195.

Caruana, R., & Freitag, D. (1994, July). Greedy Attribute Selection. In ICML(pp. 2836).

Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., & Slattery, S. (2000). Learning to construct knowledge bases from the World Wide Web. Artificial intelligence, 118(1), 69 113.

Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent data analysis, 1(3), 131156.

Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02), 185205.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. 2nd.Edition. New York.

Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., & Haussler, D. (2000). Support vector machine
classification and validation of cancer tissue samples using microarray expression data. Bioinformatics,16(10), 906914.

Gu, Q., Li, Z., & Han, J. (2012). Generalized fisher score for feature selection.arXiv preprint arXiv:1202.3725.

Gupta, P., Doermann, D., & DeMenthon, D. (2002). Beam search for feature selection in automatic SVM defect classification. In Pattern Recognition, 2002. Proceedings. 16th International Conference on (Vol. 2, pp. 212215). IEEE.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 11571182.

Hall, M. A. (1999). Correlationbased feature selection for machine learning(Doctoral dissertation, The University of Waikato).

Jain, A., & Zongker, D. (1997). Feature selection: Evaluation, application, and small sample performance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 19(2), 153158.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features (pp. 137142). Springer Berlin Heidelberg.

Kohavi, R., Sommerfield, D., & Dougherty, J. (1996, November). Data mining using 𝓂 𝓁 𝒸++ a machine learning library in c++. In Tools with Artificial Intelligence, 1996., Proceedings Eighth IEEE International Conference on (pp. 234 245). IEEE.

Ladha, L., & Deepa, T. (2011). FEATURE SELECTION METHODS AND ALGORITHMS. International Journal on Computer Science & Engineering, 3(5).

Lee, K., Joo, J., Yang, J., & Honavar, V. (2006). Experimental comparison of feature subset selection using GA and ACO algorithm. In Advanced Data Mining and Applications (pp. 465 472). Springer Berlin Heidelberg.

Lee, M. S., & Moore, A. W. (2014, June). Efficient algorithms for minimizing cross validation error. In Machine Learning Proceedings 1994: Proceedings of the Eighth International Conference (p. 190). Morgan Kaufmann.

Liu, H., & Motoda, H. (Eds.). (2007). Computational methods of feature selection. CRC Press.

Liu, H., & Setiono, R. (1996, July). A probabilistic approach to feature selectiona filter solution. In ICML (Vol. 96, pp. 319327).

Mao, K. Z. (2004). Orthogonal forward selection and backward elimination algorithms for feature subset selection. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 34(1), 629634.

Masaeli, M., Dy, J. G., & Fung, G. M. (2010). From transformation based dimensionality reduction to feature selection. In Proceedings of the 27th International Conference on Machine Learning (ICML 10) (pp. 751758).

Molina, L. C., Belanche, L., & Nebot, Ã€. (2002). Feature selection algorithms: A survey and experimental evaluation. In Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on (pp. 306313). IEEE.

Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of maxdependency, maxrelevance, and minredundancy.Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8), 12261238.

Reinartz, T. (1999). Focusing solutions for data mining: analytical studies and experimental results in realworld domains. Springer Verlag.