**Open Access**-
**Total Downloads**: 14 -
**Authors :**Amaningayya -
**Paper ID :**IJERTCONV3IS19007 -
**Volume & Issue :**ICESMART – 2015 (Volume 3 – Issue 19) -
**Published (First Online):**24-04-2018 -
**ISSN (Online) :**2278-0181 -
**Publisher Name :**IJERT -
**License:**This work is licensed under a Creative Commons Attribution 4.0 International License

#### Dimensionality Reduction in Data Sets Without Compromising Clustering Efficiency

Amaningayya Dept. of CSE, AMC Bangalore, India.

AbstractFeature subset selection is the effective way for increasing learning accuracy, improving result comprehensibility, plummeting dimensionality and removing extraneous data. Lots of feature subset selection methods have been presented and deliberate for machine learning applications.Whilethe effectiveness concerns the time required to find the subset of features, the efficiency is related to the excellence of subset of features.Based on the above criteria, the fast clustering-based feature selection algorithm (FAST) is presented and the experimentally evaluated in thispaper. The FAST algorithm works in the two steps. In the first step, the features are divided into clusters by using the graph-theoretic clusteringmethods. Secondly, the most representative feature, i.e., strongly related to target the classes is selected from the each of the cluster toform a subset of these features. Features in the different clusters are relatively autonomous; the clustering- based strategy of the FAST has highlikelihood ofproducing the subset of the useful and the autonomous features. To ensure the efficiency of the FAST, we take on the proficientMinimum- Spanning Tree (MST) clustering method. The competence and the effectiveness of the FAST algorithm are calculated through anexperiential study. widespread experiments are carried out to compare the FAST and the several representative feature selection algorithms,they are: FCBF, CFS, ReliefF, FOCUS-SF and Consist with respect to the four types of well-known classifiers, they are: the probabilitybasedNaive Bayes and the tree-based C4.5, before and after feature selection.

KeywordsCloud computing; Outscourcing; k Nearest neighbor; Voronoi; pre-computed;

INTRODUCTION

Feature subset selection refers to selecting the best set of an attributes to describe the datasets without losing any information for the clustering or the classification. Feature subset selection is effective way for the dipping dimensionality, improving result comprehensibility, removing extraneous data and the increasing learning accuracy. Many feature subset selection methods have been presented and deliberate for the machine learning applications. All of these can be divided into the four broad categories like, the Entrenched, Wrapper, Filter, and the Hybrid approaches.

They mainly are focusing on the combining filter and wrapper methods to get the best likely performance with the particular learning algorithm, with the similar time complexity of filter methods. The wrapper methods are the computationally expensive and lean to over-fit on the tinytraining sets. In the filter methods,addition to the generality, filter methods are usually the good choice when the number of the features is very huge.

In the filter feature selection methods, the cluster analysis applications have been established to be a more effective than the conventional feature selection algorithms.In cluster analysis, the graph-theoretic methods have been well deliberate and used in the many applications. Their results havethe best agreement with the human performance.

Feature subset selection can be seen as the process of identifying and the removing as many immaterial and superfluous features as possible. This is because: (i)Extraneous features do not give to the analytical accuracy, and (ii) Superfluous features do not redound to getting the better predictor for that they provide typically information which is already present in other features.

The entrenched methods incorporate feature selection is the part of the training process and they are usually specific tothe given learning algorithms; therefore they may be furtherefficient than the other 3 categories [28]. Conventionalmachine learning algorithms like the decision trees or the artificialneural networks are examples of the entrenched approaches.The wrapper method uses the predictive accuracy of thepredetermined learning algorithm to determine the decencyof the selected subsets and the precision of the learningalgorithms is typically high. However, generality of theselected quality is limited and computational complexityis huge. The filter methods are autonomous of learningalgorithms, with the good generality andtheir computationalcomplexity is less, but the correctness of the learningalgorithms is not definite [13]. The hybridmethods are a combination of the filter and the wrapper methods [15] by using the filter method to reducethe search space that will be measured by the subsequentwrapper methods. They mainly were focusing on combining the filter andthe wrapper methods that to achieve the best possible accuracywith the particular learning algorithm, with the similar timecomplexity of the filter methods. The wrapper methodsare computationally costly and they lean to over-fit on smalltraining data sets [13] and [15]. The filter methods, they are addition totheir overview and they are usually a good choice whenever the numberof the features are very large.

For filter feature selection methods, thecluster analysis application has been established to bemore and more effective than the conventional feature selection algorithms.Dhillonet al. [18] employ the words for distributional clustering toreduce the dimensionality of the text data.

In clustering analysis, the graph-theoretic methods have beendeliberate and used in the many applications. The resultshavethe best agreement with the human

performance[3]. The general graph-theoretic clustering issimple: compute the neighbourhood graph of instances, and thendelete any edge in the graph that is much longer/shorterthan its neighbours. The resultis the forest and each tree in the forest represents a cluster. Inour study, we apply the graph-theoretic clustering methods tofeatures. In particular, we adopted the minimum spanningtree (MST)- based clustering algorithms, because they do notassumes that data points are grouped around the centres orseparated by the regular geometric curve and have beenwidely used in practice.

An extraneous feature, along with the superfluous features, severely affects the accuracy of the learning machines from the data set. The Feature subset selection should be able to identify and it remove as much as the extraneous and superfluous information as possible as from the dataset. In this paper, we presented theFast clustering- based feature Selection algorithm (FAST) based on the MST. Fig 1, Shows the presented architecture.The FAST algorithm works in two steps.

In the first step, the features are divided into the clusters by using the graph-theoretic clustering methods.

Secondly, the most representative feature that is strongly related to the target classes is selected from each of the cluster to form the final subset of features.

Fig 1: Framework of the presented feature subset selection algorithm.

The step 1 is based only on entropy [deviation in the values across the dataset]. But we will prove that entropy measure is not always accurate in identifying the extraneous feature removal. The feature relevancy to the clustering depends on the domain area of classification. We present to extract the semantic relation between the domain area of application & the features. Based on the domain analysis, the extraneous features will be identified & removed in the Step

1. This will result in highly relevant features subsets. Semantic analysis will be drawn based on web mining the

domain area concept. We will presentthe algorithm to mine he web & provide the relation of any features to the domain area concept.

The hierarchical clustering has been adopted inthe word selection in the context of a text classification (e.g.,[4], and [18]). The distributional clustering has been used tocluster the words into the groups based either on their participationin the particular grammatical relations with each other words byPereira et al. [5] or on the distribution of the class labelsassociated with the each word by Baker and McCallum [4]. Asdistributional clustering of the words are agglomerative inthe nature, and the result in suboptimal word clusters and elevatedcomputational cost, Dhillon et al. [18] presented a newinformation-theoretic divisive algorithm for the word clusteringand applied it to the text classification. Butterworth et al. [8]presented to cluster features using the special metric ofBarthelemy-Montjardet distance, and then makes uses of thedendrogram of the resulting cluster hierarchy to choose themainly relevant attributes. Unluckily, the cluster evaluationmeasure based on Barthelemy- Montjardet distancedoes not identify the feature subset that allows the classifiersto improve their original performance accuracy. Furthermore, even compared with the other feature selection methodsthey obtained accuracy is lower.

The hierarchical clustering also has been used to selectthe features on spectral data. Van Dijck and Van Hulle [6]presented a hybrid filter/wrapper feature subset selectionalgorithm for the regression. Krier et al. [29] presented amethodology that combining hierarchical constrained clusteringof the spectral variables and the selection of clusters by mutualinformation. Their feature in clustering method is similar tothat of Van Dijck and Van Hulle [6] except that the formerforces the every cluster tothe contain consecutive features only.Both the methods employed agglomerative hierarchical clusteringto remove superfluous features.

FEATURE SUBSET SELECTION ALGORITHM Extraneous features, along with the superfluous features,

sternlyaffect the accuracy of the learning machines [31], [27]. Thus,the feature subset selection should be able to identify andto remove as much as of the extraneous and superfluous informationas possible. Moreover, the good feature subsets containthe features highly correlated with the class, and yetuncorrelated witheach other. [30]Keeping these in the mind, we develop a new algorithmwhich can efficiently and effectively deal with the both extraneousand superfluous features, and to obtain a good feature subset.We achieve this through the new feature selection framework(shown in Fig.

1) which composed of the two connectedcomponents of extraneous feature removal and the superfluous featureelimination. The former obtains features relevant to the targetconcept by eliminating the extraneous ones, and the latterremoves the superfluous features from relevant ones via choosingthe representatives from different feature clusters, and thusgenerates the final subset.

The extraneous feature removal is straightforward once theright relevance measure is defined or it selected, while thesuperfluous feature elimination is a bit of sophisticated. In ourpresented FAST algorithm, it involves the following 1) the construction ofthe minimum spanning tree from the weighted completegraph; 2) the partitioning of the MST into

a forest with eachtree representing of a cluster; and 3) the selection ofthe representative features from the clusters.

In order to more precisely introduce the algorithm, because our presented feature subset selection frameworkinvolves the extraneous feature removal and the superfluous featureelimination.The relevant features have the strong correlation with the targetconcept so are always necessary for a best subset, but whilesuperfluous features are not because their values are completelyconcurrent with each other. Thus, the notions of featureredundancy and the feature relevance are normally in terms ofthe feature correlation and the feature-target concept correlation.

Mutual information measures how much of the distributionof the feature values and the target classes differ from statisticalindependence. This is a nonlinear estimation of the correlationbetween feature values or the feature values and the target classes.The symmetric uncertainty (SU) is derived from themutual information by the normalizing it to the entropies ofthe feature values or the feature values and target classes, and hasbeen used to evaluate the decency of features forclassification by the number of researchers (e.g., Hall [29],Hall and Smith [30]). Therefore, we prefer symmetric uncertainty asthe measure of the correlation between either two features or thefeature and the target concept.

The presented FAST algorithm logically consists of threesteps: 1) removing extraneous features, 2) constructing anMST from the relative ones, and 3) partitioning of the MST and theselecting representative features.

For a data set D with m features F = {F1, F2,. . ., Fm} and class C, we compute the T-Relevance SU(Fi, C) value foreach feature Fi(1 i m) in the first step. The featureswhose SU(Fi, C) values are greater than a predefinedthreshold _ comprise the target-relevant feature subsetF'={F'1; F'2; . . . ; F'k}(k m).

In the second step, we first calculate the F- CorrelationSU(F'i,F'j) value for each pair of features F'i and F'j (F'i,F'j F'i j). Then, viewing features F'i and F'j asvertices and SU(F'i, F'j)(i j) as the weight of the edgebetween vertices F'i and F'j , a weighted complete graphG =(V, E) is constructed where V ={F'i|F'i F' i [1, k]} and E =((F'i, F'j)|(F'i,F'j F'i, j [1, k] i j}.As symmetric uncertainty is symmetric further the F-CorrelationSU(F'i, F'j) is symmetric as well, thus G is anundirected graph.

The complete graph G reflects the correlations among allthe target-relevant features. Unfortunately, graph G hask vertices and k (k 1)/2 edges. For high-dimensional data,it is heavily dense and the edges with different weights arestrongly inter-weave. Moreover, the decomposition ofcomplete graph is NP-hard [26]. Thus for graph G, webuild an MST, which connects the all vertices such that the sumof the weights of the edges is the minimum, using this wellknownPrim algorithm. The weight of edge (F'i, F'j) is F- CorrelationSU(F'i,F'j).

Assuming the set of vertices in any one of the final treesto be V (T), we have the property that for each pair ofvertices (F'i,F'jV(T)), SU(F'i, F'j) SU(F'i, C) SU(F'i, F'j) SU(F'j,

C) always holds. We knowthat this property guarantees the features in V (T) aresuperfluous.

The detail of the FAST algorithm is shown in Algorithm

1.

Algorithm 1. FAST

Time complexity analysis: The major amount of work forAlgorithm 1 involves the computation of SU values for T- Relevanceand F-Correlation, which has linear complexity interms of the number of instances in a given data set. The firstpart of the algorithm has a linear time complexity O(m) interms of the number of features m. Assuming k(1 k m)features are selected as relevant ones in the first part, whenk = 1, only one feature is selected. Thus, there is no need tocontinue the rest parts of the algorithm, and the complexity isO(m). When 1 < k m, the second part of the algorithm firstconstructs a complete graph from relevant features and thecomplexity is O(k2), and then generates an MST from thegraph using Prim algorithm whose time complexity is O(k2).The third part partitions the MST and chooses the representativefeatures with the complexity of O(k). Thus when1 < k m, the complexity of the algorithm isO(m + k2). Thismeans when k , FAST has linear complexity O(m),while obtains the worst complexity O(m2) when k = m.

TABLE 1: Summary of the 35 Benchmark Data Sets

However, k is heuristically set to be lgin theimplementation of AST. So the complexity is O(m * lg2 m),which is typically less thanO(m2) since lg2m<m. This can beexplained as follows. Let f(m)= m – lg2 m, so the derivativef'(m)=1-2lg e /m, which is greater than zero whenm

> 1.So f(m) is the increasing function and it is the greater than f(1)which is equal to 1, i.e.,m > lg2 m, whenm > 1. This meansthe bigger the m is, the beyond in the time complexity of FASTdeviates from O(m2). Thus, on the high-dimensional data, thetime complexity of the FAST is far more less than the O(m2). Thismakes the FAST has a better runtime performance with the high-dimensionaldata.

DATA SOURCE AND EXPERIMENTAL PROCEDURE

Data Source

For the purposes of the evaluating the performance andthe effectiveness of our presented FAST algorithm, verifyingwhether or not the method is potentially useful in practice,and allowing the other researchers to confirm our results,35 publicly available data sets were used.

The numbers of features of the 35 data sets vary from37 to 49,52 with the mean of 7,874. The dimensionality of the54.3 percent data sets exceeds the value 5,000, of which the 28.6 percentdata sets have more than the 10,000 features.

The 35 data sets cover the range of the application domainssuch as the text, image and the bio microarray data classification.Table 12 shows the equivalent statistical information.Note that for the data sets with continuous- valuedfeatures, the well-known off-the-shelf MDL method [7]was used to discretize the continuous values.

Experimental Procedure

In order to make the best use of the data and to obtain the stableresults, a (M = 5) Ã—(N = 10) the cross-validation strategy isused. That is, for each of the data set, each feature subset selectionalgorithm and each of the classification algorithm, the 10-foldand cross-validation is repeated M = 5 times, with each time of theorder of the instances of the data set being randomized.This is because many of the algorithms show order effects,in that certain orderings the dramatically improve or the degradeperformance [21]. The randomizing the order of the inputs canhelp diminishes the order effects.

Procedure: Experimental Process

In the experiment, for each of the feature subset selectionalgorithm, we obtain the M Ã— N feature subsets Subset and thecorresponding runtime Time with each of the data set. Average[Subset] and Time, we obtain the number of selected features,further the quantity of the selected features and the correspondingruntime for each of the feature selection algorithm oneach of data set. For each classification algorithm, we obtainthe M Ã— N classification Accuracy for each of the feature selectionalgorithms and each of the data set. Average of these Accuracy, weobtain mean accuracy of each classification algorithm underthe each feature selection algorithm and the each data set. Themethod Experimental Process shows the details.

RESULTS AND ANALYSIS

In this section, we described the experimental results in termsof the quantity of the selected features, the classification accuracy, and the time to obtain thefeature subset, and the Win/Draw/Loss record.

TABLE 2:Quantity of the Selected Features of the Six Feature Selection Algorithms

Quantity of Selected Features

Table 2 records the quantity of selected features of the sixfeature selection algorithms for each of the data set. From this wesee that

Generally all of the six algorithms achieve significant reduction of the dimensionality by selecting the only a small portion of the original features. The FAST, in average, obtains the best quantity of the selected features of 1.82 percent. The Win/Draw/Lossrecord shows that FAST wins other algorithms as well.

For all image data, the quantity of the selected features of each algorithm has an increment; compared with the corresponding average amount of selected features on the given data sets except it Consist has an improvement. This reveals that the five algorithms are not very suitable to choose features for all image data compared with the microarray and the text data. The FAST ranks 3 with the amount of the selected features of 3.59 percent that has a tiny margin of the 0.11 percent to the first and the second best quantity of the selected features 3.48 percent of Consist and the FOCUS-SF, and the margin of 76.59 percent to the

worst quantity of the selected features 79.85 percent of ReliefF.

TABLE 3:The Runtime (in ms) of the Six Feature

Selection Algorithms

For the microarray data, the quantity of selected features has been improved by the each of the six algorithms compared with the given data sets. This indicates that the six algorithms work well with the microarray data. FAST ranks 1 with the quantity of the selected features of the 0.71 percent. In the six algorithms, only CFS cannot able to choose the features for two data sets whose dimensionalities are like: 19,994 and 49,152.

For the text data, the FAST ranks 1 again with a margin of the 0.48 percent to the second best algorithm FOCUS-SF.

Runtime

Table 3 records the runtime of the six feature selectionalgorithms. From it we see that

Generally the entity evaluation-based featureselection algorithms are FAST, FCBF, and ReliefF aremuch faster than the subset evaluation based algorithmsof the CFS, Consist, and the FOCUS-SF. The FAST isconsistently faster than all of the other algorithms.

Theruntime of the FAST is only 0.1 percent of that of CFS,2.4 percent of that of Consist, 2.8 percent of that ofFOCUS-SF, 7.8 percent of that of ReliefF, andthe

76.5 percent of that for FCBF, respectively. The Win/Draw/Loss record shows that the FAST outperformsthe other algorithms as well.

TABLE 4: The Accuracy of Naive Bayes with the Six Feature Selection Algorithms

For the image data, the FAST obtains the rank 1and Its runtime is only 0.02 percent of that of CFS, 18.50 percent of that of ReliefF, 25.27 percent of that of Consist, 37.16 percent of that of FCBF, and the 54.42 percent of that for the FOCUS-SF, respectively. This reveals the FAST is more efficient than the others when choosing the features for all image data.

TABLE 5: The Accuracy of C4.5 with the Six Feature

Selection Algorithms

For the microarray data, the FAST rankis 2 and its runtime is only0.12 percent of that of CFS, 15.30 percent of that of Consist, 18.21 percent of that of ReliefF, 27.25 percent of that of FOCUS-SF, and the

125.59 percent of that for FCBF, respectively.

For all text data, FAST rank is 1 andits runtime is 1.83 percent of that of Consist, 2.13 percent of that of FOCUS-SF, 5.09 percent of that of CFS, 6.50 percent of that of ReliefF, and 79.34 percent of that for the FCBF, respectively. This indicates that the FAST is more efficient than the others when we choosing the features for all text data as well.

Classification Accuracy

The tables 4 and 5 show the 10-fold of cross- validationaccuracies of the four different types of the classifiers on the35 data sets before and after for each feature selectionalgorithm is performed, respectively.

Table 4 shows the classification accuracy for Naive Bayes algorithm.From this we can see that

Compared with the original data, the classificationaccuracy of Naive Bayes has been improved by the FAST,CFS, and FCBF by 12.86, 6.62, and 4.32 percent,respectively. Unfortunately, ReliefF, Consist, andFOCUS-SF have been decreased the

classification accuracyby 0.32, 1.35, and 0.86 percent, respectively. The FASTrank 1 with a margin of the

6.24 percent to the second best accuracy is80.60 percent for the CFS. At the same time,the Win/Draw/Loss records show that the FAST outperformsthe all other five algorithms.

For all image data, the classification accuracy of the NaÃ¯ve Bayes has been improved by the FCBF, CFS, FAST, and ReliefF by the 6.13, 5.39, 4.29, and 3.78 percent, respectively. However, Consist and the FOCUS-SF have decreased the classification accuracy by the 4.69 and 4.69 percent, respectively. This time the FAST ranks 3 with a margin of the 1.83 percent to the best accuracy 87.32 percent for FCBF.

For all microarray data, the classification accuracy of Naive Bayes has been increased by all the six algorithms they are FAST, CFS, FCBF, ReliefF, Consist, and FOCUS-SF by 16.24, 12.09, 9.16, 4.08, 4.45, and 4.45 percent, respectively. The FAST ranks 1 with a margin of the 4.16 percent to the second best accuracy 87.22 percent for CFS. This indicates that the FAST is more effective than the others when we are using Naive Bayes to classify the all microarray data.

For all the text data, the FAST and CFS have improved the classification accuracy for the Naive Bayes by

13.83 and 1.33 percent, respectively. Other four algorithms ReliefF, Consist, FOCUS-SF, and FCBF have decreased the accuracy by the 7.36, 5.87, 4.57, and 1.96 percent, respectively. The FAST rank 1 with the margin of 12.50 percent to the second best accuracy is70.12 percent of the CFS.

Table 5 shows the classification accuracy of C4.5. From itwe see that

Compared with original data, the classificationaccuracy of C4.5 has been improved by FAST, FCBF,and FOCUS-SF by 4.69, 3.43, and 2.58 percent,respectively. Unfortunately, ReliefF, Consist, andCFS have decreased the classification accuracy by3.49, 3.05, and 2.31 percent, respectively. FASTobtains the rank of 1 with a margin of 1.26 percentto the second best accuracy 81.17 percent of FCBF.

For image data, the classification accuracy of C4.5 has been improved by all the six feature selection algorithms FAST, FCBF, CFS, ReliefF, Consist, and FOCUS-SF by 5.31, 4.5, 7.20, 0.73, 0.60, and 0.60 percent, respectively. This time FAST ranks 2 with a margin of 1.89 percent to the best accuracy 83.6 percent of CFS and a margin of 4.71 percent to the worst accuracy 76.99 percent of Consist and FOCUS- SF.

For microarray data, the classification accuracy of C4.5 has been improved by all the six algorithms FAST, FCBF, CFS, ReliefF, Consist, and FOCUS-SF by 11.42, 7.14, 7.51, 2.00, 6.34, and 6.34 percent, respectively. FAST ranks 1 with a margin of 3.92 percent to the second best accuracy 79.85 percent of CFS.

For text data, the classification accuracy of C4.5 has been decreased by algorithms FAST, FCBF, CFS, ReliefF, Consist, and FOCUS-SF by 4.46, 2.70, 19.68, 13.25, 16.75, and 1.90 percent, respectively. FAST ranks 3 with a margin of 2.56 percent to the best accuracy 83.94 percent of FOCUS-SF and a margin of

15.22 percent to the worst accuracy 66.16 percent of CFS.

Sensitivity Analysis

The network k-nearest-neighbour query verification process is shown in Algorithm 1.

Like many of the other feature selection an algorithm, our presented FAST method isalso requires a parameter that is the threshold ofthe feature relevance. The different values might end withthe different classification results.In order to explore the parameter values result in thebest categorizationcorrectnessfor specific classification problems with a given classifier, a 10-fold cross-validationstrategy was employed to disclose how the classificationaccuracy is changing with the value of the parameter .

CONCLUSION

In this paper, we have presented the new clustering-based feature subset selection algorithm for the high dimensional data. The algorithm involves (i) The removing irrelevant features, (ii) The constructing a minimum spanning tree from relative 1s, (iii) The partitioning the MST and the selecting representative feature. In the presented algorithm, a cluster consists of the features. Each of these clustersis treated as single feature and thus the dimensionality is radically reduced. We have compared the efficiency of the presented algorithm with the five well-known feature selection algorithms they are: FCBF, ReliefF,FOCUS-SF, CFS and Consist in the 35 publicly available images, microarray, and the text data from the 4 different aspect of the quantityforthe selected features, runtime, the classification accuracy of a given classifier, and the Win/Draw/Loss record. Generally, the presented algorithm obtained by the best quantity of the selected features, the best runtime, andthebest categorization accuracy for the Naive Bayes andfor the C4.5. The Win/Draw/Loss records confirm the conclusions.

REFERENCES

H. Almuallim and T.G. Dietterich, Algorithms for Identifying Relevant Features, Proc. Ninth Canadian Conf. Artificial Intelligence, pp. 38-45, 1992.

H. Almuallim and T.G. Dietterich, Learning Boolean Concepts in the Presence of Many Irrelevant Features, Artificial Intelligence, vol. 69, nos. 1/2, pp. 279-305, 1994.

J.W. Jaromczyk and G.T. Toussaint, Relative Neighborhood Graphs and Their Relatives, Proc. IEEE, vol. 80, no. 9, pp. 1502- 1517, Sept. 1992.

L.D. Baker and A.K. McCallum, Distributional Clustering of Words for Text Classification, Proc. 21st Ann. Intl ACM SIGIR Conf. Research and Development in information Retrieval, pp. 96-103, 1998.

F. Pereira, N. Tishby, and L. Lee, Distributional Clustering of English Words, Proc. 31st Ann. Meeting on Assoc. for Computational Linguistics, pp. 183-190, 1993.

G. Van Dijck and M.M. Van Hulle, Speeding Up the Wrapper Feature Subset Selection in Regression by Mutual Information Relevance and Redundancy Analysis, Proc. Intl Conf. Artificial Neural Networks, 2006.

M.F. Usama and B. Keki, Irani: Multi-Interval Discretization of Continuousvalued Attributes for Classification Learning, Proc. 13th Intl Joint Conf. Artificial Intelligence, pp. 1022-1027, 1993.

R. Butterworth, G. Piatetsky-Shapiro, and D.A. Simovici, On Feature Selection through Clustering, Proc. IEEE Fifth Intl Conf. Data Mining, pp. 581-584, 2005.

C. Cardie, Using Decision Trees to Improve Case-Based Learning, Proc. 10th Intl Conf. Machine Learning, pp. 25-32, 1993.

P. Chanda, Y. Cho, A. Zhang, and M. Ramanathan, Mining of Attribute Interactions Using Information Theoretic Metrics, Proc. IEEE Intl Conf. Data Mining Workshops, pp. 350-355, 2009.

S. Chikhi and S. Benhammada, ReliefMSS: A Variation on a Feature Ranking Relieff Algorithm, Intl J. Business Intelligence and Data Mining, vol. 4, nos. 3/4, pp. 375-390, 2009.

W. Cohen, Fast Effective Rule Induction, Proc. 12th Intl Conf. Machine Learning (ICML 95), pp. 115-123, 1995.

M. Dash and H. Liu, Feature Selection for Classification, Intelligent Data Analysis, vol. 1, no. 3, pp. 131-156, 1997.

M. Dash, H. Liu, and H. Motoda, Consistency Based Feature Selection, Proc. Fourth Pacific Asia Conf. Knowledge Discovery and Data Mining, pp. 98-109, 2000.

S. Das, Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection, Proc. 18th Intl Conf. Machine Learning, pp. 74- 81, 2001.

M. Dash and H. Liu, Consistency-Based Search in Feature Selection, Artificial Intelligence, vol. 151, nos. 1/2, pp. 155-176, 2003.

J. Demsar, Statistical Comparison of Classifiers over Multiple Data Sets, J. Machine Learning Res., vol. 7, pp. 1-30, 2006.

I.S. Dhillon, S. Mallela, and R. Kumar, A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification, J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.

E.R. Dougherty, Small Sample Issues for Microarray-Based Classification, Comparative and Functional Genomics, vol. 2, no. 1, pp. 28-34, 2001.

U. Fayyad and K. Irani, Multi-Interval Discretization of Continuous- Valued Attributes for Classification Learning, Proc. 13th Intl Joint Conf. Artificial Intelligence, pp. 1022-1027, 1993.

- p>D.H. Fisher, L. Xu, and N. Zard, Ordering Effects in Clustering, Proc. Ninth Intl Workshop Machine Learning, pp. 162-168, 1992.
F. Fleuret, Fast Binary Feature Selection with Conditional Mutual Information, J. Machine Learning Research, vol. 5, pp. 1531-1555, 2004.

G. Forman, An Extensive Empirical Study of Feature Selection Metrics for Text Classification, J. Machine Learning Research, vol. 3, pp. 1289-1305, 2003.

M. Friedman, A Comparison of Alternative Tests of Significance for the Problem of m Ranking, Annals of Math. Statistics, vol. 11, pp. 86- 92, 1940.

S. Garcia and F. Herrera, An Extension on Statistical Comparisons of Classifiers over Multiple Data Sets for All Pairwise Comparisons, J. Machine Learning Res., vol. 9, pp. 2677-2694, 2008.

M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman & Co, 1979.

R. Kohavi and G.H. John, Wrappers for Feature Subset Selection, Artificial Intelligence, vol. 97, nos. 1/2, pp. 273-324, 1997.

I. Guyon and A. Elisseeff, An Introduction to Variable and Feature Selection, J. Machine Learning Research, vol 3, pp. 1157- 1182, 2003.

C. Krier, D. Francois, F. Rossi, and M. Verleysen, Feature Clustering and Mutual Information for the Selection of Variables in Spectral Data, Proc. European Symp. Artificial Neural Networks Advances in Computational Intelligence and Learning, pp. 157-162, 2007.

M.A. Hall and L.A. Smith, Feature Selection for Machine Learning: Comparing a Correlation-Based Filter Approach to the Wrapper, Proc. 12th Intl Florida Artificial Intelligence Research Soc. Conf., pp. 235- 239, 1999.

M.A. Hall, Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning, Proc. 17th Intl Conf. Machine Learning, pp. 359-366, 2000.