A Survey on Different Feature Selection Method for Cancer Biomarker Discovery.

G. Keerthana; Dr. K. Sakthivel

doi:10.17577/IJERTCONV3IS16077

TITCON - 2015 (Volume 3 - Issue 16)

Open Access
Paper ID : IJERTCONV3IS16077
Published (First Online): 30-07-2018
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

G. Keerthana

M.E-CSE (1st Year), K.S.Rangasamy College of Technology

Tiruchengode, India

Dr. K. Sakthivel

Professor, K.S.Rangasamy College of Technology

Tiruchengode, India

Abstract: Image processing in medical diagnosis involves stages such as image capture, image enhancement, image segmentation and feature extraction. Identifying key biomarkers for different cancer types can improve diagnostic accuracy and treatment. The identification of biomarkers for early detection could be a promising strategy to decrease mortality. Gene expression data can help differentiate between cancer subtypes. However, a dataset typically contains a small number of samples and a much larger number of genes, which leads to overfitting of classification models. Feature selection is a widely used dimensionality reduction technique and one of the key topics in machine learning and related fields. It can remove irrelevant and even noisy features and hence improve the quality of the data set and the performance of learning systems. Feature selection methods can help select the most distinguishing feature sets for classifying different cancers. Many feature selection approaches are available, such as the F-statistic, Maximum Relevance Binary Particle Swarm Optimization (MRBPSO), Class Dependent Multi-category Classification (CDMC) and the Biomarker Identifier system. Feature selection methods that combine filter and wrapper based methods improve computational efficiency and are robust against overfitting.

Keywords: Features, Biomarkers, Classification, Filter, Wrapper

INTRODUCTION

Classification of data has been successfully applied to a wide range of application areas, such as scientific experiments, medical diagnosis, credit approval, weather prediction, customer segmentation, target marketing and fraud detection. The development of gene expression technologies such as microarray and RNA-seq has made it easier to monitor the expression pattern of thousands of genes simultaneously, and a huge amount of gene expression data has been produced by these experiments. Cancer classification has been investigated as a computational route to identifying tumour biomarkers.

Expression datasets normally consist of a large number of genes compared with a limited number of samples. Due to the high dimensionality of gene expression data, feature selection techniques are used to select a small subset of key genes that change under different cancer conditions. This potentially decreases clinical cost by testing on fewer biomarker genes and improves the accuracy of disease diagnosis by reducing data dimensionality and removing noisy features.

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features for use in model construction. The central assumption when using a feature selection technique is that the data contains many redundant or irrelevant features. Redundant features are those which provide no more information than the currently selected features, and irrelevant features provide no useful information in any context.

A feature selection technique is to be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples.
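The distinction between selection and extraction can be illustrated with a minimal sketch. The gene names, the expression values and the helper functions `select_features` and `extract_features` are invented for illustration, and the two derived features are an arbitrary choice rather than anything prescribed by the survey.

```python
# Toy expression matrix: rows are samples, columns are features (genes).
# The values and gene names are invented purely for illustration.
feature_names = ["gene_a", "gene_b", "gene_c"]
samples = [
    [2.0, 5.0, 1.0],
    [4.0, 3.0, 1.0],
    [6.0, 1.0, 1.0],
]


def select_features(rows, keep):
    """Feature selection: return a subset of the original columns, unchanged."""
    return [[row[j] for j in keep] for row in rows]


def extract_features(rows):
    """Feature extraction: build new features as functions of the originals.

    The two derived features are the row mean and the difference between
    the first two columns (an arbitrary choice for this sketch).
    """
    return [[sum(row) / len(row), row[0] - row[1]] for row in rows]


selected = select_features(samples, keep=[0, 1])
extracted = extract_features(samples)
```

The selected matrix still contains measured gene values, so a selected column can be read directly as a candidate biomarker. The extracted columns are mixtures of several genes and no longer correspond to a single one.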

Feature selection methods are broadly divided into filter, wrapper and embedded methods. Filter methods select variables independently of the model. They rely only on general characteristics such as the correlation of each variable with the variable to predict. Filter methods discard the least informative variables; the remaining variables are passed to the classification or regression model used for prediction. These methods are efficient in computation time and robust to overfitting. Wrapper methods evaluate subsets of variables, which, unlike filter approaches, makes it possible to detect interactions between variables. The two main disadvantages of wrapper methods are:
- An increased risk of overfitting when the number of observations is insufficient.
- Significant computation time when the number of variables is large.
Embedded methods have been proposed to combine the advantages of both previous approaches: the learning algorithm performs its own variable selection as part of training. It therefore needs to know in advance what a good selection is, which limits its applicability. Many filter and wrapper based methods have been applied to feature selection, such as tabu search, random forest, mutual information, entropy based methods, regularized least squares and support vector machines.
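As an illustration of the filter idea, the following sketch ranks features by the absolute Pearson correlation between each feature and the class label and keeps the top k. The function names `pearson` and `filter_select` and the toy data are assumptions made for this example; the survey does not prescribe a particular filter criterion.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def filter_select(rows, labels, k):
    """Rank features by |correlation with the label| and keep the top k."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Invented toy data: feature 0 tracks the label, feature 1 is noise,
# feature 2 is constant.
rows = [
    [1.0, 0.3, 5.0],
    [1.2, 0.9, 5.0],
    [3.9, 0.4, 5.0],
    [4.1, 0.8, 5.0],
]
labels = [0, 0, 1, 1]
top = filter_select(rows, labels, k=1)
```

No classifier is trained at any point, which is why a filter of this kind stays cheap even with thousands of genes. The price is that each feature is scored on its own, so interactions between features are not detected.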
FEATURE EXTRACTION

Feature extraction starts from an initial set of measured data and builds derived values intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps. Feature extraction is related to dimensionality reduction. When the input data to an algorithm is too large to be processed and is suspected to be redundant, it can be transformed into a reduced set of features. This process is called feature extraction. The extracted features are expected to contain the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the complete initial data.

CONVENTIONAL METHODS

Some of the conventional methods used are described below.
The Biomarker Identifier (BMI) is a filter-based feature selection method adopted to analyse gene expression data that might be used to discriminate between samples with and without cancer. BMI combines several statistical measures to assess the ability of a feature to distinguish between two data groups of interest. It considers three measures for evaluating features. First, it checks whether the distribution of a feature differs significantly between the data groups. If the distribution of a feature changes substantially, the feature might be relevant to the underlying difference between the groups. Second, the ratio of overall variance to variance in the control group is used to measure the reliability of a feature; BMI penalizes or credits the score of a feature according to this ratio. Lastly, BMI considers the discriminative power of each individual feature by incorporating the true positive rate of a logistic regression that uses the feature.
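The three ingredients described above can be computed for a single feature as in the rough sketch below. It is not the BMI scoring formula, which is not given here. It rests on three assumptions: a Welch t-statistic stands in for the test of distributional difference, a midpoint threshold classifier replaces the logistic regression to keep the example dependency-free, and the three values are reported separately rather than combined. All function names and expression values are invented.

```python
import math
from statistics import mean, variance


def welch_t(case, control):
    """Welch t-statistic, used here as a stand-in for 'distributions differ'."""
    se = math.sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / se


def variance_ratio(case, control):
    """Overall variance of the feature relative to variance in the control group."""
    return variance(case + control) / variance(control)


def true_positive_rate(case, control):
    """Fraction of case samples recovered by a one-feature classifier.

    A midpoint threshold between the group means replaces the logistic
    regression named in the text (an assumption of this sketch).
    """
    threshold = (mean(case) + mean(control)) / 2
    case_is_high = mean(case) > mean(control)
    hits = sum((x > threshold) == case_is_high for x in case)
    return hits / len(case)


# Invented expression values for one gene in cancer and control samples.
case = [5.1, 4.8, 5.6, 5.0, 2.9]
control = [3.0, 3.2, 2.8, 3.1, 2.9]

measures = {
    "t": welch_t(case, control),
    "variance_ratio": variance_ratio(case, control),
    "tpr": true_positive_rate(case, control),
}
```

In this toy example one cancer sample (2.9) looks like a control, so the true positive rate is 4 out of 5 even though the group means are well separated.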

3.5 Wrapper Based Methods

Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model.

Fig 2: Wrapper Method for feature selection.
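The wrapper procedure described above can be sketched as follows, assuming a 1-nearest-neighbour classifier as the predictive model and an exhaustive search over all non-empty subsets. Both choices, the function names and the toy data are assumptions made for illustration; practical wrappers use a search heuristic because the number of subsets grows exponentially.

```python
from itertools import combinations


def nearest_label(x, train_x, train_y, subset):
    """1-nearest-neighbour prediction using only the features in `subset`."""
    def dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in subset)

    best = min(range(len(train_x)), key=lambda i: dist(x, train_x[i]))
    return train_y[best]


def holdout_error(subset, train_x, train_y, test_x, test_y):
    """Use the training split as the model, count mistakes on the hold-out split."""
    mistakes = sum(
        nearest_label(x, train_x, train_y, subset) != y
        for x, y in zip(test_x, test_y)
    )
    return mistakes / len(test_x)


def wrapper_select(train_x, train_y, test_x, test_y):
    """Score every non-empty feature subset and return the best one.

    Subsets are visited in order of increasing size and only a strictly
    lower error replaces the current best, so ties favour smaller subsets.
    """
    n_features = len(train_x[0])
    best_subset, best_error = None, float("inf")
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            error = holdout_error(subset, train_x, train_y, test_x, test_y)
            if error < best_error:
                best_subset, best_error = subset, error
    return best_subset, best_error


# Invented toy data: feature 0 separates the classes, feature 1 is misleading.
train_x = [[1.0, 9.0], [1.5, 1.0], [4.0, 8.5], [4.5, 1.5]]
train_y = [0, 0, 1, 1]
test_x = [[1.2, 1.2], [4.2, 9.2]]
test_y = [0, 1]
best_subset, best_error = wrapper_select(train_x, train_y, test_x, test_y)
```

Every candidate subset triggers a fresh evaluation of the model, which is the source of the computational cost noted in the text.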

3.5.1 Binary Particle Swarm Optimization (BPSO)

Particle swarm optimization (PSO) is a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. A binary version of particle swarm optimization (BPSO) was proposed for optimization problems with discrete and binary variables, which PSO cannot handle well. In BPSO, an initial population of particles is generated with random positions and velocities. The position of particle i, Pi = (Pi1, Pi2, ..., Pin), represents a potential solution of the optimization problem with n dimensions. Pij is a binary value, 1 or 0, which indicates whether the corresponding feature is selected or not. The velocity is represented by Vi = (Vi1, Vi2, ..., Vin). After the sigmoid transformation, Vij represents the probability of bit Pij taking the value 1 (the feature being selected), and it is limited by the maximum velocity parameter Vmax. In each generation a particle is updated by following its personal best position, Pbest, and the global best position of the population, Gbest, according to the following two equations:

where c1 and c2 are the acceleration coefficients and Pij is the jth element of the n-dimensional vector Pi. rand() produces a random number drawn from the uniform distribution on [0, 1]. A second random number is selected from the uniform distribution on [0, 1] as well. The function sig(Vij) is a sigmoid limiting transformation function.
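The following sketch assumes the standard BPSO updates, which match the symbols defined above. The velocity becomes Vij + c1·rand()·(Pbest_ij − Pij) + c2·rand()·(Gbest_j − Pij), clipped to [−Vmax, Vmax], and Pij is set to 1 when a uniform random number is below sig(Vij) and to 0 otherwise. The absence of an inertia weight, the parameter defaults and the fitness function `toy_fitness` are assumptions of this example. In a feature selection setting the fitness would typically be the accuracy of a classifier trained on the selected features.

```python
import math
import random


def sig(v):
    """Sigmoid limiting transformation."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso(fitness, n, n_particles=10, n_iter=30, c1=2.0, c2=2.0, v_max=4.0, seed=0):
    """Maximise `fitness` over binary vectors of length n."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n_particles)]
    vel = [[rng.uniform(-v_max, v_max) for _ in range(n)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(n):
                # Velocity update: pull towards Pbest and Gbest.
                v = (vel[i][j]
                     + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                     + c2 * rng.random() * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-v_max, min(v_max, v))  # limit by Vmax
                # Position update: the bit is 1 with probability sig(Vij).
                pos[i][j] = 1 if rng.random() < sig(vel[i][j]) else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


def toy_fitness(bits):
    """Hypothetical fitness: reward features 0 and 2, penalise every other one."""
    useful = bits[0] + bits[2]
    return useful - 0.5 * (sum(bits) - useful)


best_bits, best_fit = bpso(toy_fitness, n=6)
```

Clipping the velocity keeps sig(Vij) away from 0 and 1, so every bit retains some chance of flipping and the swarm does not freeze on an early solution.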

CLASSIFIERS

K nearest neighbor (KNN) classifies a testing sample according to its K nearest neighbors among the training samples with known class labels. The sample is assigned to the class that has the largest number of neighbors. Fuzzy K nearest neighbor (FKNN) extends traditional KNN by introducing a fuzzy membership function and a distance weight. The fuzzy membership can be used to estimate the confidence level for each class, and the weight raises the distance from the testing sample to each of its K nearest neighbors to a certain power. The membership value ui(x) to class i is calculated by the following formula:

where k is the number of neighbors used and m is the fuzzifier variable which determines how the membership varies with distance.
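A minimal sketch of fuzzy KNN follows. It assumes a commonly used form of the membership, in which each of the k nearest neighbors contributes to its class with weight 1 / d^(2/(m−1)), where d is its distance to the testing sample, and the weights are normalised to sum to one. Crisp training labels, Euclidean distance, the small constant `eps` guarding against zero distance and the toy data are further assumptions of this example. The fuzzifier m must be greater than 1.

```python
import math


def fuzzy_knn(x, train_x, train_y, k=3, m=2.0, eps=1e-9):
    """Return {class: membership} for sample x.

    Assumes inverse-distance weights 1 / d**(2 / (m - 1)) and crisp training
    memberships (1 for the neighbour's own class, 0 otherwise).
    """
    # Keep the k training samples closest to x (Euclidean distance).
    nearest = sorted(
        (math.dist(x, tx), ty) for tx, ty in zip(train_x, train_y)
    )[:k]
    weights = [
        (1.0 / max(d, eps) ** (2.0 / (m - 1.0)), label) for d, label in nearest
    ]
    total = sum(w for w, _ in weights)
    memberships = {label: 0.0 for label in set(train_y)}
    for w, label in weights:
        memberships[label] += w / total
    return memberships


def classify(memberships):
    """Assign the class with the highest membership."""
    return max(memberships, key=memberships.get)


# Invented two-class toy data.
train_x = [[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]]
train_y = ["normal", "normal", "tumour", "tumour"]
u = fuzzy_knn([1.1, 1.0], train_x, train_y, k=3)
predicted = classify(u)
```

With m = 2 the weights reduce to 1 / d², so the two close "normal" neighbors dominate the single distant "tumour" neighbor. Plain KNN would give the same label here but without the graded confidence.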

CONCLUSIONS

Biomarker identification is one of the major research areas in the medical domain. It is a challenging task because of the high dimensionality (thousands of genes) and the small number of samples. A few biomarker genes related to each type of cancer are identified from frequency analysis over multiple runs. First, dimensionality reduction is used to transform a large dataset into a reduced set of features. Combining filter and wrapper approaches then improves computational efficiency and robustness against overfitting. These methods can be applied to feature selection tasks in other research fields.