Support Vector Machine based Prediction of Transcription Factor Binding

Smitha C S; Dr. Afzal A L

doi:10.17577/IJERTV10IS030066

Volume 10, Issue 03 (March 2021)

Open Access
Authors : Smitha C S , Dr. Afzal A L
Paper ID : IJERTV10IS030066
Volume & Issue : Volume 10, Issue 03 (March 2021)
Published (First Online): 19-03-2021
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Smitha C S

Assistant Professor in Computer Science & Engineering College of Engineering Muttathara Thiruvananthapuram, Kerala

Dr. Afzal A L

Assistant Professor in Computer Science & Engineering College of Engineering Muttathara

Abstract: The functioning of any living organism, whether prokaryotic or eukaryotic, depends on gene regulation. Protein synthesis is the primary concern of gene regulation and is controlled through transcription factor binding sites. Genetic behavior is governed largely by the interaction of synthesized proteins with regions of DNA. Identifying exact binding sites is very difficult, and hence computational methods are used; the computationally identified binding sites are then validated using experimental methods. In this work, features of the motifs are identified, and the most significant features are ranked and selected using Principal Component Analysis. Support Vector Machine based classification improves the accuracy of the resulting predictions.

Keywords: RBF kernel; Principal Component Analysis; Transcription Factor

INTRODUCTION

The cell is the smallest unit of life and the building block of living things. A cell contains several regulatory elements and a nucleus, which houses the genes. Inside the nucleus are chromosomes, which carry the genes in a coiled form [1]. Genes are the fundamental units of hereditary information and the medium through which basic traits are passed from an organism to its successor. The study of the genome is vital for understanding genetic behavior, which in turn supports the detection and cure of many diseases. Analyzing and predicting from such large volumes of data experimentally is very difficult, and hence self-organizing algorithms are used for data visualization in genes. The genetic details are coded in Deoxyribonucleic Acid (DNA), whose nucleotides are the prime focus of this work. DNA comprises four nucleotides, Adenine, Thymine, Cytosine and Guanine, abbreviated as A, T, C and G. DNA is structured as a double-stranded coil in the form of a helix. To form the basic structure of the helix, Adenine pairs with Thymine through two hydrogen bonds, whereas Cytosine pairs with Guanine through three hydrogen bonds. The Cytosine-Guanine pair is therefore comparatively stronger than the Adenine-Thymine pair.
Many proteins are synthesized, and some of them may re-enter the nucleus to interact with specific portions of DNA. These proteins are known as Transcription Factors (TF). The portions of DNA that interact with Transcription Factors are known as Transcription Factor Binding Sites (TFBS), or binding sites for short [3]. Binding sites are specific to each organism and play a major role in gene regulation.

This paper gives an insight into the prediction of DNA-protein interaction. The computational prediction focuses on feature extraction, analysis of the extracted features, and verification using a Support Vector Machine classifier. Section 2 lists related work, Section 3 presents the proposed system, and Section 4 reports the performance assessment. Section 5 gives the future scope and applications, and Section 6 concludes the paper.
RELATED WORK

Binding prediction algorithms fall into four main categories: enumerative, iterative, phylogenetic footprinting and content-based algorithms. Enumerative algorithms use a background model for binding prediction, whereas the iterative approach uses the probabilistic technique of Expectation Maximization. The consensus of binding sites can be annotated by a set of frequently occurring subsequences. Phylogenetic footprinting uses a set of known binding sites from evolutionarily related species; ortholog and homolog modeling can be used to obtain the relevant details, including the regulatory elements. Content-based algorithms use a divide-and-conquer approach: a sequence of sufficient length is divided into subsequences that are long enough for calculation, and the regularities of the subsequences are used for prediction [4]. The chance of error when relying on a single algorithm is very large, and hence a class of twelve algorithms is used for accurate prediction. However, this may add to the overall load.

Amino acids and nucleotides are represented using the standard IUPAC notation, which covers all DNA and RNA nucleotides as well as the twenty amino acids. The prediction is based on both nucleotides and amino acids, so both are analyzed. Feature identification is based on the regularity of patterns, which produces a remarkable number of distinct features. Due to the invariant size of the data, the identified features are of high dimensionality, which may result in incorrect predictions in the majority of cases. This requires a well-defined and thorough analysis of the identified binding sites. To reduce this tedious work in high dimensions, techniques such as wrappers, filters and embedded methods are commonly used. These employ additional parameters such as F-score and correlation coefficient for further analysis and evaluation.

Most binding predictions can be made from a similarity matrix, also called a Position Specific Scoring Matrix (PSSM) [5]. An enumerative approach is used for feature extraction, with adequate data taken with the help of background subsequences. The similarity matrix can be built from DNA nucleotides or from amino acids, which results in matrices of different dimensions. From the matrix, a consensus can be derived, which can be used to find the Information Content [6]. The similarity score can also be used to reveal phylogenetic content.
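
The matrix, consensus and Information Content described above can be illustrated with a minimal sketch. The example binding sites are made up, the function names are hypothetical, and the per-column formula 2 + sum(p * log2(p)) is the common definition for a four-letter DNA alphabet with a uniform background; the paper does not state which variant it uses.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def position_frequency_matrix(sites):
    """Return one {base: relative frequency} dict per alignment column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all binding sites must have the same length")
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({b: counts[b] / len(sites) for b in ALPHABET})
    return matrix

def consensus(matrix):
    """Pick the most frequent base in each column (ties go to A, C, G, T order)."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in matrix)

def information_content(matrix):
    """Per-column information content in bits, assuming a uniform background."""
    return [2.0 + sum(p * math.log2(p) for p in col.values() if p > 0)
            for col in matrix]

# Made-up aligned sites, used only to exercise the functions.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pfm = position_frequency_matrix(sites)
print(consensus(pfm))
print([round(ic, 3) for ic in information_content(pfm)])
```

Fully conserved columns reach the maximum of 2 bits, while columns with mixed bases score lower, which is why the Information Content can serve as a feature for later training.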

The prediction of binding sites can be based on protein-DNA interaction residues [7]. Details about amino acids can be obtained from the Protein Data Bank. The approach uses a set of known binding sites collected from databases such as JASPAR and TRANSFAC. The technique relies mainly on experimentally determined binding sites, obtained by techniques such as DNase footprinting and ChIP [8]. The approach has the disadvantage of tedious data collection, verification and validation.
The Information Content obtained from the similarity matrix can be used in the training and testing phases of machine learning. A Neural Network (NN) can be trained for prediction and then evaluated on a testing set [10]. The PDNA-62 data set can be used for the prediction. Similarity can be assessed by BLAST values, and BLAST hits can be collected from protein databases such as UniProt. The approach requires a large amount of training data to extract sufficient details for training. The number of hidden layers must be chosen carefully for better evaluation of the prediction.

Genetic algorithm based prediction is another technique, which applies the principles of genetic algorithms [11]. The operations used are initialization, evaluation, selection, crossover and mutation. Initialization selects the sequences for analysis. The binding prediction is then made through crossover and mutation of selected subsequences. The genetic operations cannot be applied to all eukaryotic and prokaryotic binding predictions.
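
The five operations listed above can be sketched as a toy loop. This is only an illustration under stated assumptions: the target motif, the match-count fitness function, the population size and the mutation rate are all hypothetical and are not taken from [11] or from this paper.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover of two equal-length sequences."""
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(sequence, rate, rng):
    """Replace each base by a randomly drawn base with probability `rate`."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else base
                   for base in sequence)

def evolve(population, fitness, generations, rate, rng):
    """Evaluate, select the fitter half, then refill by crossover and mutation."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)   # evaluation
        parents = ranked[:len(ranked) // 2]                      # selection
        children = []
        while len(parents) + len(children) < len(population):
            a, b = rng.sample(parents, 2)
            child, _other = crossover(a, b, rng)
            children.append(mutate(child, rate, rng))
        population = parents + children
    return max(population, key=fitness)

rng = random.Random(0)
target = "TATAAT"  # hypothetical motif used only to define a toy fitness

def fitness(seq):
    """Toy fitness: number of positions that match the hypothetical target."""
    return sum(a == b for a, b in zip(seq, target))

# Initialization: 20 random candidate sequences of the motif length.
population = ["".join(rng.choice("ACGT") for _ in range(len(target)))
              for _ in range(20)]
best = evolve(population, fitness, generations=30, rate=0.05, rng=rng)
print(best, fitness(best))
```

Because the fitter half of each generation is carried over unchanged, the best fitness in this sketch never decreases from one generation to the next.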
PROPOSED SYSTEM

The proposed method is a computational approach to transcription factor binding identification using a Support Vector Machine. The stepwise methodology, described in Figure 3.1, is as follows. First, the dataset obtained from the NCBI database is preprocessed to create fixed-length sequences. Second, the Transcription Factor details are obtained from the UniProt database. Third, features such as binding affinities and oligonucleotide properties are extracted. Fourth, Principal Component Analysis is performed to filter out the features that do not play a major role in prediction. Finally, classification is performed using a Support Vector Machine (SVM) and the performance of the classifier is assessed.
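
As a rough illustration of the first and third steps only, the sketch below cuts a sequence into fixed-length windows and computes oligonucleotide (k-mer) frequencies for each window. The window width, step, choice of k and function names are assumptions made for the example; the paper does not specify them, and the binding affinity features, PCA and SVM stages are not shown.

```python
from itertools import product

def fixed_length_windows(sequence, width, step=1):
    """Split a DNA sequence into overlapping windows of equal width."""
    sequence = sequence.upper()
    return [sequence[i:i + width]
            for i in range(0, len(sequence) - width + 1, step)]

def kmer_features(window, k=2):
    """Relative frequency of every possible k-mer, in a fixed A/C/G/T order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

# Made-up sequence, used only to exercise the functions.
windows = fixed_length_windows("acgtacgtag", width=8, step=1)
features = [kmer_features(w, k=2) for w in windows]
print(windows)
print(len(features[0]))
```

Every window yields a vector of the same length (16 for dinucleotides), so the vectors can be stacked into a matrix and passed on to dimensionality reduction and classification.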

Figure 3.1: Proposed Method for binding prediction
PERFORMANCE ASSESSMENT

A binary classifier maps input instances to one of two predicted classes. The prediction is Yes or No, indicating the positive and negative classes respectively. Each instance has an actual class and a predicted class, so there are four possible outcomes for a binary classifier: True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). If both the predicted and the actual values are positive, the outcome is a TP; if the predicted value is positive for an actual negative, it is an FP. Similarly, if the predicted value is negative for an actual positive, it is an FN, and if the predicted value is negative for an actual negative, it is a TN.

Accuracy is the proportion of the total number of predictions that were correct.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (4)

Sensitivity is the proportion of binding sites correctly predicted as binding sites.

Sensitivity = TP / (TP + FN) (5)

Specificity is the proportion of nonbinding sites correctly predicted as nonbinding sites.

Specificity = TN / (TN + FP) (6)
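
Equations (4) to (6) can be checked on a small example. The sketch below assumes labels coded as 1 for binding and 0 for nonbinding, and the label vectors are made up for illustration; they are not results from the paper.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for labels coded 1 (binding) and 0 (nonbinding)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # equation (4)

def sensitivity(tp, fn):
    return tp / (tp + fn)                    # equation (5)

def specificity(tn, fp):
    return tn / (tn + fp)                    # equation (6)

# Made-up labels; both classes must be present to avoid division by zero.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)
print(accuracy(tp, tn, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```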

The Receiver Operating Characteristic (ROC) curve is a technique for visualizing and analyzing a classifier's performance, and it is widely used as a performance graphing method for machine learning classifiers. It plots the True Positive rate on the Y-axis against the False Positive rate on the X-axis, that is, Sensitivity against 1 - Specificity.

Fig. 4.1: ROC curve of SVM classifier

Fig. 4.1 shows the ROC graph. The Area Under the Curve (AUC) is greater than 0.5, the value expected of a random classifier, which indicates that the classifier performs better than chance.
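
For readers who want to reproduce this kind of plot, the following sketch builds ROC points by sweeping a decision threshold over classifier scores and computes the area with the trapezoidal rule. The labels and scores are invented for illustration (the scores stand in for SVM decision values) and are not the data behind Fig. 4.1.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels (1 = binding) and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(points)
print(round(auc(points), 4))
```

An AUC of 0.5 corresponds to the diagonal of the plot, and values closer to 1 indicate that positive examples are more consistently scored above negative ones.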

Fig. 4.2: Comparison with existing techniques

Fig. 4.2 compares the proposed prediction with some of the existing techniques: PSSM based, Amino Acid (AA) based and biological (biochemical) feature based. The proposed model outperforms the existing techniques in this comparison.
FUTURE SCOPE AND APPLICATIONS

Modeling the interaction of multiple Transcription Factors as a complex Transcription Factor binding model is a direction for future work. Such a model acts as a regulatory network for determining binding locations and the hereditary traits of an organism. It can also serve as a means of detecting mutations in genetic data, which could support the early diagnosis of tumors.

CONCLUSION

Transcription Factor binding prediction is one of the most important tasks in the study of gene regulation. The prediction is highly complicated because of the large size and ambiguity of the data. Computational techniques help predict binding sites, which can then be experimentally validated. A group of five data sets was selected for the study. The proposed method, using feature extraction and feature selection, was analyzed with a Support Vector Machine classifier. Performance was compared using accuracy, sensitivity, specificity and the ROC graph. The proposed system was found to perform better than the existing systems.

REFERENCES

[1] Stephane M. Nelson, Lynnette R. Furguson and William A. Denny, DNA and the Chromosome: Varied Targets for Chemotherapy, BioMedCentral, Cell & Chromosome, 2004, doi:10.1186/1475-9268-3-2.
[2] Sankar Subramanian and Sudhir Kumar, Neutral Substitutions occur at a Faster Rate in Exons than in Noncoding DNA in Primate Genomes, Genome Research, vol. 13, pp. 834-844, 2003.
[3] N. M. Luscombe and J. M. Thornton, Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity, Journal of Molecular Biology, July 2002.
[4] T. T. Nguyen and I. P. Androulakis, Recent advances in the computational discovery of transcription factor binding sites, Algorithms, pp. 582-602, September 2009.
[5] D. T. Jones, Protein secondary structure prediction based on position specific scoring matrices, Molecular Biology, no. 292, pp. 195-202, 1999.
[6] G. D. Stormo, DNA binding sites: Representation and discovery, Bioinformatics, vol. 16, no. 1, pp. 16-23, 2001.
[7] S. Ahmad and A. Sarai, PSSM based prediction of DNA binding sites in proteins, BMC Bioinformatics, February 2005.
[8] G. D. Stormo, DNA binding sites: Representation and discovery, Bioinformatics, vol. 16, no. 1, pp. 16-23, 2001.
[9] Nanjiang Shu, Tuping Zhou and Sven Hovmoller, Prediction of Zinc-Binding Sites in Proteins from Sequence, Bioinformatics, vol. 24, no. 6, pp. 775-782, 2008.
[10] V. Brusic, G. Rudy, G. Honeyman, J. Hammer and L. Harrison, Prediction of MHC Class II Binding Peptides using an Evolutionary Algorithm and Artificial Neural Network, Bioinformatics, pp. 121-130, 1998.
[11] X. Chang, W. Zhou, C. Zhou and Y. Liang, Prediction of transcription factor binding sites using genetic algorithm, in IEEE Computational Systems Bioinformatics Conference, 2006.
[12] J. E. Jackson, A User's Guide to Principal Components, John Wiley and Sons, 1991.
[13] C. Cortes and V. Vapnik, Support vector networks, Machine Learning, vol. 20, pp. 273-297, September 1995.
