Application of Support Vector Machine in Prediction Secondary Structure Protein

Nahla .I.Jabbar; Rafah Ibraheem Jabbar

doi:10.17577/IJERTV8IS100286

Volume 08, Issue 10 (October 2019)

Application of Support Vector Machine in Prediction Secondary Structure Protein

DOI : 10.17577/IJERTV8IS100286

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 245
Authors : Nahla .I.Jabbar, Rafah Ibraheem Jabbar
Paper ID : IJERTV8IS100286
Volume & Issue : Volume 08, Issue 10 (October 2019)
Published (First Online): 05-11-2019
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Application of Support Vector Machine in Prediction Secondary Structure Protein

Nahla .I.Jabbar Rafah Ibraheem Jabbar

Chemical Engineering Metallurgical Engineering

Babylon University. Babylon University

Babylon, Iraq Babylon, Iraq

Abstract This paper studying the predication of secondary structure protein from primary structure protein using support vector machine (SVM).We classify 64 types of proteins in three types : Helices(H) ,Strand(E) and Coil ( C ).In our SVM ,we use a Gaussian kernel with parameter =0.1 and costing parameter c between [0.1,5].The results of support vector machine have different varying accuracies of three types of proteins .

Keywords Support Vector Machine; Amino Acid Sequences and Secondary Protein Structure

INTRODUCTION

Support Vector Machine (SVM) was known in 1992, introduced by Boser, Guyon, and Vapnik[1]. Support vector machines (SVMs) are a set of related supervised learning methods used for classification and prediction tool. It have been successfully applied in a variety of biological application for example in automatic classification of microarray gene expression profiles [2]. It is also being used for many applications, such as hand writing analysis, face analysis. SVMs were developed to solve many problems but recently they have been extended to solve bioinformatic application .Support vector machine have been successfully applied in a prediction problem for example in protein secondary structure Protein structure prediction by using bioinformatics can involve sequence similarity searches, multiple sequence alignments, identification and characterization of domains, secondary structure prediction, A central part of a typical protein structure prediction is the identification of a suitable structural target from which to extrapolate three-dimensional information for a query sequence .The main objective of this paper is to predict the secondary structure of protein using support vector machine

.This will be done by measuring the performance of the algorithm using their accuracies of prediction performance. All experiments are done by matlab software

SUPPORT VECTOR MACHINE

Support Vector Machine or SVM is related to Structural Risk Minimization (SRM). At the first place, SVM is an initial form for binary classification, but now it can also be used for multiclass classification. SVM method does mapping form input space to a higher dimensional space to support nonlinear classification problems where a maximal separating hyperplane is constructed. Hyperplane is a linear pattern whose maximum margin gives the maximum separation between the decision classes[3]. given data set

where N is known as the number of sa

mples,

mples,

is known as feature vectors from sample-, where D is the number of feature (dimension), and is known as class labels. For two class classification problem {1, +1}, however for the multiclass classification problem 1,2,…, where is the number of class. The main purpose of SVM is to find the best hyperplane:

Fig. 1. SVM is trying to find the best hyperplane that separates the two classes, class A and B

Distance of closest point on hyperplane to origin can be found by maximizing the x as x is on the hyper plane. Similarly for the other side points we have a similar .This solving and subtracting the two distances we get the summed distance from the separating hyperplane to nearest points. Maximum Margin = M = 2 / ||w||
PROTEIN STRUCTURE PREDICATION

Proteins are made of simple building blocks called amino acids. There are 20 different amino acids that can occur in proteins. Their names are abbreviated in a three-letter code or a one letter code. The amino acids and their letter codes are given in Table [1]

Glycine	Gly	G	Tyrosine	Try	Y
Alanine	Ala	A	Methionine	Mer	M
Serine	ser	S	Tryptophan	Trp	W
Threonine	Thr	T	Asparagine	Asn	N

Cysteine	Cys	C	Glutamine	Gln	Q
Valine	Val	V	Histidine	His	D
Isoleucine	Ile	I	Aspartic Acid	Asp	D
Leucine	Leu	L	Glutamic Acid	Glu	E
Proline	Pro	P	Lysine	Lys	K
Phenylalanine	Phe	F	Arginine	Arg	R

Cysteine	Cys	C	Glutamine	Gln	Q
Valine	Val	V	Histidine	His	D
Isoleucine	Ile	I	Aspartic Acid	Asp	D
Leucine	Leu	L	Glutamic Acid	Glu	E
Proline	Pro	P	Lysine	Lys	K
Phenylalanine	Phe	F	Arginine	Arg	R

sequences do not show any similarities with the sequences of proteins in the database.

IV.METHODLOGY OF THE WORK

We can represent this work by flowing diagram

Data set

Input

SVM

Evaluation

Results

Evaluation

Results

coding

III PROTEIN STRUCTURE

There are four different structure types of proteins, namely Primary, Secondary, Tertiary and Quartenary structures. Primary structure refers to the amino acid sequence of a protein. It provides the foundation of all the other types of structures. Secondary structure refers to the arrangement of connections within the amino acid groups to form local structures. helix, . strand are some examples of structures that form the local structure. Tertiary structure is the three dimensional folding of secondary structures of a polypeptide chain. Quartenary structure is formed from interactions of several independent polypeptide chains. The four structures of proteins are shown in Figure (2)

Fig. 2. Protein structure

There exists a relationship between protein structure and function. Proteins with similar sequences and structures have similar functions. Moreover, similar sequences in proteins imply that they also have similar structures. However, similar structures in proteins may have different sequences and different functions .The primary structure of proteins can be used to predict its tertiary structure. It is through the tertiary structure of the protein that we can derive its properties as well as how they function in an organism. Secondary structure prediction means predicting the secondary structure o a protein from its primary sequence. It is important because knowledge of secondary structure helps in the prediction of tertiary structure. This is very interesting for proteins whose

Data set

The data are collected from Data bank ,it includes 62 types of proteins he data to be used consists of 62 proteins from Rost and Sander (1983) database available from [5] It contains a protein name, its primary and secondary sequences. The data are defined in rows :protein names and amino acid sequence for example

Acprotease GVGTVPMTDYGNDVEYYGQVTIGTPGKSFNLNFDTGSSNLW VGSVQCQASGCKGGRDKFNPSDGSTFKATGYDASIGYGDGSA SGVLGYDTVQVGGIDVTGGPQIQLAQRLGGGGFPGDNDGLLG LGFDTLSITPQSSTNAFQDVSAQGKVIQPVFVVYLAASNISDG DFTMPGWIDNKYGGTLLNTNIDAGEGYWALNVTGATADST YLGAIFQAILDTGTSLLILPDEAAVGNLVGFAGAQDAALGGFV IACTSAGFKSIPWSIYSAIFEIITALGNAEDDSGCTSGIGASSLG EAILGDQFLKQQYVVFDRDNGIRLAPVA.

For the two methods, the same partitioning of the data into training set, validation set and test set was used. 10-fold cross validation as described in was used; the data was randomly divided into 10 parts, one used as a test set and the rest for training. However, training data and used as validation set. The window size was fixed at 13.

Input coding

The data of the 20 amino acid residues into letters .The purpose of input coding was converted these letters into numbers, thats coding are done by orthogonal method in Figure(3) and Table[2].

Fig.3. A sliding window of length 7

Table 2

K	L	N	T	D	E	T	G	A	C	P	Q	A	C	Y	A
0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	1
0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0
0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

The orthogonal coding for each residue. The function of input coding the letters for each sequence and the outputs are the unit vectors (orthogonal coding) for each residue in table[3].

A	C	D	E	F	G	H	I	K	L	M	N	P/p>	Q	R	S	T	V	W	Y
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Table 3

1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1

Therefore, the orthogonal coding for a sequence KLNTDETGACPQACYA, is given in Table [1]From this table a unique binary vector is assigned for each residue.

Support Vector Machine (SVM)

The aim of this script is to produce the prediction overall training data and on test data using Support Vector Machines. In this approach, the following six binary classifiers are created :H/ H, E/ E, C/ C, H/E, E/C and C/H were constructed as in SVM. Kernel parameter was fixed for =

0.1 [24]. The cost parameter C was however varied over the following values: 0.1,0.3,0.5,0.7,0.9,1,5. The best results are in window size 13
Evaluation of result

The objective of this step is to compare the performance Support Vector Machines. This approach was applied on 62 globular proteins. The results of multi class SVM show that H/ H has the highest accuracy while C/ C has the lowest prediction accuracy

Table 5 Protein secondary structure prediction with Support Vector Machines

Quality index

% Accuracy

Overall accuracy

73.5

H/ H

79.36

E/ E

79.15

C/ C

67.10

H/E

72.02

E/C

72.10

C/H

74.66

The effect of variations in cost parameter on prediction accuracy and the relationship between the values of cost parameters and time required for training are also discussed. Figure(4) shows different cost parameters and the time taken to train the six binary classifiers. Positive relationship exists between time and the cost parameters because an increase in the number of cost parameters increases the time required to train the classifiers. The longest time it to train a classifier was about 57 minutes, and that occurred at C = 5 for C/ C. The shortest time to train a classifier was about 13 minutes, which occurred at C = 0.1 for H/E. Furthermore, training time increases rapidly between C = 1

Fig.5. Training time vs. cost parameters in Support Vector Machines

CONCLUSION

Protein structure prediction is an important step towards predicting the tertiary structure of proteins. The reason is that knowing the tertiary structure of proteins can help to determine their functions. The main aim of this paper was to compare performance of Support Vector Machines in predicting the secondary structure of proteins from their amino acid sequences. The following conclusions were derived:
1. This approach created six binary classifiers. The results are obtained from the same window length of 13
2. The experimental shows that increasing the size of the training data set
  
  improves the performance significantly
3. Choosing window size affected in the results . Choosing an appropriate window length also helps to improve the performance.

REFERENCES

Wikipedia Online. Http://en.wikipedia.org/wiki
Lipo Wang, Support Vector Machines: Theory and Applications, Springer.
Pongsametrey Sok and Nguonly Taing Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set Recognition

APSIPA 2014
https://www.khanacademy.org/science/biology/macromolecules/protein s-and-amino-acids/a/orders-of-protein-structure.
http://antheprot-pbil.ibcp.fr.

Quality index	% Accuracy
Overall accuracy	73.5

H/ H	79.36
E/ E	79.15
C/ C	67.10
H/E	72.02
E/C	72.10
C/H	74.66

Application of Support Vector Machine in Prediction Secondary Structure Protein

Leave a Reply