Computational Methods for Protein Structure Prediction

DOI : 10.17577/IJERTV14IS050374

Download Full-Text PDF Cite this Publication

Text Only Version

Computational Methods for Protein Structure Prediction

Gourikrishna O S

Postgraduate Department of Computer Science Prajyoti Niketan College,

Thrissur, India

Maidhili Mohan K

Postgraduate Department of Computer Science Prajyoti Niketan College,

Thrissur, India

Abstract It is one of the hardest tasks in biology to predict the structure of a protein from its amino acid sequence. The shape of a protein dictates how it works, interacts with other molecules, and contributes to health and disease. Historically, to resolve protein structures, researchers have used expensive and time- consuming laboratory techniques like X-ray crystallography and NMR spectroscopy. To get around these limitations, researchers have developed computational techniques, such as homology modeling, which compares proteins to ones with known structures, and ab initio approaches, which create predictions from scratch. In the past few years, artificial intelligencemost notably deep learning models such as AlphaFoldhas revolutionized the process by being able to make correct predictions more quickly and efficaciously than ever before. It being easier to determine protein structures unlocks new avenues for drug discovery, disease investigation, and biotechnology. Traditional experimental methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy are precise but expensive and time-consuming. The sheer growth in known protein sequences due to high-throughput sequencing has frequently outgrown the experimentally known structures. This difference highlights the need for reliable protein structure prediction by computational techniques. Deep learning, particularly the AlphaFold2 model, significantly improves prediction accuracy. These models use evolutionary information and neural network designs to provide high-resolution structural predictions, offering a viable alternative to experimental methods and advancing the field of structural biology.

Keywords Protein structure prediction, AlphaFold, Deep learning, Homology modeling, Ab initio prediction, Structural bioinformatics, Protein folding, Artificial intelligence, CASP, Cryo-EM

  1. INTRODUCTION

    Proteins form the basis of almost every biological process. Among other functions, they catalyze the metabolic processes (enzymes), serve as structural support (collagen), transport molecules (hemoglobin), and control cellular functions (hormones and receptors). Its three-dimensional structure, as defined by its sequence of amino acids, is what a protein's function is inherently based upon. This premise, commonly known as the sequence-structure-function paradigm, is the essence of molecular biology.

    Protein structure determination has conventionally depended on experimental methods such as X-ray crystallography, which is still the standard; nuclear magnetic resonance (NMR) spectroscopy for small- to medium-sized proteins; and cryo- electron microscopy (cryo-EM) for large macromolecular assemblies. Although they are highly accurate, these approaches are time-consuming, labor-intensive, and

    frequently restricted by the quality of available crystals or the dynamical character of the target proteins.

    This constraint has spurred the creation of computational approaches to predict protein structures from amino acid sequences alone, an effort sometimes called the "protein folding problem." The magnitude of this task cannot be underestimated: the Protein Data Bank (PDB) contains experimentally determined structures for a minority of known protein sequences cataloged in databases like UniProt. This gap emphasizes the need for sound, scalable, and rapid predictive tools. Computational methods of protein structure prediction are divided into a number of categories:

    • Homology modeling, or comparative modeling, which is based on sequence resemblance to known structures.

    • Threading or fold recognition, which tries to thread sequences into known structural folds.

    • Ab initio or de novo approaches, which try to predict structures from physical principles without the use of templates.

      Machine learning-based approaches, specifically deep learning models such as AlphaFold, which learns patterns of structure from large datasets. All of these methods have advantages and disadvantages, and they vary in efficiency based on aspects such as available homologous sequences, protein length, and structural diversity. This paper discusses all these methods, compares their accuracy and computational demand, and discusses the recent advances that have made protein structure prediction a feasible tool for scientists across the globe.

  2. LITERATURE SURVEY

    The search for protein structure prediction computationally has come a long way since the last half-century. The first set of methods targeted secondary structure prediction based on statistical propensities of the amino acids. The Chou-Fasman algorithm [1] and the GOR method [2] were among the initial sets of algorithms to implement residue propensities and information theory for the prediction of helices, sheets, and coils. While not very accurate, they set the platform for contemporary approaches. With the expansion of protein sequence databases and the creation of the Protein Data Bank (PDB), homology modeling became a prevalent approach. Methods such as those applied in SWISS-MODEL [3], Modeller [4], and I-TASSER [5] superimpose a target sequence onto a known structure (template) and build the target structure by analogy. Homology modeling has been found to generate models with root-mean-square deviation (RMSD) of 1-2 Ã… from experimental structures if the sequence identity is greater than 30% [6]. Threading approaches started

    gaining recognition when it was seen that proteins with extremely low sequence identity still could fold into similar structures. Structural alignment techniques and statistical potentials are applied by tools such as Phyre2 [7], HHpred [8], and SPARKS-X [9] to evaluate target sequence fit into current structural folds. These approaches are especially useful for orphan proteins without sequence database homologs.

    Ab initio prediction methods, such as the popular Rosetta [10] and QUARK [11], try to fold proteins de novo from energy minimization and fragment assembly. These are computationally intensive and rely on sampling strategies. Although ab initio methods fall short on big proteins, they have been crucial in modeling small, new proteins and investigating folding mechanisms. The integration of machine learning into PSP introduced a new sophistication. These early models predicted secondary structure and contact maps using neural networks.

    These gave rise to deep learning models that could incorporate evolutionary, spatial, and physicochemical information. AlphaFold from DeepMind is a culmination of this development. AlphaFold2, launched in 2020, beat every other method in the 14th Critical Assessment of Structure Prediction (CASP14), obtaining GDT_TS scores higher than 90 for most targets [12].

    Some other important contributions are RosettaFold [13], which merges deep learning with the energy function of Rosetta, and ESMFold [14], which utilizes protein language models. These have extended structure prediction to

    encompass protein complexes and dynamic states.

    This review of the literature synthesizes evidence from more than 25 peer-reviewed publications and benchmarks, as well as CASP competition results, to give an overall pictureof the history, status, and future direction of protein structure prediction. Table 1 is a comparative literature table capturing key computational methods in protein structure prediction.

  3. TYPES OF PROTEIN STRUCTURE PREDICTION

    Protein structure prediction methods can be categorized into four broad categories: homology modeling, fold recognition (threading), ab initio (de novo) prediction, and hybrid/machine learning techniques. Each of these categories differs in the assumptions made by them, their computational complexity, as well as their predictive accuracy.

    1. A. Homology Modeling

      Homology modeling has the advantage of the fact that proteins with similar sequences have the same structures. If the structure of a homologous protein is found in the Protein Data Bank, it can be utilized as a template to forecast the unknown structure of a target protein. It is also referred to as comparative modeling. The process involves four key steps:

      • Identification of the template with tools such as BLAST or PSI-BLAST

      • Alignment between the sequence and the template

        TABLE 1. Computational Methods for Protein Structure Prediction

        Method / Tool

        Category

        Core Principle

        Accuracy

        Strengths

        Limitations

        Release / Dev

        Chou-Fasman

        Secondary Structure

        Statistical propensities of amino acids

        ~5060%

        Simple, historical value

        Poor precision, outdated

        1974, Chou & Fasman

        GOR Method

        Secondary Structure

        Information theory and statistical context

        ~65%

        Contextual window improves accuracy

        Still less accurate than ML models

        19781980,

        Garnier et al.

        SWISS-

        MODEL

        Homology

        Modeling

        Aligns to known

        template structure

        RMSD 12 Ã…

        if >30% identity

        Automated, user-

        friendly

        Fails for low

        homology proteins

        1993+, Swiss

        Institute

        Modeller

        Homology Modeling

        Spatial restraints and energy minimization

        RMSD ~12 Ã…

        Customizable, scripting-friendly

        Requires good template

        1993, Sali & Blundell

        I-TASSER

        Threading + Assembly

        Threading, ab initio, structure assembly

        Among CASP top performers

        Full-length modeling, active site prediction

        Slow, complex pipeline

        2008, Zhang Lab

        Phyre2

        Threading

        HMM-based profile alignment

        Good for low- homology

        targets

        Online, robust for novel folds

        Accuracy depends on template database

        2011, Kelley et al.

        HHpred

        Threading

        HMM-HMM

        alignment and structure fit

        High template detection power

        Fast and interactive

        Requires manual interpretation

        2005+, Soeding et al.

        SPARKS-X

        Threading

        Profile-profile alignment + energy

        scoring

        Better than Phyre2 in some

        cases

        Strong fold recognition

        Limited user control

        2011, Zhou & Zhou

        Rosetta

        Ab initio

        Fragment assembly, Monte Carlo sampling

        Excellent for

        <100 AA

        Insight into folding mechanisms

        High computational demand

        2000+, Baker Lab

        QUARK

        Ab initio

        Distance/contact- guided fragment

        assembly

        Excellent for small proteins

        Deep learning- based contact

        prediction

        Not suited for large proteins

        2012+, Zhang Lab

        AlphaFold2

        Deep

        Learning

        Evoformer + SE(3)

        transformer

        GDT_TS > 90

        (CASP14)

        Accurate without

        templates

        Does not capture

        full dynamics

        2020,

        DeepMind

        RosettaFold

        Deep Learning Hybrid

        Rosetta physics + Deep learning

        Competitive with AlphaFold

        Models complexes and interfaces

        Slightly less accurate than AlphaFold2

        2021, Baker Lab

        ESMFold

        Protein Language

        Model

        Transformer-based structure prediction

        Very fast and accurate

        No MSA needed, scalable

        Slightly lower accuracy on hard

        targets

        2022, Meta AI

      • Model construction with the programs such as MODELLER

        • The refinement and validation through energy minimization

          The accuracy of a homology model is proportional to the target template sequence identity; models with >50% sequence identity tends to have high accuracy. It is a fast and cheap computer method, thus a structure prediction method of choice for large-scale predictions. However, it is not able to predict novel folds and becomes unreliable if there is no appropriate template [12].

    2. Fold Recognition (Threading)

      Fold recognition, or threading, seeks to find a match between the target protein sequence and a database of existing structural folds, even when there is no high sequence similarity. The approach presumes that the set of possible protein folds is small and that new sequences are bound to take one of the known folds. Algorithms score various structural alignments against statistical potentials, energy scores, or machine learning classifiers. Threading is especially beneficial for those proteins having low sequence identity (<30%) but available structural analogs. Nevertheless, its precision is limited by the depth of the fold structural database and the capability of the scoring function to identify correct from incorrect alignments.

    3. Ab Initio (De Novo) Prediction

      Ab initio prediction seeks to predict protein structure entirely from its amino acid sequence, independent of known templates. This approach imitates protein folding based on physical principles, such as thermodynamics and molecular mechanics. Energy functions estimate the free energy landscape of protein conformations, and optimization algorithms seek the global minimum energy structure. Rosetta, which was created at the University of Washington, is a well- known ab initio tool that employs fragment assembly, in which short structural motifs are sampled and assembled to create full length models [7]. QUARK follows the same strategy but employs a coarse-grained description and knowledge-based energy functions. Ab initio approaches are computationally demanding and are typically restricted to small proteins (<150 residues). In spite of their limitations, they are still crucial for finding new protein folds and learning about folding pathways.

    4. Hybrid and Machine Learning Strategies

    Current protein structure prediction mostly relies on hybrid methods that incorporate elements of homology modeling, threading, ab initio approaches, and machine learning. The methods are facilitated by the availability of enormous sequence and structure databases, enhanced computational power, and novel neural network architetures.

    AlphaFold2 is a major advance in this class. It employs an attention-based deep learning architecture known as Evoformer to read multiple sequence alignments (MSAs) and make pair wise residue distance predictions, torsion angle predictions, and atomic coordinate predictions [9]. In contrast

    to other models, AlphaFold2 makes end-to-end predictions, learning structure as part of the learning process itself. It achieved record-setting accuracy at CASP14, beating all other models by a huge margin.

    Other significant models include RoseTTAFold, which combines sequence, distance, and 3D coordinate information into an integrated network, and OmegaFold, which eliminates MSA creation altogether. These techniques are transforming structural biology by providing high-throughput, reliable predictions even for proteins with few homologues.

  4. ALPHAFOLD AND RECENT DEEP LEARNING ADVANCES

    The release of AlphaFold2 by DeepMind was a groundbreaking leap in protein structure prediction. AlphaFold2 uses deep learning, and specifically transformer architectures and attention mechanisms, to predict inter- residue distance and orientation and result in atomic-level models of proteins. AlphaFold2's central module, Evoformer, combines MSAs and pair wise residue characteristics to construct a 3D representation of the protein. It progressively updates structural predictions with a recycling approach, enhancing the output with every pass. The structure module then converts the enhanced representations into 3D coordinates. This method greatly surpasses earlier template- based and ab initio approaches. At CASP14, AlphaFold2 attained median Global Distance Test (GDT) values of over 90, a standard regarded as the equivalent of experimental quality for most targets. Subsequent predictions have been proven against X-ray and cryoEM structures, illustrating real- world usability. In the wake of AlphaFold2's triumph, a number of other models have ensued:

    1. RoseTTAFold: Created by the Baker lab, this model utilizes a three-track network that processes sequence, distance, and coordinate data simultaneously, performing as well as AlphaFold2 on most targets [11].

    2. ESMFold: Built by Meta AI, the model applies protein language models trained on huge sequence datasets to make structure predictions independent of MSAs [15].

    3. OmegaFold: An MSA-free method forecasting the protein structures directly from the sequence with transformer architecture [16].

    These breakthroughs shows end-to-end deep learning models minimizing dependence on classical bioinformatics preprocessing and allowing structure prediction at scales and accuracy unprecedented before.

  5. EVALUATION METRICS AND BENCHMARKS

    Proper evaluation of protein structure prediction is critical to benchmark other approaches and inform future development. Various metrics and evaluation platforms have been designed for this reason.

    1. CASP (Critical Assessment of Protein Structure Prediction)

      CASP is a biennial, community experiment aimed at objectively determining the state of the art in protein structure prediction. Participants make blind predictions on proteins whose structures are being experimentally determined but not yet publicly released. Predictions are matched against experimental results by an independent panel of assessors. CASP has played a critical role in monitoring progress in the area. To illustrate, CASP14 in 2020 showed AlphaFold2's unprecedented success, as the model registered a median GDT_TS score greater than 90, signifying atomic-level accuracy [17]. Such findings made deep learning the new gold standard in PSP.

    2. GDT, RMSD, and TM-score

    • GDT_TS (Global Distance Test Total Score): Quantifies the amount of residues in the predicted structure that lie within a set of distance thresholds from the experimental structure. A GDT_TS greater than 70 is typically regarded as good quality.

    • RMSD (Root Mean Square Deviation): Measures the average distance between atoms in predicted and experimental structures. It is the protein size and locally sensitive.

    • TM-score (Template modeling score): Scale-invariant measure for comparing the topology of two protein structures. TM-score >0.5 shows the successful fold prediction.

    These measures are commonly used in both community assessments and single benchmarking studies, aiding in measuring model quality and informing method improvement [18].

  6. APPLICATIONS

    Protein Structure Prediction (PSP) has been an instrument with far-reaching applications in large scientific fields. The consequences of well-predicted protein structures lie far beyond mere scientific interest and have real-world applications in medicine, industry, and biology.

    1. Drug Discovery

      Drug discovery is the major beneficiary of PSP advancements. Most drugs act by binding to certain proteins, usually enzymes, receptors, or ion channels. Knowledge of the 3D structure of a protein is useful in the identification of active or binding sites where small molecules (drugs) can bind.

      Computational approaches such as structure-based drug design (SBDD) are dependent on good protein models to virtually screen large compound libraries against the binding pockets of the protein.

      AlphaFold prediction have allowed scientists to model proteins for which experimental structures previously were not available, speeding drug discovery against diseases like cancer, tuberculosis, and COVID-19 [19].

    2. Enzyme Design

      In industrial biotechnology, enzymes are applied in everything from food processing to bio fuel manufacture. Rational enzyme design to increase activity, specificity, or stability at varying environmental conditions needs to be informed by the structural conformation of the enzymes. PSP allows for the rational design of enzymes, where mutations are introduced into the predicted structures to customize functionality. Tools such as RosettaDesign and AlphaFold prediction are used to forecast how these changes affect folding and active site geometry, substantially minimizing experimental trial and error [20].

    3. Disease Research

      Most diseases are formed from mutations leading to proteins misfolding or being nonfunctional. The neurodegenerative diseases like Alzheimer's, Parkinson's, and Huntington's, for example, are typified by protein misfolding and aggregation. PSP supports research on the effects of some genetic mutations on protein structure and disease pathology. Prediction of the structure of mutant proteins is beneficial in the development of targeted therapies or evaluation of the effect of the precision medicine strategies for conditions like cystic fibrosis and some cancers [21].

    4. Synthetic Biology

    In synthetic biology, scientists engineer novel proteins or redesign existing proteins for new purposes. These include biosensors, biochemical pathways for making drugs or degrading pollutant proteins. PSP plays a key role in guaranteeing designed proteins fold properly and perform their desired functions. For example, predictions by AlphaFold are now employed to validate the feasibility of engineered sequences and to optimize design approaches, opening the door to de novo protein engineering, designing proteins ab initio with desired structural characteristics [22].

  7. CONCLUSION

Protein structure prediction might revolutionize biology and medicine. The AI-driven techniques achieve near- experimental accuracy, breaking down the hurdles of understanding proteomics. The interdisciplinary research integating bioinformatics, physics, and AI might lead to significant discoveries in the near future.

ACKNOWLEDGMENT

We acknowledge the productive contributions of the scientific community whose work in field of the protein structure prediction has laid the foundation for this work. Special thanks to the developers and researchers behind the pioneering tools such as AlphaFold, Rosetta[13], and SWISS-MODEL [3], whose innovations continue to drive the field forward. We also appreciate the resources provided by public databases like the Protein Data Bank and the CASP community for fostering transparency and bench marking. This analysis would not have been possible without the collective efforts of enormous researchers over the past five decades.

REFERENCES

  1. Anfinsen, C.B. (1973). Principles that govern the folding of protein chains. Science, 181(4096), 223-230.

  2. UniProt Consortium. (2023). UniProt: the universal protein knowledge base. Nucleic Acids Res, 51(D1), D523D531.

  3. Chou, P.Y., & Fasman, G.D. (1974). Prediction of protein conformation. Biochemistry, 13(2), 222-245.

  4. Garnier, J., Osguthorpe, D.J., & Robson, B. (1978). Analysis of the accuracy and implications of simple methods for predicting protein secondary structure. J. Mol. Biol., 120(1), 97-120.

  5. Martí-Renom, M.A., et al. (2000). Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct., 29, 291-325.

  6. Kelley, L.A., et al. (2015). The Phyre2 web portal for protein modeling. Nat. Protoc., 10, 845858.

  7. Xu, D., & Zhang, Y. (2012). Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins, 80(7), 1715-1735.

  8. Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292(2), 195202.

  9. Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583589.

  10. Senior, A.W., et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), 706710.

  11. Baek, M., et al. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557), 871- 876.

  12. Eswar, N., et al. (2006). Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics, Chapter 5: Unit 5.6.

  13. Roy, A., et al. (2010). I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc., 5(4), 725-738.

  14. Yang, J., et al. (2015). The I-TASSER Suite: protein structure and function prediction. Nat. Methods, 12(1), 7-8.

  15. Lin, Z., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123 1130.

  16. Wu, R., et al. (2022). High-resolution de novo structure prediction from primary sequence. bioRxiv.

  17. Moult, J., et al. (2018). Critical assessment of methods of protein structure prediction: Progress and new directions in round XI. Proteins, 86, 715.

  18. Zemla, A. (2003). LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res., 31(13), 33703374.

  19. Totrov, M., & Abagyan, R. (2008). Flexible protein-ligand docking. Methods Mol. Biol., 443, 243-256.

  20. Siegel, J.B., et al. (2010). Computational design of an enzyme catalyst for a stereoselective bimolecular DielsAlder reaction. Science, 329(5989), 309313.

  21. Soto, C. (2003). Unfolding the role of protein misfolding in neurodegenerative diseases. Nat. Rev. Neurosci., 4(1), 4960.

  22. Wang, H.H., et al. (2009). Programming cells by multiplex genome engineering and accelerated evolution. Nature, 460, 894898.

  23. Bryant, P., Pozzati, G., & Elofsson, A. (2022). Improved prediction of proteinprotein interactions using AlphaFold2. Nat. Commun., 13, 1265.

  24. Mirdita, M., et al. (2022). ColabFold: making protein folding accessible to all. Nat. Methods, 19(6), 679682.

  25. Evans, R., et al. (2021). Protein complex prediction with AlphaFoldMultimer. bioRxiv.

  26. Mirdita, M., et al. (2022). ColabFold: making protein folding accessible to all. Nat. Methods, 19(6), 679682.