

- Open Access
- Authors : Gourikrishna O S, Maidhili Mohan K
- Paper ID : IJERTV14IS050374
- Volume & Issue : Volume 14, Issue 05 (May 2025)
- Published (First Online): 05-06-2025
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Computational Methods for Protein Structure Prediction
Gourikrishna O S
Postgraduate Department of Computer Science Prajyoti Niketan College,
Thrissur, India
Maidhili Mohan K
Postgraduate Department of Computer Science Prajyoti Niketan College,
Thrissur, India
Abstract It is one of the hardest tasks in biology to predict the structure of a protein from its amino acid sequence. The shape of a protein dictates how it works, interacts with other molecules, and contributes to health and disease. Historically, to resolve protein structures, researchers have used expensive and time- consuming laboratory techniques like X-ray crystallography and NMR spectroscopy. To get around these limitations, researchers have developed computational techniques, such as homology modeling, which compares proteins to ones with known structures, and ab initio approaches, which create predictions from scratch. In the past few years, artificial intelligencemost notably deep learning models such as AlphaFoldhas revolutionized the process by being able to make correct predictions more quickly and efficaciously than ever before. It being easier to determine protein structures unlocks new avenues for drug discovery, disease investigation, and biotechnology. Traditional experimental methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy are precise but expensive and time-consuming. The sheer growth in known protein sequences due to high-throughput sequencing has frequently outgrown the experimentally known structures. This difference highlights the need for reliable protein structure prediction by computational techniques. Deep learning, particularly the AlphaFold2 model, significantly improves prediction accuracy. These models use evolutionary information and neural network designs to provide high-resolution structural predictions, offering a viable alternative to experimental methods and advancing the field of structural biology.
Keywords Protein structure prediction, AlphaFold, Deep learning, Homology modeling, Ab initio prediction, Structural bioinformatics, Protein folding, Artificial intelligence, CASP, Cryo-EM
-
INTRODUCTION
Proteins form the basis of almost every biological process. Among other functions, they catalyze the metabolic processes (enzymes), serve as structural support (collagen), transport molecules (hemoglobin), and control cellular functions (hormones and receptors). Its three-dimensional structure, as defined by its sequence of amino acids, is what a protein's function is inherently based upon. This premise, commonly known as the sequence-structure-function paradigm, is the essence of molecular biology.
Protein structure determination has conventionally depended on experimental methods such as X-ray crystallography, which is still the standard; nuclear magnetic resonance (NMR) spectroscopy for small- to medium-sized proteins; and cryo- electron microscopy (cryo-EM) for large macromolecular assemblies. Although they are highly accurate, these approaches are time-consuming, labor-intensive, and
frequently restricted by the quality of available crystals or the dynamical character of the target proteins.
This constraint has spurred the creation of computational approaches to predict protein structures from amino acid sequences alone, an effort sometimes called the "protein folding problem." The magnitude of this task cannot be underestimated: the Protein Data Bank (PDB) contains experimentally determined structures for a minority of known protein sequences cataloged in databases like UniProt. This gap emphasizes the need for sound, scalable, and rapid predictive tools. Computational methods of protein structure prediction are divided into a number of categories:
-
Homology modeling, or comparative modeling, which is based on sequence resemblance to known structures.
-
Threading or fold recognition, which tries to thread sequences into known structural folds.
-
Ab initio or de novo approaches, which try to predict structures from physical principles without the use of templates.
Machine learning-based approaches, specifically deep learning models such as AlphaFold, which learns patterns of structure from large datasets. All of these methods have advantages and disadvantages, and they vary in efficiency based on aspects such as available homologous sequences, protein length, and structural diversity. This paper discusses all these methods, compares their accuracy and computational demand, and discusses the recent advances that have made protein structure prediction a feasible tool for scientists across the globe.
-
-
LITERATURE SURVEY
The search for protein structure prediction computationally has come a long way since the last half-century. The first set of methods targeted secondary structure prediction based on statistical propensities of the amino acids. The Chou-Fasman algorithm [1] and the GOR method [2] were among the initial sets of algorithms to implement residue propensities and information theory for the prediction of helices, sheets, and coils. While not very accurate, they set the platform for contemporary approaches. With the expansion of protein sequence databases and the creation of the Protein Data Bank (PDB), homology modeling became a prevalent approach. Methods such as those applied in SWISS-MODEL [3], Modeller [4], and I-TASSER [5] superimpose a target sequence onto a known structure (template) and build the target structure by analogy. Homology modeling has been found to generate models with root-mean-square deviation (RMSD) of 1-2 Ã… from experimental structures if the sequence identity is greater than 30% [6]. Threading approaches started
gaining recognition when it was seen that proteins with extremely low sequence identity still could fold into similar structures. Structural alignment techniques and statistical potentials are applied by tools such as Phyre2 [7], HHpred [8], and SPARKS-X [9] to evaluate target sequence fit into current structural folds. These approaches are especially useful for orphan proteins without sequence database homologs.
Ab initio prediction methods, such as the popular Rosetta [10] and QUARK [11], try to fold proteins de novo from energy minimization and fragment assembly. These are computationally intensive and rely on sampling strategies. Although ab initio methods fall short on big proteins, they have been crucial in modeling small, new proteins and investigating folding mechanisms. The integration of machine learning into PSP introduced a new sophistication. These early models predicted secondary structure and contact maps using neural networks.
These gave rise to deep learning models that could incorporate evolutionary, spatial, and physicochemical information. AlphaFold from DeepMind is a culmination of this development. AlphaFold2, launched in 2020, beat every other method in the 14th Critical Assessment of Structure Prediction (CASP14), obtaining GDT_TS scores higher than 90 for most targets [12].
Some other important contributions are RosettaFold [13], which merges deep learning with the energy function of Rosetta, and ESMFold [14], which utilizes protein language models. These have extended structure prediction to
encompass protein complexes and dynamic states.
This review of the literature synthesizes evidence from more than 25 peer-reviewed publications and benchmarks, as well as CASP competition results, to give an overall pictureof the history, status, and future direction of protein structure prediction. Table 1 is a comparative literature table capturing key computational methods in protein structure prediction.
-
TYPES OF PROTEIN STRUCTURE PREDICTION
Protein structure prediction methods can be categorized into four broad categories: homology modeling, fold recognition (threading), ab initio (de novo) prediction, and hybrid/machine learning techniques. Each of these categories differs in the assumptions made by them, their computational complexity, as well as their predictive accuracy.
-
A. Homology Modeling
Homology modeling has the advantage of the fact that proteins with similar sequences have the same structures. If the structure of a homologous protein is found in the Protein Data Bank, it can be utilized as a template to forecast the unknown structure of a target protein. It is also referred to as comparative modeling. The process involves four key steps:
-
Identification of the template with tools such as BLAST or PSI-BLAST
-
Alignment between the sequence and the template
TABLE 1. Computational Methods for Protein Structure Prediction
Method / Tool
Category
Core Principle
Accuracy
Strengths
Limitations
Release / Dev
Chou-Fasman
Secondary Structure
Statistical propensities of amino acids
~5060%
Simple, historical value
Poor precision, outdated
1974, Chou & Fasman
GOR Method
Secondary Structure
Information theory and statistical context
~65%
Contextual window improves accuracy
Still less accurate than ML models
19781980,
Garnier et al.
SWISS-
MODEL
Homology
Modeling
Aligns to known
template structure
RMSD 12 Ã…
if >30% identity
Automated, user-
friendly
Fails for low
homology proteins
1993+, Swiss
Institute
Modeller
Homology Modeling
Spatial restraints and energy minimization
RMSD ~12 Ã…
Customizable, scripting-friendly
Requires good template
1993, Sali & Blundell
I-TASSER
Threading + Assembly
Threading, ab initio, structure assembly
Among CASP top performers
Full-length modeling, active site prediction
Slow, complex pipeline
2008, Zhang Lab
Phyre2
Threading
HMM-based profile alignment
Good for low- homology
targets
Online, robust for novel folds
Accuracy depends on template database
2011, Kelley et al.
HHpred
Threading
HMM-HMM
alignment and structure fit
High template detection power
Fast and interactive
Requires manual interpretation
2005+, Soeding et al.
SPARKS-X
Threading
Profile-profile alignment + energy
scoring
Better than Phyre2 in some
cases
Strong fold recognition
Limited user control
2011, Zhou & Zhou
Rosetta
Ab initio
Fragment assembly, Monte Carlo sampling
Excellent for
<100 AA
Insight into folding mechanisms
High computational demand
2000+, Baker Lab
QUARK
Ab initio
Distance/contact- guided fragment
assembly
Excellent for small proteins
Deep learning- based contact
prediction
Not suited for large proteins
2012+, Zhang Lab
AlphaFold2
Deep
Learning
Evoformer + SE(3)
transformer
GDT_TS > 90
(CASP14)
Accurate without
templates
Does not capture
full dynamics
2020,
DeepMind
RosettaFold
Deep Learning Hybrid
Rosetta physics + Deep learning
Competitive with AlphaFold
Models complexes and interfaces
Slightly less accurate than AlphaFold2
2021, Baker Lab
ESMFold
Protein Language
Model
Transformer-based structure prediction
Very fast and accurate
No MSA needed, scalable
Slightly lower accuracy on hard
targets
2022, Meta AI
-
Model construction with the programs such as MODELLER
-
The refinement and validation through energy minimization
The accuracy of a homology model is proportional to the target template sequence identity; models with >50% sequence identity tends to have high accuracy. It is a fast and cheap computer method, thus a structure prediction method of choice for large-scale predictions. However, it is not able to predict novel folds and becomes unreliable if there is no appropriate template [12].
-
-
-
Fold Recognition (Threading)
Fold recognition, or threading, seeks to find a match between the target protein sequence and a database of existing structural folds, even when there is no high sequence similarity. The approach presumes that the set of possible protein folds is small and that new sequences are bound to take one of the known folds. Algorithms score various structural alignments against statistical potentials, energy scores, or machine learning classifiers. Threading is especially beneficial for those proteins having low sequence identity (<30%) but available structural analogs. Nevertheless, its precision is limited by the depth of the fold structural database and the capability of the scoring function to identify correct from incorrect alignments.
-
Ab Initio (De Novo) Prediction
Ab initio prediction seeks to predict protein structure entirely from its amino acid sequence, independent of known templates. This approach imitates protein folding based on physical principles, such as thermodynamics and molecular mechanics. Energy functions estimate the free energy landscape of protein conformations, and optimization algorithms seek the global minimum energy structure. Rosetta, which was created at the University of Washington, is a well- known ab initio tool that employs fragment assembly, in which short structural motifs are sampled and assembled to create full length models [7]. QUARK follows the same strategy but employs a coarse-grained description and knowledge-based energy functions. Ab initio approaches are computationally demanding and are typically restricted to small proteins (<150 residues). In spite of their limitations, they are still crucial for finding new protein folds and learning about folding pathways.
-
Hybrid and Machine Learning Strategies
Current protein structure prediction mostly relies on hybrid methods that incorporate elements of homology modeling, threading, ab initio approaches, and machine learning. The methods are facilitated by the availability of enormous sequence and structure databases, enhanced computational power, and novel neural network architetures.
AlphaFold2 is a major advance in this class. It employs an attention-based deep learning architecture known as Evoformer to read multiple sequence alignments (MSAs) and make pair wise residue distance predictions, torsion angle predictions, and atomic coordinate predictions [9]. In contrast
to other models, AlphaFold2 makes end-to-end predictions, learning structure as part of the learning process itself. It achieved record-setting accuracy at CASP14, beating all other models by a huge margin.
Other significant models include RoseTTAFold, which combines sequence, distance, and 3D coordinate information into an integrated network, and OmegaFold, which eliminates MSA creation altogether. These techniques are transforming structural biology by providing high-throughput, reliable predictions even for proteins with few homologues.
-
-
ALPHAFOLD AND RECENT DEEP LEARNING ADVANCES
The release of AlphaFold2 by DeepMind was a groundbreaking leap in protein structure prediction. AlphaFold2 uses deep learning, and specifically transformer architectures and attention mechanisms, to predict inter- residue distance and orientation and result in atomic-level models of proteins. AlphaFold2's central module, Evoformer, combines MSAs and pair wise residue characteristics to construct a 3D representation of the protein. It progressively updates structural predictions with a recycling approach, enhancing the output with every pass. The structure module then converts the enhanced representations into 3D coordinates. This method greatly surpasses earlier template- based and ab initio approaches. At CASP14, AlphaFold2 attained median Global Distance Test (GDT) values of over 90, a standard regarded as the equivalent of experimental quality for most targets. Subsequent predictions have been proven against X-ray and cryoEM structures, illustrating real- world usability. In the wake of AlphaFold2's triumph, a number of other models have ensued:
-
RoseTTAFold: Created by the Baker lab, this model utilizes a three-track network that processes sequence, distance, and coordinate data simultaneously, performing as well as AlphaFold2 on most targets [11].
-
ESMFold: Built by Meta AI, the model applies protein language models trained on huge sequence datasets to make structure predictions independent of MSAs [15].
-
OmegaFold: An MSA-free method forecasting the protein structures directly from the sequence with transformer architecture [16].
These breakthroughs shows end-to-end deep learning models minimizing dependence on classical bioinformatics preprocessing and allowing structure prediction at scales and accuracy unprecedented before.
-
-
EVALUATION METRICS AND BENCHMARKS
Proper evaluation of protein structure prediction is critical to benchmark other approaches and inform future development. Various metrics and evaluation platforms have been designed for this reason.
-
CASP (Critical Assessment of Protein Structure Prediction)
CASP is a biennial, community experiment aimed at objectively determining the state of the art in protein structure prediction. Participants make blind predictions on proteins whose structures are being experimentally determined but not yet publicly released. Predictions are matched against experimental results by an independent panel of assessors. CASP has played a critical role in monitoring progress in the area. To illustrate, CASP14 in 2020 showed AlphaFold2's unprecedented success, as the model registered a median GDT_TS score greater than 90, signifying atomic-level accuracy [17]. Such findings made deep learning the new gold standard in PSP.
-
GDT, RMSD, and TM-score
-
GDT_TS (Global Distance Test Total Score): Quantifies the amount of residues in the predicted structure that lie within a set of distance thresholds from the experimental structure. A GDT_TS greater than 70 is typically regarded as good quality.
-
RMSD (Root Mean Square Deviation): Measures the average distance between atoms in predicted and experimental structures. It is the protein size and locally sensitive.
-
TM-score (Template modeling score): Scale-invariant measure for comparing the topology of two protein structures. TM-score >0.5 shows the successful fold prediction.
These measures are commonly used in both community assessments and single benchmarking studies, aiding in measuring model quality and informing method improvement [18].
-
-
APPLICATIONS
Protein Structure Prediction (PSP) has been an instrument with far-reaching applications in large scientific fields. The consequences of well-predicted protein structures lie far beyond mere scientific interest and have real-world applications in medicine, industry, and biology.
-
Drug Discovery
Drug discovery is the major beneficiary of PSP advancements. Most drugs act by binding to certain proteins, usually enzymes, receptors, or ion channels. Knowledge of the 3D structure of a protein is useful in the identification of active or binding sites where small molecules (drugs) can bind.
Computational approaches such as structure-based drug design (SBDD) are dependent on good protein models to virtually screen large compound libraries against the binding pockets of the protein.
AlphaFold prediction have allowed scientists to model proteins for which experimental structures previously were not available, speeding drug discovery against diseases like cancer, tuberculosis, and COVID-19 [19].
-
Enzyme Design
In industrial biotechnology, enzymes are applied in everything from food processing to bio fuel manufacture. Rational enzyme design to increase activity, specificity, or stability at varying environmental conditions needs to be informed by the structural conformation of the enzymes. PSP allows for the rational design of enzymes, where mutations are introduced into the predicted structures to customize functionality. Tools such as RosettaDesign and AlphaFold prediction are used to forecast how these changes affect folding and active site geometry, substantially minimizing experimental trial and error [20].
-
Disease Research
Most diseases are formed from mutations leading to proteins misfolding or being nonfunctional. The neurodegenerative diseases like Alzheimer's, Parkinson's, and Huntington's, for example, are typified by protein misfolding and aggregation. PSP supports research on the effects of some genetic mutations on protein structure and disease pathology. Prediction of the structure of mutant proteins is beneficial in the development of targeted therapies or evaluation of the effect of the precision medicine strategies for conditions like cystic fibrosis and some cancers [21].
-
Synthetic Biology
In synthetic biology, scientists engineer novel proteins or redesign existing proteins for new purposes. These include biosensors, biochemical pathways for making drugs or degrading pollutant proteins. PSP plays a key role in guaranteeing designed proteins fold properly and perform their desired functions. For example, predictions by AlphaFold are now employed to validate the feasibility of engineered sequences and to optimize design approaches, opening the door to de novo protein engineering, designing proteins ab initio with desired structural characteristics [22].
-
-
CONCLUSION
Protein structure prediction might revolutionize biology and medicine. The AI-driven techniques achieve near- experimental accuracy, breaking down the hurdles of understanding proteomics. The interdisciplinary research integating bioinformatics, physics, and AI might lead to significant discoveries in the near future.
ACKNOWLEDGMENT
We acknowledge the productive contributions of the scientific community whose work in field of the protein structure prediction has laid the foundation for this work. Special thanks to the developers and researchers behind the pioneering tools such as AlphaFold, Rosetta[13], and SWISS-MODEL [3], whose innovations continue to drive the field forward. We also appreciate the resources provided by public databases like the Protein Data Bank and the CASP community for fostering transparency and bench marking. This analysis would not have been possible without the collective efforts of enormous researchers over the past five decades.
REFERENCES
-
Anfinsen, C.B. (1973). Principles that govern the folding of protein chains. Science, 181(4096), 223-230.
-
UniProt Consortium. (2023). UniProt: the universal protein knowledge base. Nucleic Acids Res, 51(D1), D523D531.
-
Chou, P.Y., & Fasman, G.D. (1974). Prediction of protein conformation. Biochemistry, 13(2), 222-245.
-
Garnier, J., Osguthorpe, D.J., & Robson, B. (1978). Analysis of the accuracy and implications of simple methods for predicting protein secondary structure. J. Mol. Biol., 120(1), 97-120.
-
MartÃ-Renom, M.A., et al. (2000). Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct., 29, 291-325.
-
Kelley, L.A., et al. (2015). The Phyre2 web portal for protein modeling. Nat. Protoc., 10, 845858.
-
Xu, D., & Zhang, Y. (2012). Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins, 80(7), 1715-1735.
-
Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292(2), 195202.
-
Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583589.
-
Senior, A.W., et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), 706710.
-
Baek, M., et al. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557), 871- 876.
-
Eswar, N., et al. (2006). Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics, Chapter 5: Unit 5.6.
-
Roy, A., et al. (2010). I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc., 5(4), 725-738.
-
Yang, J., et al. (2015). The I-TASSER Suite: protein structure and function prediction. Nat. Methods, 12(1), 7-8.
-
Lin, Z., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123 1130.
-
Wu, R., et al. (2022). High-resolution de novo structure prediction from primary sequence. bioRxiv.
-
Moult, J., et al. (2018). Critical assessment of methods of protein structure prediction: Progress and new directions in round XI. Proteins, 86, 715.
-
Zemla, A. (2003). LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res., 31(13), 33703374.
-
Totrov, M., & Abagyan, R. (2008). Flexible protein-ligand docking. Methods Mol. Biol., 443, 243-256.
-
Siegel, J.B., et al. (2010). Computational design of an enzyme catalyst for a stereoselective bimolecular DielsAlder reaction. Science, 329(5989), 309313.
-
Soto, C. (2003). Unfolding the role of protein misfolding in neurodegenerative diseases. Nat. Rev. Neurosci., 4(1), 4960.
-
Wang, H.H., et al. (2009). Programming cells by multiplex genome engineering and accelerated evolution. Nature, 460, 894898.
-
Bryant, P., Pozzati, G., & Elofsson, A. (2022). Improved prediction of proteinprotein interactions using AlphaFold2. Nat. Commun., 13, 1265.
-
Mirdita, M., et al. (2022). ColabFold: making protein folding accessible to all. Nat. Methods, 19(6), 679682.
-
Evans, R., et al. (2021). Protein complex prediction with AlphaFoldMultimer. bioRxiv.
-
Mirdita, M., et al. (2022). ColabFold: making protein folding accessible to all. Nat. Methods, 19(6), 679682.