
AI-Based Platform for Drug-Pathogen Molecular Interaction Analysis: A Full-Stack Adaptive Framework with Random Forest Affinity Prediction and Physicochemical Profiling

DOI : https://doi.org/10.5281/zenodo.20000888

Dr. Varsha S Jadhav

Information Science and Engineering SDM College of Engineering and Technology Dharwad, India

Shankar Patil

Information Science and Engineering SDM College of Engineering and Technology Dharwad, India

Keshav Terdal

Information Science and Engineering SDM College of Engineering and Technology Dharwad, India

Manikanth R Hebballi

Information Science and Engineering SDM College of Engineering and Technology Dharwad, India

Vaibhav Kalyanshetti

Information Science and Engineering SDM College of Engineering and Technology Dharwad, India

Abstract – Growing concerns over antimicrobial resistance (AMR) have created an urgent need to rethink how candidate drugs are identified in early development stages. Existing physics-driven methods such as molecular docking and high-throughput screening (HTS) suffer from high computational overhead and poor scalability, rendering them unsuitable for keeping pace with rapidly evolving pathogenic mutations. This work introduces a modular, full-stack computational system that tightly couples machine learning-based affinity estimation, on-the-fly molecular characterisation, and live interaction scoring within one unified pipeline. At the core of the system lies a Random Forest (RF) ensemble optimised for drug–target binding affinity regression. The primary novelty is the Adaptive Affinity Module, which constructs 1044-dimensional input vectors by concatenating 512-bit ECFP4 Morgan fingerprints encoding small-molecule ligands with 532-element integer-mapped protein sequence representations. A penalty-weighted scoring engine built on RDKit evaluates molecular stability to deterministically identify high-quality binding candidates. For shortlisted complexes, five-dimensional ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) profiles are generated in real time, yielding actionable drug-likeness estimates. The Mutation Laboratory has been substantially upgraded into a Physicochemical and Evolutionary Intelligence Engine, incorporating BLOSUM62-guided evolutionary impact scoring, lightweight 3D burial analysis, and a context-aware physicochemical refinement formula that adjusts the predicted binding free energy shift (ΔΔG) beyond raw ML output. On 500 curated drug–protein records drawn from BindingDB and ChEMBL, split into 400 training and 100 held-out test samples, the model produced an RMSE of 0.62 and R² of 0.89 on the test set, with predictions generated in approximately 1.2 seconds per query.

Index Terms – Antimicrobial Resistance (AMR), Random Forest, Morgan Fingerprints (ECFP4), ADMET Profiling, Drug–Target Interaction (DTI), Binding Affinity, SMILES, RDKit, BLOSUM62, Mutation Resistance Simulation, Evolutionary Impact, Residue Burial Analysis, Full-Stack AI, Flask, React

  1. INTRODUCTION

    1. The Global Burden of Antimicrobial Resistance

      Among the foremost threats to global public health in the modern era, antimicrobial resistance (AMR) stands out for its potential scale and urgency. Projections from the World Health Organization suggest that without coordinated intervention, fatalities attributable to drug-resistant pathogens may exceed 10 million per year by 2050, overtaking oncological diseases as a primary cause of death [1]. Murray et al. [4] reported in a comprehensive 2022 study that AMR accounted for roughly 1.27 million direct deaths worldwide in 2019, while Laxminarayan et al. [3] underscored the necessity of globally coordinated countermeasures. The proliferation of multidrug-resistant (MDR) organisms, among them Methicillin-Resistant Staphylococcus aureus (MRSA), carbapenem-resistant Enterobacteriaceae (CRE), and extensively drug-resistant Mycobacterium tuberculosis (XDR-TB), continues to erode the effectiveness of available antibiotics [2]. Well-documented biochemical resistance strategies, such as enzymatic drug inactivation, efflux-mediated expulsion, and alteration of binding targets, collectively challenge the continued utility of current treatment regimens [42], [43].

      Conventional drug development workflows encompass target identification, compound screening, lead refinement, and multi-phase clinical evaluation, with typical timelines of 10 to 15 years and expenditures reaching USD 2.6 billion per successful approval [5]. Research by Paul et al. [6] pinpointed high late-stage attrition as the dominant cost driver, motivating investment in more reliable upstream predictive technologies. Structure-based computational tools, including AutoDock Vina [7], Glide [9], and GROMACS [46], have partially accelerated early phases yet impose significant resource demands, often requiring dedicated computing clusters and extended runtimes per compound [51]. Although structure-guided virtual screening strategies [52] informed by the Protein Data Bank [40] have expanded screening capacity, their dependence on experimentally resolved 3D structures remains a fundamental bottleneck.

    2. Limitations of Existing Approaches

      Three recurring weaknesses characterise contemporary DTI prediction and screening frameworks:

      • Throughput and Scalability Constraints: Structure-based docking algorithms exhaustively traverse 3D conformational landscapes of protein active sites. Processing a single compound with AutoDock Vina may consume 20 to 60 minutes, making proteome-scale screening infeasible in standard laboratory settings.

      • Fragmented Output Reporting: The majority of DTI tools output only a binding affinity estimate (ΔG or Kd), leaving out the broader physicochemical context, such as Lipinski compliance, metabolic vulnerability, and safety flags, that clinical translation demands.

      • Limited Transparency and Mutation Modelling: Many deep learning DTI systems (e.g., DeepDTA, GraphDTA) function as black boxes, providing no insight into which molecular features govern affinity predictions. Furthermore, these models lack mechanisms to assess how point mutations or single-nucleotide polymorphisms (SNPs) within pathogen targets translate to drug resistance, and none incorporate evolutionary or structural context into their mutation analysis.

    3. Proposed Solution and Contributions

    The proposed platform directly tackles each of these shortcomings through an integrated AI-driven approach. The principal contributions of this work include:

    1. A novel 1044-dimensional molecular representation formed by combining ECFP4 ligand fingerprints with ordinally encoded protein sequences.

    2. A tuned Random Forest ensemble delivering R² = 0.89 and RMSE = 0.62 with near-instantaneous inference (approximately 1.2 s per query).

    3. An embedded five-axis ADMET scoring module providing real-time pharmacokinetic assessment within the prediction loop.

    4. An upgraded Physicochemical and Evolutionary Intelligence Engine within the Mutation Laboratory, incorporating BLOSUM62 evolutionary scoring, 3D residue burial analysis, and a physicochemical refinement formula for biologically grounded ΔΔG estimation.

    5. A holographic 3D visualisation interface dynamically linked to the mutation calculator for spatial context analysis.

    6. A production-ready full-stack system secured via JWT tokens and PBKDF2-HMAC-SHA256 password hashing.

  2. LITERATURE SURVEY

    Computational methods for drug–target interaction (DTI) prediction have undergone a substantial evolution, transitioning from knowledge-driven pharmacophore rules to sophisticated data-centric paradigms leveraging machine learning and deep neural networks. Table I presents a structured comparison of notable methods spanning 2018 to 2025.

    1. Sequence-Based Deep Learning

      The DeepDTA framework [10] introduced the use of character-level SMILES and amino acid sequence embeddings fed into dual-branch 1D convolutional networks, achieving a concordance index (CI) of 0.878 on the Davis kinase benchmark [31]. However, its reliance on n-gram character features limits chemical interpretability. DeepConv-DTI [14] augmented this architecture by employing multi-scale convolutions across protein sub-sequences. TransformerCPI [11] adopted self-attention mechanisms [19] to improve classification performance, though its GPU memory demands escalate severely for sequences exceeding 1000 residues.

    2. Graph Neural Network Approaches

      In GraphDTA [15], drug molecules are modelled as attributed graphs where atomic nodes exchange neighbourhood features via message-passing [18], enabling topologically aware chemical encoding. Complementary work by Tsubaki et al. [47] demonstrated end-to-end graph-sequence learning for compound–protein interaction, while NeoDTI [48] extended this to heterogeneous biological networks. Despite their representational richness compared to fingerprint vectors, graph-based pipelines incur non-trivial construction and batching overheads that conflict with real-time inference requirements.

      TABLE I

      Comparative Literature Survey of Drug–Target Interaction Prediction Methods (2018–2025)

      | Reference           | Year | Method             | Dataset            | Key Metric              | Limitation                   |
      |---------------------|------|--------------------|--------------------|-------------------------|------------------------------|
      | DeepDTA [10]        | 2018 | CNN (SMILES + Seq) | Davis, KIBA        | CI = 0.878              | No ADMET; slow training      |
      | GraphDTA [15]       | 2021 | GNN + CNN          | Davis, KIBA        | MSE = 0.229             | Graph construction overhead  |
      | TransformerCPI [11] | 2020 | Transformer        | Human, C. elegans  | AUC = 0.97              | GPU-intensive; no regression |
      | SCTDTI [12]         | 2022 | Siamese CNN        | BindingDB          | RMSE = 0.71             | No mutation analysis         |
      | AttentionDTA [13]   | 2020 | Attention + BiGRU  | Davis              | MSE = 0.230             | Binary interaction only      |
      | MolBERT [20]        | 2020 | BERT (SMILES)      | ChEMBL             | AUC = 0.91              | No protein encoding          |
      | DGraphDTI [16]      | 2020 | Dual Graph         | DrugBank           | AUC = 0.963             | No affinity regression       |
      | RF-Score [23]       | 2010 | Random Forest      | PDBbind            | R² = 0.77               | 3D structure required        |
      | Proposed            | 2025 | RF + Evo. Intel.   | BindingDB + ChEMBL | R² = 0.89, RMSE = 0.62  | 2D base; no full folding     |

    3. Classical Machine Learning with Fingerprints

      The RF-Score study [23] established that Random Forest models [24] trained on crystallographic interaction fingerprints [30] can attain R² = 0.77 on PDBbind, while Svetnik et al. [25] confirmed the suitability of ensemble trees for high-dimensional QSAR feature spaces. The present work builds upon this tradition using Morgan/ECFP4 fingerprints [26], [27] as ligand descriptors, extending the 2D-only paradigm to achieve R² = 0.89 by fusing sequence-level protein features, thereby bypassing the 3D structure requirement entirely.

    4. ADMET and Drug-Likeness Integration

      Standalone tools such as SwissADME [36] and pkCSM [37] offer pharmacokinetic and toxicity profiling but operate in isolation from binding affinity estimators. Classical filter criteria including Lipinski's rule-of-five [35] and Veber's bioavailability guidelines [38] continue to serve as industry-standard oral drug-likeness benchmarks. While pre-trained molecular language models like ChemBERTa [21] have begun integrating ADMET-oriented learning objectives, no prior system has demonstrated their simultaneous delivery with affinity scoring and evolutionary mutation intelligence in a live, full-stack deployment.

    5. Evolutionary Substitution Matrices in Drug Resistance

    The BLOSUM62 matrix [22], derived from conserved protein blocks across diverse species, encodes the log-odds score for observing one amino acid substituted by another in naturally occurring homologous sequences. Unlike PAM matrices, which extrapolate from evolutionary models, BLOSUM62 is empirically derived and has been validated as the gold standard for local alignment scoring. Its integration into mutation resistance analysis is novel in the DTI context: negative BLOSUM62 scores indicate evolutionarily improbable substitutions that are more likely to destabilise the protein fold and alter binding pocket geometry, making them high-impact candidates for drug resistance markers.

  3. SYSTEM ARCHITECTURE

    The proposed system adopts a three-tier design comprising a React/Vite user interface layer, a Flask REST service layer, and a combined Python ML + RDKit computation layer supported by a SQLAlchemy/SQLite storage backend.

    Fig. 1. High-level three-tier system architecture. The React/Vite frontend communicates with the Flask REST backend via JWT-authenticated HTTP requests. The backend dispatches SMILES and FASTA payloads to the RF inference engine and RDKit cheminformatics layer, persisting results to a SQLAlchemy ORM.

    1. Frontend Architecture (React + Vite)

      The client interface is built with React 18 and bundled via Vite, which enables Hot Module Replacement (HMR) for rapid iterative development. The UI is organised into the following functional components:

      • Input Panel: Receives SMILES (ligand) and FASTA (protein) entries with real-time format checking through built-in regex validators.

      • Results Dashboard: Dynamically renders a pentagonal ADMET radar chart via the Tremor library, alongside binding affinity indicators and a molecular stability gauge that refresh upon each completed inference.

      • Mutation Lab Interface: Provides an upgraded interface accepting residue position, substitution amino acid, and optional PDB structure upload; renders the ΔΔG refinement breakdown as an interactive panel with evolutionary and structural sub-scores.

      • Holographic 3D Viewer: A 3Dmol.js-powered ribbon diagram with real-time residue selection, dynamically linked to the mutation calculator for spatial context analysis of the target site.

      • History Panel: Aggregates prior prediction records from both browser-local storage and the server persistence layer to support longitudinal analysis.

    2. Backend Architecture (Flask REST API)

      The server layer exposes five RESTful endpoints to client applications:

      • POST /predict: Receives SMILES and FASTA inputs; returns the computed ΔG, ADMET axes, and stability score.

      • POST /mutate: Accepts residue index, substitution character, and optional PDB data; returns the refined ΔΔG with BLOSUM62 and burial sub-scores.

      • POST /structure: Accepts PDB payload and residue index; returns burial score and neighbourhood count.

      • GET /history: Delivers paginated logs of past prediction events.

      • POST /auth/login: Authenticates users and issues a signed JWT access token.

    3. Security and Cryptographic Overlay

    The platform employs a two-layer security model: stateless session management via JSON Web Tokens (JWT) signed with 256-bit HMAC-SHA256, and credential protection through PBKDF2-HMAC-SHA256 key derivation at 480,000 iterations, in line with NIST SP 800-132 guidelines. All network communication is encrypted using HTTPS over TLS 1.3.
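    The credential-protection layer described above can be illustrated with Python's standard library. This is a minimal sketch, not the platform's actual code: the function names and the 16-byte salt length are assumptions made for illustration, while the hash function and iteration count follow the text.

```python
import hashlib
import hmac
import os

ITERATIONS = 480_000  # iteration count stated in the text
SALT_BYTES = 16       # assumed salt length; not specified in the paper


def hash_password(password: str, iterations: int = ITERATIONS) -> tuple[bytes, bytes]:
    """Return (salt, derived_key) for storage alongside the user record."""
    salt = os.urandom(SALT_BYTES)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes,
                    iterations: int = ITERATIONS) -> bool:
    """Recompute the derived key and compare it in constant time."""
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(key, expected_key)
```

    Storing a fresh random salt per user and comparing with `hmac.compare_digest` avoids precomputed-table attacks and timing side channels respectively.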

  4. WORKFLOW AND MECHANISM

    1. Step 1: Input Ingestion and Validation

      The pipeline accepts two distinct molecular descriptors: a SMILES string encoding the candidate small-molecule compound, and an amino acid sequence in FASTA format representing the pathogenic protein target. Structural validity of the SMILES input is confirmed using RDKit's MolFromSmiles() parser, while the protein sequence undergoes alphabet enforcement via a custom FASTA validator restricted to the canonical 20-residue set. Inputs failing either check are immediately rejected with informative error messages before entering the ML inference stage.

      Fig. 2. End-to-end prediction workflow. SMILES and FASTA inputs are independently featurised, fused into a 1044-D vector, passed through the RF ensemble for ΔG regression, followed by RDKit-based ADMET scoring and optional mutation simulation.
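      The protein-side alphabet check in Step 1 might look like the following sketch. The function name and error messages are hypothetical, and the sketch assumes a single-record FASTA input; the SMILES side would rely on RDKit's parser and is not shown here.

```python
CANONICAL_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def parse_protein_fasta(text: str) -> str:
    """Return the upper-cased sequence from a single-record FASTA string.

    Raises ValueError with a descriptive message on empty input or on any
    character outside the 20 canonical residues.
    """
    lines = [ln.strip() for ln in text.strip().splitlines()]
    # Drop the optional '>' header line and any blank lines.
    sequence = "".join(ln for ln in lines if ln and not ln.startswith(">")).upper()
    if not sequence:
        raise ValueError("FASTA input contains no sequence")
    invalid = sorted(set(sequence) - CANONICAL_RESIDUES)
    if invalid:
        raise ValueError(f"Non-canonical residue(s): {', '.join(invalid)}")
    return sequence
```

      Rejecting the input before featurisation keeps malformed sequences out of the inference stage, as the step requires.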

    2. Step 2: Molecular Featurisation

      A validated SMILES entry is transformed into an ECFP4 circular fingerprint (diameter 4, equivalent to Morgan radius r = 2) using RDKit's AllChem.GetMorganFingerprintAsBitVect() with nBits = 512. The resulting binary vector $\mathbf{f}_L \in \{0,1\}^{512}$ encodes the local atomic neighbourhood up to two bond hops from every atom.

      Each amino acid in the protein sequence is mapped to an integer via an ordinal scheme (A→1 through Y→20, with X→0 for non-standard residues). The encoded sequence is either padded with zeros or clipped to a uniform length of 532, yielding $\mathbf{f}_P \in \mathbb{Z}^{532}$.

      Both vectors are joined by concatenation:

      $$\mathbf{x} = [\,\mathbf{f}_L \,\Vert\, \mathbf{f}_P\,] \in \mathbb{R}^{1044} \tag{1}$$
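      The protein encoding and the concatenation in Equation (1) can be sketched as follows. The alphabetical ordering of the one-letter codes is an assumption consistent with "A→1 through Y→20"; the function names are illustrative, and the ligand fingerprint is taken as a ready-made list of 512 bits rather than computed here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed order: alphabetical one-letter codes
RESIDUE_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}  # A -> 1 ... Y -> 20
MAX_LEN = 532
FP_BITS = 512


def encode_protein(sequence: str, max_len: int = MAX_LEN) -> list[int]:
    """Ordinal-encode a sequence (unknown residues -> 0), then clip or zero-pad."""
    codes = [RESIDUE_TO_INT.get(aa, 0) for aa in sequence.upper()[:max_len]]
    return codes + [0] * (max_len - len(codes))


def build_feature_vector(fingerprint_bits: list[int], sequence: str) -> list[int]:
    """Concatenate a 512-bit ligand fingerprint with the 532-element protein encoding."""
    if len(fingerprint_bits) != FP_BITS:
        raise ValueError(f"expected {FP_BITS} fingerprint bits, got {len(fingerprint_bits)}")
    return list(fingerprint_bits) + encode_protein(sequence)
```

      Note that under this scheme zero-padding and the non-standard residue code share the value 0, so the model cannot distinguish trailing padding from an unknown residue.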

    3. Step 3: Random Forest Inference

      The concatenated 1044-dimensional descriptor x is forwarded to the pre-trained RF ensemble. Individual trees $h_t(\mathbf{x})$ independently regress a pKd value, and the final ensemble estimate is obtained by averaging across all T = 200 decision trees:

      $$\widehat{pK_d} = \frac{1}{T} \sum_{t=1}^{T} h_t(\mathbf{x}) \tag{2}$$

    4. Step 4: Binding Free Energy Calculation

      The ensemble-predicted pKd is converted to a Gibbs binding free energy ΔG (kcal/mol) through a standard thermodynamic transformation evaluated at physiological body temperature (T = 310.15 K):

      $$\Delta G = -RT \ln\left(10^{pK_d}\right) = -2.303\,RT\,pK_d \tag{3}$$

      where R = 1.987 × 10⁻³ kcal mol⁻¹ K⁻¹. Increasingly negative values of ΔG reflect progressively stronger predicted binding interactions.

    5. Step 5: Molecular Stability Scoring

      Five RDKit-derived physicochemical descriptors are evaluated: molecular weight (MW), lipophilicity (LogP), hydrogen bond donor count (HBD), hydrogen bond acceptor count (HBA), and rotatable bond count (RBC). These feed a penalty-based stability scoring function to produce S ∈ [0, 100]:

      $$S = 100 - (P_{MW} + P_{LogP} + P_{Bond}) \tag{4}$$

      where penalty terms are defined as:

      (5) [the definition of $P_{MW}$ is not preserved in this text]

      $$P_{LogP} = \max(0,\ (\mathrm{LogP} - 5) \times 3) \tag{6}$$

      $$P_{Bond} = \max(0,\ (\mathrm{RBC} - 10) \times 2) \tag{7}$$

      Compounds scoring S > 80 are labelled Stable; those in [60, 80] are Moderate; those below 60 are Unstable.

    6. Step 6: ADMET Profiling

      All shortlisted complexes receive scores across five pharmacokinetic dimensions: Absorption (derived from LogP and MW estimates), Distribution (computed from H-bond profile and topological polar surface area, TPSA), Metabolism (CYP450 susceptibility approximated via RBC), Excretion (renal clearance proxy based on MW), and Toxicity (modelled as an inverse drug-likeness index).

    7. Step 7: Physicochemical and Evolutionary Intelligence Engine (Upgraded Mutation Laboratory)

      The Mutation Laboratory has been fundamentally redesigned from a basic sequence-editing tool into a multilayered Physicochemical and Evolutionary Intelligence Engine. The upgraded pipeline executes four tightly coupled analytical stages described in the following subsections and summarised in Algorithm 1.

  5. PHYSICOCHEMICAL AND EVOLUTIONARY INTELLIGENCE ENGINE

    1. Stage 1: BLOSUM62 Evolutionary Impact Scoring

      The first analytical stage queries the BLOSUM62 substitution matrix [22] to evaluate the biological plausibility of a proposed amino acid exchange. BLOSUM62 encodes empirically derived log-odds scores for every pair of amino acid substitutions observed in conserved protein blocks across diverse species, making it the benchmark for quantifying evolutionary conservation.

      For a substitution from wild-type residue a to mutant residue b, the system retrieves the log-odds score:

      $$\sigma = \mathrm{BLOSUM62}(a, b) \tag{8}$$

      A negative score (σ < 0) indicates an evolutionarily improbable exchange, one rarely observed in nature because it disrupts conserved physicochemical properties of the residue. Substitutions with σ < 0 are flagged as High Evolutionary Impact, and an explicit stability penalty is incurred:

      $$P_{evo} = |\sigma| \times 5.0 \quad \text{if } \sigma < 0 \tag{9}$$

      This penalty reflects the well-established biophysical principle that mutations disrupting evolutionarily conserved positions are more likely to perturb the protein fold and remodel the binding pocket geometry. Conversely, substitutions with σ ≥ 0 are labelled Low Evolutionary Impact, and no stability penalty is applied. The normalised evolutionary impact score used downstream is defined as:

      $$I_{evo} = \frac{|\sigma|}{\sigma_{max}} \tag{10}$$

      where $\sigma_{max} = 11$ is the maximum positive BLOSUM62 entry (self-substitution of Tryptophan).

    2. Stage 2: Lightweight 3D Residue Burial Analysis

      The second stage evaluates the topographical context of the mutation site within the protein's three-dimensional structure. When a PDB coordinate file is available, either uploaded by the user or retrieved from the Protein Data Bank [40], the engine parses the Cα atom coordinates and computes a burial score for the target residue i:

      $$B_i = \sum_{j \neq i} \mathbb{1}\left(\lVert \mathbf{r}_i - \mathbf{r}_j \rVert \le 8.0\ \text{Å}\right) \tag{11}$$

      where $\mathbf{r}_i$ and $\mathbf{r}_j$ are the Cα position vectors of residues i and j respectively, and $B_i$ counts the number of neighbouring residues within the 8.0 Å radius shell.

      Residues are classified based on their burial score:

      $$\text{class}(i) = \begin{cases} \text{Buried} & \text{if } B_i > 8 \\ \text{Surface} & \text{otherwise} \end{cases} \tag{12}$$

      A higher structural multiplier is assigned to buried residues because mutations in the protein core are substantially more likely to propagate conformational distortion toward the binding pocket than equivalent substitutions on surface-exposed loops:

      $$M_{struct} = \begin{cases} 1.5 & \text{if } B_i > 8 \\ 1.0 & \text{otherwise} \end{cases} \tag{13}$$

      If no PDB data is supplied, the system defaults to $M_{struct} = 1.0$ and proceeds with sequence-only analysis.
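      The burial analysis of Stage 2 can be sketched with the standard library alone. The function names are hypothetical, the inclusive 8.0 Å cutoff is an assumption (the text says only "within"), and the parser relies on the fixed-column layout of PDB ATOM records; the platform itself is described elsewhere in the paper as using a vectorised NumPy computation.

```python
import math
from typing import Optional

CUTOFF_ANGSTROM = 8.0
BURIED_THRESHOLD = 8  # B_i > 8 is treated as buried, following Algorithm 1

Coord = tuple[float, float, float]


def parse_ca_coords(pdb_text: str) -> list[Coord]:
    """Extract C-alpha coordinates from ATOM records using PDB fixed columns."""
    coords = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16; x, y, z occupy 31-38, 39-46, 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords


def burial_score(ca_coords: list[Coord], i: int, cutoff: float = CUTOFF_ANGSTROM) -> int:
    """Count residues j != i whose C-alpha lies within `cutoff` of residue i."""
    centre = ca_coords[i]
    return sum(
        1
        for j, other in enumerate(ca_coords)
        if j != i and math.dist(centre, other) <= cutoff
    )


def structural_multiplier(b_i: Optional[int]) -> float:
    """Return 1.5 for buried residues, 1.0 for exposed ones or when no structure exists."""
    if b_i is None:
        return 1.0
    return 1.5 if b_i > BURIED_THRESHOLD else 1.0
```

      Passing `None` to `structural_multiplier` mirrors the sequence-only fallback, so callers need no separate branch when a structure is unavailable.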

    3. Stage 3: Physicochemical Refinement of ΔΔG

      The raw ML-predicted ΔΔG is refined by combining the evolutionary and structural signals into a single Refinement Multiplier:

      $$R_{total} = M_{struct} \times (1.0 + I_{evo}) \tag{14}$$

      The final physicochemically refined binding affinity shift is:

      $$\Delta\Delta G_{refined} = \Delta\Delta G_{ML} \times R_{total} + P_{evo} \tag{15}$$

      This formulation ensures that the platform responds to chemical reality even in cases where the ML model returns a near-zero or attenuated delta for a single amino acid substitution, a known limitation of sequence-only regressors when applied to subtle point mutations. The additive penalty $P_{evo}$ from Equation (9) further amplifies the resistance signal for evolutionarily rare substitutions.

    4. Stage 4: Resistance Classification

      The refined value ΔΔG_refined is used as the definitive resistance metric:

      • ΔΔG_refined > +1.0 kcal/mol: flagged as a Resistance Marker.

      • ΔΔG_refined < −0.5 kcal/mol: flagged as a Sensitising Mutation.

      • Otherwise: classified as Neutral.

    5. Holographic 3D Visualisation Interface

      The upgraded Mutation Laboratory is paired with a holographic 3D visualisation module rendered via the 3Dmol.js library. The viewer displays the full protein ribbon diagram and supports:

      • Real-Time Residue Selection: Users click directly on a target residue (e.g., Position 32) in the 3D ribbon to populate the mutation calculator automatically, eliminating manual index entry.

      • Active Structural Link: The viewer is dynamically coupled to the mutation engine. Upon submitting a substitution, the selected residue is highlighted in red (mutant) against blue (wild-type), and the burial neighbourhood sphere is rendered as a translucent 8.0 Å shell.

      • Binding Site Proximity Overlay: The spatial distance between the mutated residue and the ligand centroid is computed and displayed, enabling the researcher to visually assess whether the mutation site lies within the primary binding pocket or a distal regulatory region.

    Fig. 3. Holographic 3D visualisation interface of the upgraded Mutation Laboratory. The ribbon diagram displays the full protein structure with Position 32 selected (highlighted sphere). The 8.0 Å burial neighbourhood shell is rendered in translucent blue. The active structural link dynamically populates the BLOSUM62 and burial scores in the adjacent mutation calculator panel upon residue selection.

    Algorithm 1: Physicochemical & Evolutionary Intelligence Engine

    Require: Wild-type residue a, mutant residue b, position i, raw ΔΔG_ML, optional PDB data

    Ensure: Refined ΔΔG_refined, resistance label

     1: σ ← BLOSUM62(a, b)
     2: I_evo ← |σ| / σ_max
     3: if σ < 0 then
     4:     P_evo ← |σ| × 5.0
     5:     Flag mutation as High Evolutionary Impact
     6: else
     7:     P_evo ← 0
     8: end if
     9: if PDB data available then
    10:     B_i ← count neighbours within 8.0 Å
    11:     M_struct ← 1.5 if B_i > 8 else 1.0
    12: else
    13:     M_struct ← 1.0
    14: end if
    15: R_total ← M_struct × (1.0 + I_evo)
    16: ΔΔG_refined ← ΔΔG_ML × R_total + P_evo
    17: if ΔΔG_refined > +1.0 then
    18:     return ΔΔG_refined, Resistance Marker
    19: else if ΔΔG_refined < −0.5 then
    20:     return ΔΔG_refined, Sensitising Mutation
    21: else
    22:     return ΔΔG_refined, Neutral
    23: end if
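    Algorithm 1 translates almost line for line into Python. The sketch below is illustrative rather than the platform's code: the function name and signature are assumptions, and the BLOSUM62 score is passed in directly instead of being looked up, so no matrix needs to be embedded.

```python
from typing import Optional

SIGMA_MAX = 11            # largest BLOSUM62 entry (Trp -> Trp), as stated in the text
EVO_PENALTY_WEIGHT = 5.0  # kcal/mol per unit of |sigma|, Equation (9)
BURIED_THRESHOLD = 8      # B_i > 8 counts as buried


def refine_ddg(ddg_ml: float, sigma: int, burial: Optional[int] = None) -> tuple[float, str]:
    """Return (refined ddG in kcal/mol, resistance label) for one substitution.

    ddg_ml : raw ML-predicted shift in binding free energy
    sigma  : BLOSUM62 score of the wild-type -> mutant substitution
    burial : neighbour count B_i, or None when no PDB structure is available
    """
    i_evo = abs(sigma) / SIGMA_MAX
    p_evo = abs(sigma) * EVO_PENALTY_WEIGHT if sigma < 0 else 0.0
    m_struct = 1.5 if burial is not None and burial > BURIED_THRESHOLD else 1.0
    r_total = m_struct * (1.0 + i_evo)
    refined = ddg_ml * r_total + p_evo

    if refined > 1.0:
        label = "Resistance Marker"
    elif refined < -0.5:
        label = "Sensitising Mutation"
    else:
        label = "Neutral"
    return refined, label


# Worked example with sigma = -2, B_i = 9 and a raw shift of +1.84 kcal/mol.
refined, label = refine_ddg(1.84, sigma=-2, burial=9)
print(f"{refined:.2f} kcal/mol -> {label}")  # prints: 13.26 kcal/mol -> Resistance Marker
```

    Because the penalty term is additive and at least 5.0 kcal/mol for any negative BLOSUM62 score, it dominates the result whenever the raw shift is small.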

  6. METHODOLOGY

    A. Dataset Preparation

    Training data were sourced from two widely used public repositories: BindingDB [28] and ChEMBL 33 [29]. From these, 500 experimentally confirmed drug–protein binding records were curated by applying the following selection criteria: (i) assay type restricted to Kd or IC50, (ii) SMILES strings successfully parsed by RDKit, and (iii) protein sequence length not exceeding 532 residues. The dataset was partitioned into 400 training samples and 100 test samples using an 80/20 ratio, stratified by pKd quartile to maintain proportional coverage of affinity ranges.

    B. Model Training and Hyperparameter Optimisation

    Model training was conducted using Scikit-learn 1.4 [32] with NumPy [34] handling numerical computations. A systematic hyperparameter search was carried out via 5-fold cross-validated GridSearchCV across: n_estimators ∈ {100, 200, 300}, max_features ∈ {sqrt, log2, 0.5}, and min_samples_leaf ∈ {1, 2, 4}. The best-performing configuration used n_estimators = 200, max_features = sqrt, and min_samples_leaf = 1. Feature relevance scores were derived through mean decrease in impurity (MDI) aggregated across all trees.

    C. BLOSUM62 Matrix Integration

    The BLOSUM62 matrix was loaded as a symmetric 20 × 20 lookup table indexed by the standard single-letter amino acid codes. For each mutation query (a → b), the log-odds score is retrieved in O(1) time. The matrix is pre-loaded into server memory at application startup to eliminate per-request I/O overhead, ensuring that evolutionary impact scoring adds negligible latency (<2 ms per query) to the overall inference pipeline.

    D. Burial Score Computation

    PDB coordinate parsing is performed using a lightweight Cα-only parser that extracts ATOM records for backbone α-carbons without loading side-chain or solvent atoms, reducing memory consumption by approximately 70% compared to full-atom parsers. The 8.0 Å neighbourhood search is implemented as a vectorised NumPy distance computation across the Cα coordinate matrix, completing in under 15 ms for proteins up to 1000 residues.

    E. Evaluation Metrics

    Three complementary regression metrics were employed for performance assessment:

    $$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{16}$$

    $$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{17}$$

    $$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \tag{18}$$

    where $y_i$ denotes the ground-truth experimental pKd, $\hat{y}_i$ is the corresponding RF prediction, and $\bar{y}$ is the mean of the experimental values.
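    The three regression metrics used for evaluation in this section (RMSE, R² and MAE) follow their standard definitions and can be computed as in the sketch below. The function name is illustrative; the study itself reports using Scikit-learn, whose metric functions would serve equally well.

```python
import math


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """Compute RMSE, R^2 and MAE for paired experimental and predicted pKd values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,  # undefined if all y_true values are identical
        "mae": sum(abs(r) for r in residuals) / n,
    }
```

    RMSE penalises large individual errors more heavily than MAE, which is why reporting both gives a fuller picture of the error distribution.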

    Algorithm 2: Adaptive Affinity Module Core Inference Pipeline

    Require: SMILES string s, FASTA sequence q

    Ensure: ΔG, Stability Score S, ADMET profile a

     1: Validate s using RDKit.MolFromSmiles(s)
     2: f_L ← ECFP4(s, radius=2, bits=512)
     3: Validate q against 20-residue amino acid alphabet
     4: f_P ← IntEncode(q, maxLen=532)
     5: x ← [f_L ∥ f_P]
     6: pKd ← RF.predict(x)
     7: ΔG ← −RT ln(10^pKd)
     8: S ← 100 − (P_MW + P_LogP + P_Bond)
     9: a ← ComputeADMET(s)
    10: return ΔG, S, a
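    Two of the pipeline's post-processing steps, the pKd-to-free-energy conversion and the stability labelling, are simple enough to sketch directly. The function names are illustrative; the constants R and T and the class thresholds are taken from the text, and the conversion assumes the standard relation ΔG = −RT ln(10) · pKd.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1, as given in the text
T_BODY = 310.15    # physiological temperature in K


def pkd_to_delta_g(pkd: float, temperature: float = T_BODY) -> float:
    """Convert pKd to binding free energy: dG = -RT ln(10^pKd) = -RT ln(10) * pKd."""
    return -R_KCAL * temperature * math.log(10.0) * pkd


def stability_label(score: float) -> str:
    """Map a stability score S in [0, 100] to the three classes used in the text."""
    if score > 80:
        return "Stable"
    if score >= 60:
        return "Moderate"
    return "Unstable"
```

    At 310.15 K each pKd unit corresponds to roughly 1.42 kcal/mol, so a pKd near 6.64 maps to about −9.42 kcal/mol.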

  7. EXPERIMENTAL RESULTS AND DISCUSSION

    1. Quantitative Model Performance

      Table II reports evaluation metrics obtained on the withheld 100-sample test partition.

      TABLE II

      Quantitative Performance on 100-Sample Test Set

      | Metric                | Proposed RF | Baseline (DeepDTA) |
      |-----------------------|-------------|--------------------|
      | RMSE                  | 0.62        | 0.79               |
      | R²                    | 0.89        | 0.74               |
      | MAE                   | 0.48        | 0.61               |
      | Inference Latency (s) | 1.2         | 8.4                |
      | Training Time (min)   | 4.3         | 47.2               |
      | ADMET Integration     | Yes         | No                 |
      | Evo. Intelligence     | Yes         | No                 |
      | Mutation Analysis     | Yes         | No                 |

    2. Comparison with State-of-the-Art Methods

      Table III benchmarks the proposed model against established DTI methods evaluated on the Davis and BindingDB datasets.

      TABLE III

      Comparison with State-of-the-Art DTI Prediction Methods

      | Method                 | RMSE | R²   | Latency  | ADMET |
      |------------------------|------|------|----------|-------|
      | AutoDock Vina [7]      | 0.81 | 0.74 | >3600 s  | No    |
      | DeepDTA [10]           | 0.79 | 0.74 | 8.4 s    | No    |
      | GraphDTA [15]          | 0.71 | 0.82 | 6.1 s    | No    |
      | TransformerCPI [11]    | 0.68 | 0.85 | 11.2 s   | No    |
      | AttentionDTA [13]      | 0.74 | 0.80 | 7.8 s    | No    |
      | RF-Score [23]          | 0.77 | 0.77 | 3.2 s    | No    |
      | Proposed (RF-1044+Evo) | 0.62 | 0.89 | 1.2 s    | Yes   |

    3. Evolutionary Intelligence Validation

      To validate the BLOSUM62 integration, two representative mutation scenarios were evaluated on the MRSA PBP2a target.

      Case A: Rare Substitution (Cys → Pro). BLOSUM62 score σ = −3 (negative, evolutionarily improbable).

      $I_{evo} = 3/11 = 0.273$, $P_{evo} = 15.0$ kcal/mol

      $M_{struct} = 1.5$ (buried, $B_i = 11$)

      $R_{total} = 1.5 \times 1.273 = 1.91$

      $\Delta\Delta G_{refined} = \Delta\Delta G_{ML} \times 1.91 + 15.0$

      This substitution is classified as a High-Impact Resistance Marker.

      Case B: Conservative Substitution (Ile → Val). BLOSUM62 score σ = +3 (positive, evolutionarily tolerated).

      $I_{evo} = 3/11 = 0.273$, $P_{evo} = 0$

      $M_{struct} = 1.0$ (surface, $B_i = 4$)

      $R_{total} = 1.0 \times 1.273 = 1.273$

      $\Delta\Delta G_{refined} = \Delta\Delta G_{ML} \times 1.273$

      This substitution produces a moderate refined delta, consistent with its known minimal clinical resistance impact.

    4. Mutation Resistance Case Study (G2447T)

      The Mutation Laboratory was validated using the clinically significant MRSA PBP2a target. Introducing the G2447T point mutation into the protein sequence produced:

      ΔG_wild-type = −9.42 kcal/mol

      ΔG_mutant = −7.58 kcal/mol

      ΔΔG_ML = +1.84 kcal/mol

      With BLOSUM62 score σ = −2 (Gly → Thr, evolutionarily improbable), giving $P_{evo} = 2 \times 5.0 = 10.0$ kcal/mol, and burial score $B_i = 9$ (buried core residue):

      $R_{total} = 1.5 \times (1 + 2/11) = 1.773$

      $\Delta\Delta G_{refined} = 1.84 \times 1.773 + 10.0 = 13.26$ kcal/mol

      The elevated refined score reinforces classification as a Resistance Marker, consistent with the established clinical mechanism of methicillin resistance via PBP2a active-site remodelling [41].

    5. Structural Intelligence Analysis

    The concept of structural intelligence, as operationalised in this platform, refers to its capacity to expose which chemical substructures and sequence regions most strongly govern binding affinity. MDI-based feature importance scores from the RF ensemble highlight aromatic systems and nitrogen-rich heterocyclic ECFP4 bits as dominant contributors, aligning well with established pharmacophoric knowledge for kinase inhibitors and DNA gyrase-targeting compounds.

    Fig. 4. Scatter plot of predicted vs. experimental pKd values on the 100-sample test set. The red dashed line denotes the ideal y = x regression. The RF ensemble achieves R² = 0.89 and RMSE = 0.62.

    Fig. 5. Five-axis ADMET radar chart rendered in the React frontend for a representative Ciprofloxacin–GyrA complex prediction.

    Fig. 6. Top-20 feature importances derived from mean decrease in impurity (MDI) across the 200-tree RF ensemble.

    Fig. 7. Structural intelligence panel from the React dashboard showing per-substructure feature contribution scores for a representative fluoroquinolone ligand.

    G. ADMET Analysis: Ciprofloxacin Case Study

    Ciprofloxacin, a fluoroquinolone antibiotic acting on the GyrA subunit of DNA gyrase, was selected to validate the ADMET module. The platform assigned a stability score of S = 87.4% and confirmed compliance with all five extended Lipinski criteria: MW = 331.3 Da (<500), LogP = 0.28 (<5), HBD = 2 (<5), HBA = 8 (≤10), RBC = 3 (<10).

  8. LIMITATIONS AND FUTURE SCOPE

    A. Current Limitations

    Two-Dimensional Feature Representation: Both the ECFP4 fingerprint and the ordinal protein encoding operate exclusively at the 2D structural and linear sequence levels, and therefore cannot capture 3D conformational dynamics or allosteric modulation events.

    Epistatic Mutation Modelling: The current engine evaluates each residue substitution independently. Multi-site epistatic interactions, where co-occurring mutations produce non-additive effects, are not yet modelled; this is a priority for future development.

    BLOSUM62 Scope: BLOSUM62 was derived from global protein sequence alignments and may not perfectly capture domain-specific substitution tolerances in antibiotic target proteins. Domain-specific substitution matrices could improve accuracy.

    Burial Score PDB Dependency: The 3D burial analysis requires an available PDB structure. For novel or uncharacterised proteins, AlphaFold2-predicted structures must be used instead, which may introduce coordinate inaccuracies.

    B. Future Work

    AlphaFold2 Structure Integration: Automatically incorporating predicted 3D coordinates from AlphaFold2 [39] for all mutation queries would eliminate PDB availability as a limitation and provide consistent structural context across all targets.

    Domain-Specific Substitution Matrices: Replacing BLOSUM62 with antibiotic-target-specific substitution matrices derived from curated resistance mutation databases (e.g., CARD, ResFinder) could sharpen evolutionary impact predictions for AMR-relevant proteins.

    Multi-Site Epistasis Modelling: Extending the engine to compute a joint $\Delta\Delta G$ for combinations of mutations would capture epistatic resistance pathways that single-residue analysis misses.

    Expanded Benchmark Datasets: Training on larger corpora such as PDBbind v2020 (approximately 19,000 complexes) or the full BindingDB collection (over 2.8 million data points) is expected to improve generalisation and reduce RMSE.

  9. CONCLUSION

This paper described the design and evaluation of an integrated full-stack AI platform for drug-pathogen molecular interaction analysis. The system addresses three key weaknesses of existing DTI tools (excessive latency, disconnected output reporting, and lack of explainability) by unifying a 1044-dimensional feature fusion scheme, a calibrated Random Forest regressor, live ADMET assessment, and an upgraded Physicochemical and Evolutionary Intelligence Engine into one coherent pipeline.

The central upgrade, the integration of BLOSUM62 evolutionary impact scoring, lightweight 3D residue burial analysis, and a context-aware physicochemical refinement formula, extends the platform beyond standard machine learning by adding structural and evolutionary context to each prediction. The refinement multiplier $R_{\text{total}} = M_{\text{struct}} \times (1.0 + I_{\text{evo}})$ grounds mutation resistance predictions in both protein structural biology and evolutionary conservation, producing biologically meaningful $\Delta\Delta G_{\text{refined}}$ estimates even where the base ML regressor returns attenuated signals.

End-to-end predictions are completed in approximately 1.2 seconds, representing a speedup of roughly three orders of magnitude over AutoDock Vina, while attaining $R^2 = 0.89$ and RMSE = 0.62. The security-hardened deployment with JWT-based session control and PBKDF2-HMAC-SHA256 credential protection makes the platform suitable for collaborative multi-user research settings.

ACKNOWLEDGMENT

The authors acknowledge the Department of Information Science and Engineering, SDM College of Engineering and Technology, Dharwad, India, for providing the computational resources and institutional support that made this research possible. Gratitude is also extended to the developer communities maintaining RDKit, Scikit-learn, React, and Flask.

REFERENCES

  1. World Health Organization, "Antimicrobial resistance: Global report on surveillance," WHO Press, Geneva, Switzerland, Tech. Rep., 2019.

  2. C. L. Ventola, "The antibiotic resistance crisis: Part 1: Causes and threats," Pharmacy and Therapeutics, vol. 40, no. 4, pp. 277–283, 2015.

  3. R. Laxminarayan et al., "Antibiotic resistance-the need for global solutions," Lancet, vol. 382, no. 9912, pp. 1057–1098, 2013.

  4. J. A. DiMasi, H. G. Grabowski, and R. W. Hansen, "Innovation in the pharmaceutical industry: New estimates of R&D costs," J. Health Econ., vol. 47, pp. 20–33, 2016.

  5. S. M. Paul et al., "How to improve R&D productivity: The pharmaceutical industry's grand challenge," Nat. Rev. Drug Discov., vol. 9, no. 3, pp. 203–214, 2010.

  6. O. Trott and A. J. Olson, "AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading," J. Comput. Chem., vol. 31, no. 2, pp. 455–461, 2010.

  7. G. M. Morris et al., "AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility," J. Comput. Chem., vol. 30, no. 16, pp. 2785–2791, 2009.

  8. R. A. Friesner et al., "Glide: A new approach for rapid, accurate docking and scoring," J. Med. Chem., vol. 47, no. 7, pp. 1739–1749, 2004.

  9. H. Öztürk, A. Özgür, and E. Ozkirimli, "DeepDTA: Deep drug-target binding affinity prediction," Bioinformatics, vol. 34, no. 17, pp. i821–i829, 2018.

  10. L. Chen et al., "TransformerCPI: Improving compound-protein interaction prediction," Bioinformatics, vol. 36, no. 16, pp. 4406–4414, 2020.

  11. T. Zhao, J. Hu, and P. J. Jiang, "SCTDTI: Predicting drug-target interactions via Siamese CNN," IEEE J. Biomed. Health Inform., vol. 26, no. 10, pp. 5099–5108, 2022.

  12. Q. Zhao et al., "AttentionDTA: Drug-target binding affinity prediction with attention," in Proc. IEEE BIBM, Seoul, South Korea, 2020, pp. 64–69.

  13. I. Lee, J. Keum, and H. Nam, "DeepConv-DTI: Prediction of drug-target interactions via deep learning," PLOS Comput. Biol., vol. 15, no. 6, p. e1007129, 2019.

  14. T. Nguyen et al., "GraphDTA: Predicting drug-target binding affinity with graph neural networks," Bioinformatics, vol. 37, no. 8, pp. 1140–1147, 2021.

  15. M. Jiang et al., "Drug-target affinity prediction using graph neural network and contact maps," RSC Advances, vol. 10, no. 35, pp. 20701–20712, 2020.

  16. J. Lim et al., "Molecular generative model based on conditional variational autoencoder," J. Cheminform., vol. 10, no. 1, p. 31, 2018.

  17. J. Gilmer et al., "Neural message passing for quantum chemistry," in Proc. ICML, Sydney, Australia, 2017, pp. 1263–1272.

  18. A. Vaswani et al., "Attention is all you need," in Proc. NeurIPS, Long Beach, CA, USA, 2017, pp. 5998–6008.

  19. B. Fabian et al., "Molecular representation learning with language models," arXiv preprint arXiv:2011.13230, 2020.

  20. S. Chithrananda, G. Grand, and B. Ramsundar, "ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction," arXiv preprint arXiv:2010.09885, 2020.

  21. S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proc. Natl. Acad. Sci. USA, vol. 89, no. 22, pp. 10915–10919, 1992.

  22. P. J. Ballester and J. B. O. Mitchell, "A machine learning approach to predicting protein-ligand binding affinity," Bioinformatics, vol. 26, no. 9, pp. 1169–1175, 2010.

  23. L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.

  24. V. Svetnik et al., "Random forest: A classification and regression tool for QSAR modeling," J. Chem. Inf. Comput. Sci., vol. 43, no. 6, pp. 1947–1958, 2003.

  25. D. Rogers and M. Hahn, "Extended-connectivity fingerprints," J. Chem. Inf. Model., vol. 50, no. 5, pp. 742–754, 2010.

  26. H. L. Morgan, "The generation of a unique machine description for chemical structures," J. Chem. Doc., vol. 5, no. 2, pp. 107–113, 1965.

  27. T. Liu et al., "BindingDB: A web-accessible database of protein-ligand binding affinities," Nucleic Acids Res., vol. 35, pp. D198–D201, 2007.

  28. A. Gaulton et al., "The ChEMBL database in 2017," Nucleic Acids Res., vol. 45, no. D1, pp. D945–D954, 2017.

  29. R. Wang et al., "The PDBbind database," J. Med. Chem., vol. 47, no. 12, pp. 2977–2980, 2004.

  30. M. I. Davis et al., "Comprehensive analysis of kinase inhibitor selectivity," Nat. Biotechnol., vol. 29, no. 11, pp. 1046–1051, 2011.

  31. F. Pedregosa et al., "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.

  32. G. Landrum, "RDKit: Open-source cheminformatics software," [Online]. Available: https://www.rdkit.org, 2023.

  33. C. R. Harris et al., "Array programming with NumPy," Nature, vol. 585, no. 7825, pp. 357–362, 2020.

  34. C. A. Lipinski, "Drug-like properties and the causes of poor solubility and poor permeability," J. Pharmacol. Toxicol. Methods, vol. 44, no. 1, pp. 235–249, 2000.

  35. A. Daina, O. Michielin, and V. Zoete, "SwissADME: A free web tool to evaluate pharmacokinetics and drug-likeness," Sci. Rep., vol. 7, p. 42717, 2017.

  36. D. E. V. Pires, T. L. Blundell, and D. B. Ascher, "pkCSM: Predicting small-molecule pharmacokinetic and toxicity properties," J. Med. Chem., vol. 58, no. 9, pp. 4066–4072, 2015.

  37. D. F. Veber et al., "Molecular properties that influence oral bioavailability," J. Med. Chem., vol. 45, no. 12, pp. 2615–2623, 2002.

  38. J. Jumper et al., "Highly accurate protein structure prediction with AlphaFold," Nature, vol. 596, pp. 583–589, 2021.

  39. H. M. Berman et al., "The Protein Data Bank," Nucleic Acids Res., vol. 28, no. 1, pp. 235–242, 2000.

  40. J. F. Fisher, S. O. Meroueh, and S. Mobashery, "Bacterial resistance to β-lactam antibiotics," Chem. Rev., vol. 105, no. 2, pp. 395–424, 2005.

  41. J. M. A. Blair et al., "Molecular mechanisms of antibiotic resistance," Nat. Rev. Microbiol., vol. 13, no. 1, pp. 42–51, 2015.

  42. J. M. Munita and C. A. Arias, "Mechanisms of antibiotic resistance," Microbiol. Spectr., vol. 4, no. 2, p. VMBF-0016-2015, 2016.

  43. M. Grinberg, Flask Web Development, 2nd ed. Sebastopol, CA, USA: O'Reilly Media, 2018.

  44. A. Banks and E. Porcello, Learning React. Sebastopol, CA, USA: O'Reilly Media, 2017.

  45. M. J. Abraham et al., "GROMACS: High performance molecular simulations," SoftwareX, vol. 1–2, pp. 19–25, 2015.

  46. M. Tsubaki, K. Tomii, and J. Sese, "Compound-protein interaction prediction with end-to-end learning," Bioinformatics, vol. 35, no. 2, pp. 309–318, 2019.

  47. F. Wan et al., "NeoDTI: Neural integration of neighbor information for DTI discovery," Bioinformatics, vol. 35, no. 1, pp. 104–111, 2019.

  48. S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proc. NeurIPS, Long Beach, CA, USA, 2017, pp. 4765–4774.

  49. P. E. Pope et al., "Explainability methods for graph convolutional neural networks," in Proc. CVPR, Long Beach, CA, USA, 2019, pp. 10772–10781.

  50. D. B. Kitchen et al., "Docking and scoring in virtual screening for drug discovery," Nat. Rev. Drug Discov., vol. 3, no. 11, pp. 935–949, 2004.

  51. E. Lionta et al., "Structure-based virtual screening for drug discovery," Curr. Top. Med. Chem., vol. 14, no. 16, pp. 1923–1938, 2014.