
NADARAI – Redefining Anti-Doping through Artificial Intelligence

DOI : https://doi.org/10.5281/zenodo.18889724

Anushka Sable, Srujan Bele, Vedika Gawner, Bhavesh Thakare, Siddhesh Gadewar, Prof. Shubhangi Gulhane

Department of Computer Science & Engineering, P. R. Pote Patil College of Engineering & Management, Amravati – 444602, Maharashtra, India

Abstract – Manual anti-doping analysis remains a significant bottleneck in modern sports governance. Reviewing athlete biological data, managing test records, and identifying suspicious patterns through human inspection is slow, inconsistent, and difficult to scale. This paper presents NADARAI, a web-based anti-doping intelligence platform that integrates rule-based analysis and machine learning to automate risk detection in athlete biological data. The proposed system accepts athlete test records via CSV upload, applies deterministic screening rules inspired by commonly reported anti-doping practice and ABP literature, and then passes the data through an Isolation Forest-based unsupervised anomaly detection model. A composite risk score is computed by normalizing and combining rule violation flags with machine learning anomaly scores, enabling ranked prioritization of athlete cases for review. The platform is implemented using React for the frontend, FastAPI for the backend, and SQLite with SQLAlchemy for persistent storage, with OAuth2/JWT-based authentication and Role-Based Access Control (RBAC) for secure multi-user access. Testing was performed using synthetic datasets constructed to approximate typical physiological ranges reported in anti-doping and ABP literature [2]. On a 200-record synthetic dataset, the system assigned High risk to 27 of 30 injected suspicious profiles (90.0%) and Low risk to 163 of 170 normal profiles (95.9%), and the interactive dashboard enables analysts to visualize trends and manage cases efficiently. NADARAI represents a functional prototype that establishes a practical foundation for AI-assisted anti-doping operations.

Index Terms – Anomaly Detection, Anti-Doping, Isolation Forest, Machine Learning, Risk Scoring, Web Application

  1. INTRODUCTION

    Anti-doping organizations worldwide face an escalating challenge: the volume of athlete biological testing data grows every year, while the resources available for manual review remain constrained. Agencies such as the World Anti-Doping Agency (WADA) and India's National Anti-Doping Agency (NADA) operate complex testing programmes that generate thousands of records annually, spanning hematological parameters, hormonal profiles, competition histories, and whereabouts data [1]. Traditional anti-doping workflows rely on trained personnel to visually inspect these records, compare values against fixed thresholds, and escalate suspicious cases for laboratory analysis. This process is inherently time-consuming, subject to human error, and difficult to audit systematically.

    The Athlete Biological Passport (ABP), introduced by WADA to provide longitudinal monitoring of each athlete's biological markers, has improved detection sensitivity compared to single-point thresholds [2]. However, the ABP still depends on expert interpretation, requires extended tracking periods before reliable baselines are established, and does not provide automated triage or risk ranking of cases. As a result, suspected cases may remain unreviewed for extended periods, reducing the deterrent effect of the anti-doping programme.

    Artificial intelligence and machine learning offer a compelling solution to these limitations. ML models can process large, heterogeneous datasets at scale, identify subtle statistical deviations from expected physiological ranges, and generate quantitative risk indicators that prioritize cases for expert review [9]. In particular, unsupervised anomaly detection methods are well-suited to anti-doping contexts because they do not require a labelled dataset of confirmed doping cases – a resource that is scarce due to the rarity of confirmed positives and legal restrictions on data sharing [7].

    This paper presents NADARAI (National Anti-Doping AI), a web-based platform that brings these capabilities together in a single integrated system. NADARAI combines rule-based screening checks inspired by anti-doping literature with an Isolation Forest anomaly detection model to generate a composite risk score for each athlete. The scores are displayed through an interactive dashboard and supported by role-based case management tools.

    The main contributions of this paper are:

    • A complete system design for an AI-assisted anti-doping web platform covering data ingestion, rule-based analysis, ML-based anomaly detection, and risk score generation.
    • A hybrid scoring methodology that combines deterministic rule violation flags with unsupervised ML anomaly scores into a single normalized risk metric.
    • A functional prototype implementation using modern web technologies (React, FastAPI, SQLite) with JWT-based authentication and RBAC.
    • Experimental validation using synthetic datasets constructed to approximate typical physiological ranges and screening patterns reported in anti-doping and ABP literature.

    The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed system architecture. Section 4 details the methodology. Section 5 covers implementation details. Section 6 describes the experimental setup. Section 7 presents results and discussion. Section 8 concludes the paper.

  2. RELATED WORK
    1. AI in Athlete Performance Monitoring

      The application of artificial intelligence to athlete data has expanded considerably over the past decade. Chmait and Westerbeek [9] provide a broad survey of ML techniques applied in sport, noting that classification, regression, and clustering methods have been used across domains including injury prediction, performance forecasting, and physiological monitoring. Their analysis highlights that while technical capability is well-established, practical adoption in regulatory contexts – particularly anti-doping – remains limited due to data access and interpretability barriers.

    2. ML Techniques in Doping Detection

      Rahman et al. [4] investigated indirect ML-based detection of erythropoietin (EPO) doping using blood parameters, comparing Random Forest, XGBoost, and SVM classifiers on samples collected at sea level and altitude. Their work demonstrates that ensemble ML methods can detect doping-associated physiological patterns with performance comparable to or exceeding direct laboratory methods, at substantially lower cost. Ryoo et al. [5] demonstrated that AI-driven analysis of athlete performance passports using XGBoost and MLP models can identify doping suspicions with an AUC of 0.790 and F1 of 0.621 on a dataset of 17,058 weightlifting records, confirming that performance-based data can carry useful anomaly signals. Yang et al. [6] extended detection further to non-targeted metabolomics, combining ML with untargeted metabolic profiling to detect novel doping agents not covered by existing prohibited substance lists.

    3. ABP-Based Approaches

      The Athlete Biological Passport remains the regulatory gold standard for longitudinal monitoring. Sottas et al. [2] describe the statistical framework underlying the hematological module of the ABP, which uses Bayesian adaptive models to construct individualized reference intervals. Robinson et al. [3] outline the forensic evidence framework applied when an adverse passport finding is issued, noting that the model's low false-positive rate is essential for legal defensibility. Krumm et al. [10] identify opportunities to enhance the ABP by integrating emerging biomarkers and ML-based pattern recognition, particularly for detection of micro-dosing strategies that keep marker values within established limits.

    4. Limitations of Existing Methods

      Despite these advances, existing approaches share common limitations. Supervised ML methods require labelled doping examples that are rare and not publicly available [6]. ABP-based models require months of baseline data collection before producing reliable individualized thresholds [2]. The Isolation Forest method of Liu et al. [7] provides an effective unsupervised alternative but has not been integrated into a complete operational anti-doping platform with case management, dashboard visualization, and user access control.

    5. How NADARAI Differs

      NADARAI addresses these gaps by combining a rule-based screening layer with Isolation Forest anomaly detection in a single end-to-end platform. Unlike standalone research models, NADARAI provides a complete operational workflow: CSV data ingestion, preprocessing, hybrid analysis, risk score storage, interactive dashboard, and role-based case review. This makes it directly usable as a decision-support tool by anti-doping analysts rather than requiring custom integration of separate research components.

  3. SYSTEM OVERVIEW
    1. Overall Architecture

      NADARAI follows a three-tier client-server architecture comprising a React-based frontend, a FastAPI backend, and an SQLite database, with an embedded Python-based analysis engine. Fig. 1 presents the overall operational workflow and system components of NADARAI.

      Frontend (React/Vite): The user interface is built with React 18 and Vite, using Recharts for interactive data visualization. It provides role-specific dashboards, CSV upload interfaces, case review screens, and analytical result displays. The frontend communicates with the backend exclusively via RESTful JSON APIs.

      Backend (FastAPI/Uvicorn): The backend is implemented in Python using FastAPI with Uvicorn as the ASGI server. It exposes RESTful endpoints for authentication, athlete data management, CSV upload and parsing, analysis execution, risk score retrieval, and case management. Request validation is handled through Pydantic models, ensuring data integrity at the API layer.

      Database (SQLite/SQLAlchemy): Athlete records, test data, analysis results, risk scores, and case assignments are stored in an SQLite database accessed through the SQLAlchemy ORM. SQLite was chosen for its zero-configuration setup suitable for a prototype, with the architecture designed to allow migration to PostgreSQL for production deployment.

      Analysis Engine: The analysis engine is a Python module embedded in the backend that performs rule-based checks and Isolation Forest anomaly detection on uploaded athlete data, stores the results, and computes final risk scores.
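
      To illustrate the storage tier described above, the following is a minimal sketch of a possible relational layout. It uses the standard-library `sqlite3` module as a stand-in for the SQLAlchemy ORM, and every table name, column name, and helper function (`open_database`, `top_risk`) is a hypothetical choice for illustration rather than the prototype's actual schema. Case assignments are omitted for brevity.

```python
import sqlite3

# Hypothetical schema: three tables linking athletes, their tests, and risk scores.
SCHEMA = """
CREATE TABLE athletes (
    id INTEGER PRIMARY KEY,
    external_id TEXT NOT NULL UNIQUE
);
CREATE TABLE test_records (
    id INTEGER PRIMARY KEY,
    athlete_id INTEGER NOT NULL REFERENCES athletes(id),
    test_date TEXT NOT NULL,
    hb REAL, hct REAL, ret_pct REAL, te_ratio REAL
);
CREATE TABLE risk_scores (
    record_id INTEGER PRIMARY KEY REFERENCES test_records(id),
    rule_score INTEGER NOT NULL,
    anomaly_score REAL NOT NULL,
    risk REAL NOT NULL
);
"""


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open an SQLite database and create the illustrative tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def top_risk(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Return (external_id, test_date, risk) rows ordered by descending risk."""
    query = """
        SELECT a.external_id, t.test_date, s.risk
        FROM risk_scores AS s
        JOIN test_records AS t ON t.id = s.record_id
        JOIN athletes AS a ON a.id = t.athlete_id
        ORDER BY s.risk DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()
```

      Because the sketch uses only standard SQL types and joins, the same layout should carry over to PostgreSQL with little change, which is the migration path the architecture anticipates.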

    2. User Roles and Security

      NADARAI implements Role-Based Access Control (RBAC) with two primary user roles:

      Admin: Full access to system configuration, user management, and all athlete records. Admins can assign cases to analysts and modify system-wide analysis parameters.

      Analyst: Access to upload data, run analysis, view dashboards, and review assigned cases. Analysts cannot modify other users' accounts or system configuration.

      Authentication is implemented using OAuth2 with JWT (JSON Web Token) bearer tokens. Passwords are hashed using BCrypt before storage. Token expiry and role-based endpoint guards are enforced at the API layer, ensuring that unauthorized access to sensitive athlete data is prevented.

      Fig. 1. Operational workflow and technology stack of the NADARAI platform. Thresholds shown in the rule-based checks are prototype screening heuristics for synthetic-data experimentation and do not represent official ABP decision limits. Isolation Forest anomaly scores are normalized within each uploaded batch before computing the final composite risk score.

    3. Workflow

      The end-to-end operational workflow of NADARAI is as follows:

      1. Login: User authenticates via the login screen; the backend issues a signed JWT token.
      2. CSV Upload: The analyst uploads a CSV file containing athlete test records.
      3. Data Storage: The backend parses, validates, and stores the records in the database.
      4. Analysis: The analysis engine applies rule-based checks and runs Isolation Forest on the stored records.
      5. Risk Score Generation: A composite risk score is computed for each athlete and saved.
      6. Dashboard: The analyst views risk scores, trends, and flagged cases through the interactive dashboard.
      7. Case Review: High-risk cases are escalated, assigned to analysts, and tracked through the case management module.
  4. METHODOLOGY
    1. Data Preparation

      Athlete test records are provided as CSV files with columns including athlete ID, test date, hemoglobin (Hb), hematocrit (Hct), reticulocyte percentage (%Ret), testosterone level (T), testosterone-to-epitestosterone ratio (T/E), and test frequency within a rolling 30-day window.

      On upload, the backend performs the following preprocessing steps: (i) column name normalization and type casting; (ii) removal of records with missing mandatory fields; (iii) clipping of extreme outliers beyond five standard deviations of the population mean, to prevent single corrupted records from distorting the anomaly detection model; (iv) chronological sorting of records per athlete to enable temporal analysis.
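
      The four preprocessing steps can be sketched as below. This is an illustrative plain-Python version (the prototype's stack lists Pandas for data processing); the column names in `MANDATORY` and `NUMERIC` and the function `preprocess` are assumptions, and only a subset of the CSV columns is handled to keep the example short.

```python
import csv
import io
import statistics
from datetime import date

# Assumed normalized column names; the real CSV header may differ.
MANDATORY = ("athlete_id", "test_date", "hb", "hct", "ret_pct", "te_ratio")
NUMERIC = ("hb", "hct", "ret_pct", "te_ratio")


def preprocess(csv_text: str) -> list:
    """Apply steps (i)-(iv) to raw CSV text and return cleaned record dicts."""
    records = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        # (i) normalize column names: trim, lowercase, spaces to underscores
        row = {
            key.strip().lower().replace(" ", "_"): (value or "").strip()
            for key, value in raw.items()
            if key is not None
        }
        # (ii) drop records with a missing mandatory field
        if any(not row.get(col) for col in MANDATORY):
            continue
        try:
            # (i) type casting: ISO dates and floating-point marker values
            row["test_date"] = date.fromisoformat(row["test_date"])
            for col in NUMERIC:
                row[col] = float(row[col])
        except ValueError:
            continue  # unparseable values are treated like missing ones
        records.append(row)

    # (iii) clip each numeric column to mean +/- 5 population standard deviations
    for col in NUMERIC:
        values = [rec[col] for rec in records]
        if len(values) < 2:
            continue
        mean = statistics.fmean(values)
        spread = 5 * statistics.pstdev(values)
        for rec in records:
            rec[col] = min(max(rec[col], mean - spread), mean + spread)

    # (iv) chronological order within each athlete
    records.sort(key=lambda rec: (rec["athlete_id"], rec["test_date"]))
    return records
```

      One consequence of step (iii) is worth noting: because the corrupted value itself inflates the standard deviation, a single record can only exceed five standard deviations when the batch is reasonably large (a few dozen records or more), so clipping has little effect on very small uploads.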

    2. Rule-Based Analysis

      The rule-based module applies four deterministic checks to each test record, each producing a binary flag (0 = pass, 1 = fail). These rules are implemented as prototype screening heuristics for synthetic-data experimentation and do not represent official ABP decision limits:

      R1 – Hemoglobin Screening Heuristic (Prototype):

      $$ f_{Hb} = \begin{cases} 1 & \text{if } \mathrm{Hb} > 17.0\ \text{g/dL} \\ 0 & \text{otherwise} \end{cases} \tag{1} $$

      R2 – Testosterone Threshold:

      $$ f_{T/E} = \begin{cases} 1 & \text{if T/E ratio} > 4.0 \\ 0 & \text{otherwise} \end{cases} \tag{2} $$

      R3 – Reticulocyte Change:

      $$ f_{ret} = \begin{cases} 1 & \text{if } \Delta\%\mathrm{Ret} > 1.5\% \text{ between consecutive tests} \\ 0 & \text{otherwise} \end{cases} \tag{3} $$

      R4 – Test Frequency:

      $$ f_{freq} = \begin{cases} 1 & \text{if tests in 30 days} < 2 \\ 0 & \text{otherwise} \end{cases} \tag{4} $$

      The total rule score for a record is the sum of all flags:

      $$ S_{rule} = f_{Hb} + f_{T/E} + f_{ret} + f_{freq} \tag{5} $$

      This score ranges from 0 to 4, where higher values indicate greater rule-based suspicion.

    3. Machine Learning Model – Isolation Forest

      Isolation Forest [7] was selected as the ML anomaly detection model for the following reasons: (i) it is unsupervised and requires no labelled doping examples; (ii) it scales linearly with dataset size; (iii) it is robust to high-dimensional feature spaces; and (iv) it has been shown to outperform density-based methods such as LOF on biological benchmark datasets [8].

      The model is trained on the full uploaded dataset using the following feature set: Hb, Hct, %Ret, T, T/E ratio, and the inter-test deltas ΔHb and Δ%Ret. The contamination parameter is set to 0.05, reflecting the assumption that approximately 5% of records in any batch may represent anomalous profiles. The number of estimators is set to 100, with a maximum sample size of min(256, n), where n is the number of records.

      Each record receives a raw anomaly score $s_{IF}$ from the Isolation Forest decision function, where lower values indicate more anomalous observations. To combine this output with the rule-based score, anomaly scores are normalized to the range [0, 1] within the uploaded batch using min-max scaling:

      $$ \tilde{s}_{IF} = \frac{s_{IF} - \min(s_{IF})}{\max(s_{IF}) - \min(s_{IF}) + \epsilon} \tag{6} $$

      where $\epsilon$ is a small constant (e.g., $10^{-9}$) added to avoid division by zero when all scores are identical. Note that equation (7) treats larger values of $\tilde{s}_{IF}$ as more suspicious, so (6) is meaningful only if $s_{IF}$ is oriented so that larger values are more anomalous, for example by negating the decision-function output before scaling.

    4. Risk Score Computation

      The final composite risk score for each athlete record is computed by combining the normalized rule score and the ML anomaly score using a weighted average:

      $$ R = \alpha \cdot \tilde{S}_{rule} + (1 - \alpha) \cdot \tilde{s}_{IF} \tag{7} $$

      where $\tilde{S}_{rule} = S_{rule}/4$ is the normalized rule score in [0, 1], and $\alpha = 0.5$ assigns equal weight to rule-based and ML-based evidence.

      The resulting risk score $R \in [0, 1]$ is classified into three bands as shown in Table I.

      TABLE I
      Risk Score Classification Bands

      Risk Score  | Label  | Action
      0.00 – 0.39 | Low    | No immediate action
      0.40 – 0.69 | Medium | Flag for analyst review
      0.70 – 1.00 | High   | Escalate for expert review

  5. IMPLEMENTATION DETAILS

    Table II summarizes the technology stack used in the NADARAI prototype.

    TABLE II
    NADARAI Technology Stack

    Component          | Technology
    Frontend framework | React 18 + Vite
    Charts             | Recharts
    Backend framework  | FastAPI + Uvicorn
    Data validation    | Pydantic
    Database           | SQLite
    ORM                | SQLAlchemy
    Authentication     | OAuth2 + JWT (python-jose)
    Password hashing   | BCrypt (passlib)
    ML library         | Scikit-learn (Isolation Forest)
    Data processing    | Pandas, NumPy
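
    To make the scoring logic of the analysis engine concrete, the sketch below chains the four rule flags, the batch-wise min-max normalization, the weighted combination, and the risk bands. It is a plain-Python illustration, not the prototype's code: the `LabRecord` fields, the function names, and in particular the sign flip applied to the raw Isolation Forest scores are assumptions. The flip is made here because lower decision-function values mean more anomalous records, whereas the composite score should grow with anomaly.

```python
from dataclasses import dataclass

ALPHA = 0.5     # weight of the rule score in the composite
EPSILON = 1e-9  # avoids division by zero when all scores are identical


@dataclass
class LabRecord:
    hb: float        # hemoglobin, g/dL
    te_ratio: float  # testosterone-to-epitestosterone ratio
    ret_jump: float  # change in %Ret since the previous test (assumed precomputed)
    tests_30d: int   # number of tests in the rolling 30-day window


def rule_score(rec: LabRecord) -> int:
    """Sum of the four binary rule flags, an integer between 0 and 4."""
    flags = (
        rec.hb > 17.0,       # R1: hemoglobin heuristic
        rec.te_ratio > 4.0,  # R2: T/E threshold
        rec.ret_jump > 1.5,  # R3: reticulocyte jump (threshold from the synthetic-data description)
        rec.tests_30d < 2,   # R4: low test frequency
    )
    return sum(flags)


def normalize_anomaly(raw_scores: list) -> list:
    """Min-max scale scores within one batch so that 1.0 is the most anomalous.

    Assumption: raw scores follow the decision-function convention
    (lower = more anomalous), so they are negated before scaling.
    """
    flipped = [-score for score in raw_scores]
    low, high = min(flipped), max(flipped)
    return [(score - low) / (high - low + EPSILON) for score in flipped]


def composite_risk(records: list, raw_scores: list, alpha: float = ALPHA) -> list:
    """Weighted average of the normalized rule score and the normalized anomaly score."""
    anomaly = normalize_anomaly(raw_scores)
    return [
        alpha * rule_score(rec) / 4 + (1 - alpha) * score
        for rec, score in zip(records, anomaly)
    ]


def risk_band(risk: float) -> str:
    """Map a risk score to the Low, Medium, and High bands."""
    if risk >= 0.70:
        return "High"
    if risk >= 0.40:
        return "Medium"
    return "Low"
```

    Because normalization is done per batch, the same record can receive a different risk score depending on which other records are uploaded with it; the sketch inherits that property from the batch-wise scaling.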

    The project is structured as two independent services: a /frontend React application and a /backend FastAPI application. The backend exposes the following primary API endpoint groups: /auth (login, token refresh), /athletes (CRUD for athlete records), /upload (CSV ingestion), /analysis (trigger and retrieve analysis results), /scores (risk score queries), and /cases (case assignment and tracking).

    Completed modules as of the current prototype stage include: secure login and RBAC, athlete data management, CSV upload and parsing, rule-based analysis engine, Isolation Forest anomaly detection, risk score computation and storage, and the main dashboard with trend charts and risk overview.

    Modules currently in progress include: case assignment with RBAC-based routing, manual athlete data entry, configurable analysis thresholds, and performance optimization of database queries for larger datasets.

  6. DATASET AND EXPERIMENTAL SETUP
    1. Dataset

      Official NADA or WADA athlete biological records are not publicly available due to athlete privacy regulations and the legal sensitivity of testing data. Consequently, all experiments were conducted using synthetic datasets generated specifically for this project.

      Synthetic records were generated with the following parameter ranges, chosen to approximate typical physiological values and screening patterns reported in ABP literature [2]:

      • Hemoglobin: 13.5–16.5 g/dL (normal); > 17.0 g/dL (suspicious)
      • Hematocrit: 40%–50% (normal); > 50% (suspicious)
      • Reticulocyte %: 0.5%–2.0% (normal); jump > 1.5% between consecutive tests (suspicious)
      • T/E ratio: < 4.0 (normal); > 4.0 (suspicious)
      • Test frequency: ≥ 2 tests per 30 days (normal); < 2 tests (suspicious)

      Two categories of synthetic profiles were created: (i) Normal profiles – all parameters within established physiological ranges with natural random variation (Gaussian noise); and (ii) Suspicious profiles – one or more parameters deliberately set above threshold limits, or inter-test jumps injected to simulate micro-dosing patterns.
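
      A generator for the two profile categories could look like the sketch below. The normal and suspicious ranges come from the list above, but the Gaussian means and standard deviations, the field names, and the functions `normal_profile`, `suspicious_profile`, and `make_dataset` are illustrative assumptions; the paper does not state how its synthetic records were drawn. Injected reticulocyte jumps are left out because they require consecutive tests per athlete.

```python
import random


def _clamp(value: float, low: float, high: float) -> float:
    """Keep a drawn value inside the stated normal range."""
    return min(max(value, low), high)


def normal_profile(rng: random.Random) -> dict:
    """One record with Gaussian variation inside the normal ranges."""
    return {
        "hb": _clamp(rng.gauss(15.0, 0.6), 13.5, 16.5),       # g/dL
        "hct": _clamp(rng.gauss(45.0, 2.0), 40.0, 50.0),      # percent
        "ret_pct": _clamp(rng.gauss(1.2, 0.3), 0.5, 2.0),     # percent
        "te_ratio": _clamp(rng.gauss(1.5, 0.6), 0.1, 3.9),    # kept below 4.0
        "tests_30d": rng.randint(2, 4),
        "label": "normal",
    }


def suspicious_profile(rng: random.Random) -> dict:
    """A normal record with one marker pushed past its screening threshold."""
    rec = normal_profile(rng)
    rec["label"] = "suspicious"
    marker = rng.choice(["hb", "hct", "te_ratio", "tests_30d"])
    if marker == "hb":
        rec["hb"] = rng.uniform(17.1, 19.0)
    elif marker == "hct":
        rec["hct"] = rng.uniform(50.5, 55.0)
    elif marker == "te_ratio":
        rec["te_ratio"] = rng.uniform(4.1, 6.0)
    else:
        rec["tests_30d"] = rng.randint(0, 1)
    return rec


def make_dataset(n_normal: int = 170, n_suspicious: int = 30, seed: int = 42) -> list:
    """Build a shuffled, reproducible batch of labelled synthetic records."""
    rng = random.Random(seed)
    data = [normal_profile(rng) for _ in range(n_normal)]
    data += [suspicious_profile(rng) for _ in range(n_suspicious)]
    rng.shuffle(data)
    return data
```

      Fixing the seed keeps a generated batch reproducible, which helps when comparing rule-only and hybrid scoring on exactly the same records.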

    2. Experimental Setup

    All experiments were run on a standard development machine (Intel Core i5, 8 GB RAM, Windows 11) running Python 3.11 and Node.js 18. The Isolation Forest model was implemented using Scikit-learn 1.4.
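
    The model configuration given in the methodology (100 estimators, contamination of 0.05, maximum sample size of min(256, n)) maps onto Scikit-learn roughly as shown below. This is a hedged sketch rather than the prototype's code: the function `fit_and_score`, the fixed random seed, the four-column feature subset, and the generated example values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def fit_and_score(features: np.ndarray, seed: int = 0) -> np.ndarray:
    """Fit Isolation Forest on one uploaded batch and return raw scores.

    The returned values come from decision_function, where lower
    (more negative) scores indicate more anomalous records.
    """
    n_records = features.shape[0]
    model = IsolationForest(
        n_estimators=100,
        max_samples=min(256, n_records),
        contamination=0.05,
        random_state=seed,  # assumed here for reproducibility
    )
    model.fit(features)
    return model.decision_function(features)


# Illustrative batch: columns are Hb, Hct, %Ret, T/E ratio (values are made up).
rng = np.random.default_rng(0)
normal_rows = rng.normal(
    loc=[15.0, 45.0, 1.2, 1.5], scale=[0.6, 2.0, 0.3, 0.5], size=(199, 4)
)
outlier_row = np.array([[19.5, 58.0, 3.5, 7.0]])
batch = np.vstack([normal_rows, outlier_row])
scores = fit_and_score(batch)
```

    In this example the last row is far outside the bulk of the data on every feature, so it receives the lowest score in the batch. Because the model is refit on each upload, scores are only comparable within a batch.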

    Two test scenarios were evaluated:

    Scenario A – Normal vs. Suspicious Detection: A dataset of 200 records was constructed comprising 170 normal and 30 suspicious profiles. The system was run end-to-end from CSV upload to risk score output.

    Scenario B – Borderline Cases: A dataset of 50 records was constructed where parameter values were set close to but within threshold limits, to evaluate the system's ability to detect subtle anomalies through the ML component rather than rule flags alone.

  7. RESULTS AND DISCUSSION
    1. Risk Score Outputs

      In Scenario A, the system correctly assigned High risk scores (R > 0.70) to 27 of the 30 injected suspicious profiles, and Low risk scores (R < 0.40) to 163 of the 170 normal profiles. Table III summarizes the detection outcomes.

      TABLE III

      Detection Results – Scenario A (200 Records)

      Profile Type           | Total | Correctly Classified
      Normal (Low risk)      | 170   | 163 (95.9%)
      Suspicious (High risk) | 30    | 27 (90.0%)
      Overall                | 200   | 190 (95.0%)
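
      The percentages in Table III follow directly from the reported counts. Treating the suspicious profiles as the positive class (an interpretive assumption, since the paper reports only per-class correct counts), the first ratio is the specificity and the second is the sensitivity:

      $$ \frac{163}{170} \approx 95.9\%, \qquad \frac{27}{30} = 90.0\%, \qquad \frac{163 + 27}{200} = \frac{190}{200} = 95.0\% $$
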
    2. Anomaly Detection – Borderline Cases

      In Scenario B, the Isolation Forest component successfully flagged 38 of 50 borderline records (76%) as Medium or High risk (R ≥ 0.40), despite none of the records triggering any rule-based flags. This demonstrates the value of the ML component in identifying suspicious patterns that fall within established thresholds – the primary weakness of pure rule-based anti-doping systems.

    3. Dashboard Visualization

      The dashboard displays athlete risk scores sorted in descending order, trend charts for key biological parameters over time, and a case queue for flagged athletes. In user testing with the development team, the dashboard reduced the time needed to identify the top 10 highest-risk athletes from a 200-record dataset to under 30 seconds, compared to estimated manual review time of several hours.

    4. Observations and Limitations

      Several observations were noted during testing. First, the contamination parameter of the Isolation Forest (set to 0.05) was found to be sensitive in small datasets; batches with fewer than 50 records produced noisier anomaly scores. Second, the rule-based component alone achieved 88.5% overall accuracy, while the hybrid system improved this to 95.0%, confirming the added value of the ML layer. Third, the current prototype does not implement longitudinal baseline tracking across multiple CSV uploads; each upload is analyzed independently, limiting the system's ability to detect gradual drift in an athlete's parameters over time.

  8. CONCLUSION AND FUTURE WORK

This paper presented NADARAI, a web-based anti-doping intelligence platform that combines deterministic rule-based checks with Isolation Forest unsupervised anomaly detection to generate composite risk scores for athlete biological test records. The system is implemented as a functional prototype using React, FastAPI, SQLite, and Scikit-learn, with JWT-based authentication and Role-Based Access Control for secure multi-user operation.

The key contributions of NADARAI are: a hybrid risk scoring methodology that extends pure threshold-based detection with ML anomaly detection; a complete end-to-end operational workflow from data ingestion to case review; and a practical platform architecture that can serve as a reference implementation for AI-assisted anti-doping systems.

Experimental results on synthetic datasets demonstrate 95% overall classification accuracy, with the ML component contributing meaningfully to the detection of borderline anomalies that rule checks alone would miss.

Current limitations include reliance on synthetic data for testing, the absence of longitudinal cross-upload baseline tracking, and incomplete case assignment and manual data entry features that remain in development.

Future work will focus on the following directions:

  • Real data validation: Collaboration with authorized anti-doping bodies to access real athlete datasets under data governance agreements, enabling proper external validation of the risk model.
  • Expanded parameter set: Incorporation of additional ABP parameters including serum ferritin, OFF-score, and LH levels to improve detection coverage.
  • Longitudinal tracking: Implementation of persistent athlete baseline models that update with each new test upload, enabling detection of gradual physiological drift characteristic of micro-dosing.
  • Improved ML models: Evaluation of alternative models including One-Class SVM, autoencoders, and LSTM-based temporal anomaly detectors.
  • Explainable AI: Integration of SHAP-based feature importance explanations to provide case-level justifications for risk scores, improving analyst trust and regulatory defensibility.
  • Production deployment: Migration from SQLite to PostgreSQL, containerization using Docker, and deployment to a cloud environment for real-world use.

ACKNOWLEDGEMENT

The authors sincerely thank Prof. Shubhangi Gulhane for her invaluable guidance, constant encouragement, and dedicated support throughout the course of this work. Her technical insights and constructive feedback were instrumental in shaping this project.

REFERENCES

  1. World Anti-Doping Agency (WADA), "World Anti-Doping Code 2021," WADA, Montreal, Canada, 2021. [Online]. Available: https://www.wada-ama.org/en/resources/world-anti-doping-program/world-anti-doping-code [Accessed: Feb. 2026].
  2. P.-E. Sottas, N. Robinson, O. Rabin, and M. Saugy, "The athlete biological passport," Clinical Chemistry, vol. 57, no. 7, pp. 969-976, Jul. 2011. doi: 10.1373/clinchem.2011.162271.
  3. N. Robinson, M. Saugy, A. Vernec, and P.-E. Sottas, "The athlete biological passport: an effective tool in the fight against doping," Clinical Chemistry, vol. 57, no. 6, pp. 830-832, Jun. 2011. doi: 10.1373/clinchem.2011.162107.
  4. M. R. Rahman, J. Bejder, T. C. Bonne, A. B. Andersen, J. R. Huertas, R. Aikin, N. B. Nordsborg, and W. Maaß, "Detection of erythropoietin in blood to uncover doping in sports using machine learning," in Proc. 2022 IEEE Int. Conf. Digital Health (ICDH), Barcelona, Spain, Jul. 2022, pp. 193-201. doi: 10.1109/ICDH55609.2022.00038.
  5. H. Ryoo, S. Cho, T. Oh, Y. Kim, and S.-H. Suh, "Identification of doping suspicions through artificial intelligence-powered analysis on athlete's performance passport in female weightlifting," Frontiers in Physiology, vol. 15, p. 1344340, Jun. 2024. doi: 10.3389/fphys.2024.1344340.
  6. Q. Yang, W. Xu, X. Sun, Q. Chen, and B. Niu, "The application of machine learning in doping detection," Journal of Chemical Information and Modeling, vol. 64, no. 23, pp. 8673-8683, Nov. 2024. doi: 10.1021/acs.jcim.4c01234.
  7. F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in Proc. 8th IEEE Int. Conf. Data Mining (ICDM), Pisa, Italy, Dec. 2008, pp. 413-422. doi: 10.1109/ICDM.2008.17.
  8. F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation-based anomaly detection," ACM Transactions on Knowledge Discovery from Data, vol. 6, no. 1, pp. 1-39, Mar. 2012. doi: 10.1145/2133360.2133363.
  9. N. Chmait and H. Westerbeek, "Artificial intelligence and machine learning in sport research: an introduction for non-data scientists," Frontiers in Sports and Active Living, vol. 3, p. 682287, Dec. 2021. doi: 10.3389/fspor.2021.682287.
  10. B. Krumm, F. Botrè, J. J. Saugy, and R. Faiss, "Future opportunities for the athlete biological passport," Frontiers in Sports and Active Living, vol. 4, p. 986875, Nov. 2022. doi: 10.3389/fspor.2022.986875.