MLOps: Changes and Tools – Adapting Machine Learning Workflows for Enhanced Efficiency and Scalability

DOI: 10.17577/IJERTV14IS100158

Mysaa Khaled Alrahal
Syrian Virtual University

Dr. Sira Astour
Arab International University

Abstract – This study presents an analytical and practical investigation of Machine Learning Operations (MLOps) processes, with a particular focus on experiment tracking and reproducibility. An open-source, end-to-end MLOps pipeline was proposed, encompassing data preparation, training, tracking, deployment, and preliminary post-production monitoring. The research utilized DVC, MLflow, and Weights & Biases (W&B) to ensure proper versioning and logging of configurations and metrics and compared their performance to conventional non-automated workflows. A CI/CD pipeline was implemented using GitHub Actions, with automated deployment through Docker and FastAPI.

Experimental results demonstrated that integrating tracking and automation tools preserved model accuracy while significantly reducing deployment effort and time and minimizing operational errors, ultimately enhancing model reliability in both academic and production environments.

Keywords: MLOps, MLOps Pipeline, Experiment Tracking, CI/CD, DevOps, Versioning, Machine Learning, Artificial Intelligence

  1. INTRODUCTION

    In the era of rapid digital transformation and the growing reliance on Artificial Intelligence (AI) and Machine Learning (ML) technologies in both research and industrial applications, ensuring the efficiency and sustainability of model life cycles has become increasingly critical [1], [2]. Despite remarkable progress in algorithmic development and accuracy enhancement, significant challenges persist in data management, experiment tracking, and deployment in production environments [3], [4]. To address these issues, Machine Learning Operations (MLOps) has emerged as an extension of DevOps practices, integrating data management, automation, and continuous model monitoring to achieve transparency, reproducibility, and scalability [2], [3]. Nevertheless, as reported in recent studies, several persistent gaps and implementation challenges remain, particularly in aligning tools, workflows, and team collaboration, which necessitate deeper analysis and applied research [1], [5], [6].

    Although AI and ML have advanced considerably, most traditional research and application environments still face fundamental limitations. These include difficulties in managing datasets and models, the absence of effective version control mechanisms, low operational efficiency due to manual training and deployment processes, and poor scalability when models are transferred across multiple servers or cloud environments [2], [3], [4]. Furthermore, experiment reproducibility often suffers; when code or data change, prior results become irreproducible, undermining scientific reliability [3], [7]. Another recurring issue is the lack of integration between development and data analysis teams, leading to the common "works on my machine" problem [2], [3]. Manual model deployment also remains prevalent in academic projects: training and evaluation are typically completed, but deployment is either manual or neglected entirely, resulting in high error rates and delayed updates [8]. In production environments, the gap between development and operational contexts often leads to failures when models are transitioned from one environment to another [3]. Moreover, post-deployment monitoring is frequently overlooked, causing gradual degradation in model performance due to data drift and changing operational conditions [9].

    These challenges have motivated numerous researchers to enhance MLOps practices and improve interoperability between tools that often differ in design and configuration [8], [6]. This study contributes to these efforts by focusing specifically on experiment tracking, a central yet underexplored component of the MLOps workflow, and by analyzing its role in ensuring transparency, reproducibility, and scientific reliability [10], [11].

    The main objective of this research is to evaluate the impact of MLOps techniques on improving the efficiency of ML model life cycles by proposing and implementing a fully integrated and open-source pipeline that supports reproducibility, traceability, and automated deployment. The study compares open-source tools such as DVC, MLflow, and Weights & Biases (W&B) in terms of experiment tracking and model performance analysis, benchmarking them against conventional non-automated workflows.

    This research fills an important gap by highlighting the relationship between experiment tracking and subsequent stages in the MLOps life cycle. It provides a practical, modular framework that can be leveraged in both academic and industrial contexts to enhance model efficiency, scalability, and transparency.

    Empirically, the study is among the first to conduct a comparative evaluation of two open-source tracking tools — MLflow and W&B — across textual (TF-IDF + Logistic Regression) and image-based (CNN–CIFAR10) experiments. Results revealed comparable statistical performance (Accuracy and F1-score) across both environments, with MLflow demonstrating faster execution (≈11 s for text, ≈144 s for images) due to local storage efficiency, while W&B offered superior collaborative and analytical capabilities despite higher synchronization latency (≈26–42 s for text, ≈216 s for images).

    The integration of these tools with CI/CD pipelines using GitHub Actions, Docker, and FastAPI successfully enabled an end-to-end MLOps workflow — from data preparation and training to automated deployment and post-deployment monitoring. Consequently, the proposed framework addresses a key research gap by emphasizing the operational value of experiment tracking and providing a reproducible, extensible MLOps model suitable for real-world use.

  2. RELATED WORK

    MLOps practices are rapidly evolving from isolated laboratory prototypes into production-grade, socio-technical ecosystems that intersect with data tooling, cloud platforms, governance, and regulatory compliance. Early work foregrounded sustainability and long-term maintainability as first-class concerns in AI software, warning that unchecked platform complexity and technical debt can undermine both social and organizational sustainability [12]. Subsequent systematic mappings consolidated best practices, recurring challenges, and maturity models, revealing the need for standardization across the MLOps lifecycle and for reliable mechanisms that ensure end-to-end traceability and reproducibility [9].

    To bridge academic and industrial viewpoints, Steidl et al. provided a unifying lens over terms such as AI-DevOps, CI/CD for AI, MLOps, model lifecycle management, and CD4ML, and organized triggers and pipeline stages into a coherent hierarchy (data processing, model learning, software development, and system operations) [13]. Complementing process-centric views, Faubel and Schmid systematically analyzed MLOps platforms and architectural tactics using attribute-driven design to map capabilities to platform choices [14]. A broader landscape study further compared 16 widely used platforms with a three-step evaluation, yielding decision guidance for end-to-end versus specialized stacks and highlighting gaps in production monitoring and drift detection [15]. In healthcare, a scoping review distilled thematic areas that stress continuous monitoring, re-training, ethics, clinical integration, infrastructure, compliance, and feasibility [16]. Large-scale geospatial pipelines demonstrated how container orchestration (Kubernetes) enables scalable, repeatable updates over massive satellite imagery [17].

    Experiment tracking remains pivotal: Budras et al. surveyed 20 tools and analyzed four in depth [10]. Matthew highlighted versioning and reproducibility at scale [7]. Safri, Papadimitriou, and Deelman argued for dynamic tracking integrated with workflow systems [11]. Applied templates that codify provenance, logging, configuration management, and CI/CD offer pragmatic on-ramps [18]. Collectively, prior work establishes the need for standardization and sustainability, shared vocabulary and staged pipelines, architecture-level guidance, domain constraints, and the centrality of tracking. Yet a quantitative link between tracking choices and operational outcomes within a single, reproducible pipeline is still scarce.

    Table 1. Comparative positioning against Matthew (2025) [7]

    Dimension | Matthew (2025) [7] | This Study
    Methodology | Descriptive/theoretical framing of versioning & reproducibility with high-level examples. | Field implementation of a fully reproducible, open-source pipeline (DVC, MLflow, W&B, GitHub Actions, Docker, FastAPI).
    Application Scope | Broad focus on large-scale industrial projects; no concrete experimental pipeline. | Dual case study (text + images) to demonstrate generality across data modalities.
    Focus | Version control & reproducibility as foundational concerns. | Practical impact of experiment tracking within an integrated lifecycle covering versioning, deployment, and monitoring.
    Tooling | Discusses tools primarily as exemplars. | Quantitative head-to-head analysis of MLflow vs. W&B.
    Technical Stack | No implemented CI/CD architecture. | Concrete CI/CD with GitHub Actions and Docker integrated with tracking and serving.
    Scientific Value | Organizational guidance for the reproducibility crisis. | Applicable, benchmarked pipeline and measurement framework (operational trade-offs).

  3. PROPOSED MLOPS PIPELINE

    Figure 1: Pipeline architecture

    1. Architectural Overview

      We propose an end-to-end, open-source MLOps pipeline spanning five phases—design & planning, experimentation & development, deployment, monitoring & maintenance, and continuous improvement. The pipeline orchestrates DVC stages (text and image), experiment tracking via MLflow and W&B, automated builds/releases with Docker and GitHub Actions, and online serving through FastAPI.

      Key assets:

      • DVC stages (text + vision): clean_models, prepare_data_text, train_text_baseline, train_text_mlflow, train_text_wandb, prepare_cifar10, train_cifar10_baseline, train_cifar10_mlflow, train_cifar10_wandb.
      • Serving: FastAPI container on port 80.
      • Tracking: MLflow (local artifact store) and W&B (hosted dashboards).
      • CI/CD: GitHub Actions; Docker Buildx for caching and multi-arch builds.
      • Remote storage: DVC-Remote (SSH) for versioned datasets/models.
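
      One way to realize the baseline/MLflow/W&B stage triplets listed above is a single training script per modality that selects its tracking backend at run time. The sketch below illustrates that pattern with assumed names (the project may equally use separate scripts per backend); the sample parameter and metric values are taken from Table 2.

```python
# tracking_backend.py -- minimal sketch of the backend-selection pattern that
# lets one training script back the baseline / MLflow / W&B DVC stages.
# The project name "mlops-text" and the sample values are illustrative assumptions.
import argparse


def log_run(backend: str, params: dict, metrics: dict) -> None:
    """Log one run's params and metrics to the chosen tracker ('none' = baseline)."""
    if backend == "mlflow":
        import mlflow

        with mlflow.start_run():
            mlflow.log_params(params)
            mlflow.log_metrics(metrics)
    elif backend == "wandb":
        import wandb

        run = wandb.init(project="mlops-text", config=params)
        run.log(metrics)
        run.finish()
    # backend == "none": train and evaluate without any tracking overhead


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", choices=["none", "mlflow", "wandb"], default="none")
    args = parser.parse_args()

    # In the real stages these values come from the training code and params.yaml.
    example_params = {"max_features": 20000, "C": 1.0}
    example_metrics = {"accuracy": 0.855, "f1_macro": 0.853}
    log_run(args.backend, example_params, example_metrics)
```

      Under this pattern, the three DVC stages per modality would differ only in the --backend flag passed to the stage command.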
    2. Five-Phase Workflow (19 Steps)

      P1 — Design & Planning

      Step 1. Repository checkout. Retrieve the repository and pipeline descriptors (YAML/DVC) as the single source of truth.

      P2 — Experimentation & Development

      Step 2. Set up Python 3.10. Interpreter pinning for deterministic runs.

      Step 3. Cache pip. Speeds up repeated CI executions.

      Step 4. Install dependencies (+ dvc-ssh). Ensure ML/data/DVC tooling; dvc-ssh enables secure DVC remote access.

      Step 8. DVC config unset keyfile. Remove stale local keyfile paths to enforce CI SSH credentials.

      Step 9. DVC pull. Fetch the exact versioned datasets/models.

      Step 10. DVC repro. Incrementally reproduce affected stages:

      • Prepare text & image data;
      • Train models under Baseline, MLflow, and W&B backends.

      Step 11. DVC push. Persist newly produced artifacts/models to the remote.

      Outcome: trained, versioned models under models/, ready for packaging.

      P3 — Deployment

      Step 5. Load SSH keys. Secure access to DVC remote and production hosts via GitHub Secrets.

      Step 6. Add server to known_hosts. Enforce host authenticity; avoid interactive prompts.

      Step 7. Test SSH connectivity. Early failure detection.

      Step 12. Docker login. Authenticate to the container registry.

      Step 13. Setup Docker Buildx. Advanced caching / multi-arch builds.

      Step 14. Generate timestamped tag. Publish :latest and a precise time-based tag (e.g., :YYYYMMDDhhmmss).

      Step 15. Build & push image. Bundle code, deps, and models; push to the registry.

      Step 16. Check deploy key. Sanity-check presence/permissions before release.

      Step 17. SSH deploy (docker run + env vars). Pull & run the new image with required env (incl. W&B); expose FastAPI on port 80.

      P4 — Monitoring & Maintenance

      Step 18. Health check (/health). Verify readiness/liveness post-rollout.
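
      The health check is an HTTP probe against the running container; a minimal Python sketch of an equivalent post-rollout check is shown below. The URL, retry count, and delay are assumed values, and the real CI step may use curl instead.

```python
# healthcheck.py -- post-rollout readiness probe (sketch; the CI step itself
# may be a curl loop).  URL, retry budget and delay are assumed values.
import sys
import time

import requests

URL = "http://localhost:80/health"   # FastAPI container exposed on port 80
RETRIES, DELAY_S = 10, 5

for attempt in range(1, RETRIES + 1):
    try:
        if requests.get(URL, timeout=3).status_code == 200:
            print(f"healthy after {attempt} attempt(s)")
            sys.exit(0)
    except requests.RequestException:
        pass                          # container may still be starting
    time.sleep(DELAY_S)

print("deployment failed health check", file=sys.stderr)
sys.exit(1)
```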

      Online telemetry to W&B. Each API inference logs endpoint, prediction, confidence, inference_time for lightweight post-deployment monitoring.

      P5 — Continuous Improvement

      Step 19. Analyze W&B metrics → optimize → retrain via DVC. If accuracy or latency degrades, adjust and re-run Step 10 (dvc repro) to produce a new versioned model, closing the learning loop.

    3. DVC Pipeline Stages and Artifacts

      Text: prepare_data_text → train_text_baseline | train_text_mlflow | train_text_wandb; Vision: prepare_cifar10 → train_cifar10_baseline | train_cifar10_mlflow | train_cifar10_wandb; Housekeeping: clean_models. Each stage declares inputs/outputs and params enabling selective recomputation and exact provenance.

    4. CI/CD and Release Engineering

      A single GitHub Actions workflow implements build, test, package, and deploy with Python 3.10, dependency caching, and DVC remote integration; Docker releases use dual tagging (latest + timestamp); rollout is zero-touch over SSH with post-deployment health verification.

    5. Production Telemetry and Early-Warning Signals

      FastAPI logs minimal inference metadata to W&B (endpoint, predicted label, confidence, latency) supporting quality monitoring, SLOs, and auditability.
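
      A condensed sketch of such a serving app is given below. It assumes the saved artifact is an sklearn Pipeline (TF-IDF + Logistic Regression); the model path, W&B project name, and response schema are illustrative, and the real service may structure its logging differently.

```python
# serve.py -- condensed serving sketch with per-request W&B telemetry.
# Assumptions: artifact is an sklearn Pipeline; path/project names are illustrative.
import time

import joblib
import wandb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/text_model.joblib")                   # assumed path
run = wandb.init(project="mlops-serving", job_type="inference")   # assumed project


class Item(BaseModel):
    text: str


@app.get("/health")
def health():
    # Readiness/liveness probe used by the post-rollout check (Step 18).
    return {"status": "ok"}


@app.post("/predict")
def predict(item: Item):
    start = time.perf_counter()
    proba = model.predict_proba([item.text])[0]
    label = str(model.classes_[proba.argmax()])
    elapsed = time.perf_counter() - start
    # Lightweight post-deployment telemetry (endpoint, prediction, confidence, latency).
    run.log({
        "endpoint": "/predict",
        "prediction": label,
        "confidence": float(proba.max()),
        "inference_time": elapsed,
    })
    return {"prediction": label, "confidence": float(proba.max())}
```

      Started with, for example, uvicorn serve:app --host 0.0.0.0 --port 80, the service is exposed on port 80 as in Step 17.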

    6. Continuous-Learning Loop

      Monitoring insights drive targeted retraining. Using DVC versioning, we re-run dvc repro with updated data/params; new artifacts receive fresh hashes and image tags.
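
      A minimal sketch of how such a loop could be scripted, assuming a W&B entity/project path, a summary metric named accuracy, and a fixed threshold (none of which are taken from the paper):

```python
# retrain_if_degraded.py -- sketch of the monitoring-driven retraining loop.
# Entity/project path, metric key and threshold are assumptions.
import subprocess

import wandb

ACC_THRESHOLD = 0.80

api = wandb.Api()
runs = api.runs("my-entity/mlops-serving")          # assumed entity/project
latest = max(runs, key=lambda r: r.created_at)      # most recent monitored run
accuracy = latest.summary.get("accuracy")

if accuracy is not None and accuracy < ACC_THRESHOLD:
    # Re-run the affected DVC stages (Step 10) and push the new versioned artifacts.
    subprocess.run(["dvc", "repro"], check=True)
    subprocess.run(["dvc", "push"], check=True)
```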

    7. Reproducibility, Security, and Compliance Notes

      • Reproducibility: pinned interpreter, locked deps, DVC-tracked data/models, immutable images, dual tags.
      • Security: SSH keys in GitHub Secrets; known_hosts pinning; runtime-injected env vars.
      • Governance: verifiable lineage from commit → params → data snapshot → model hash → image tag → production metrics.
  4. EXPERIMENTS AND RESULTS
    1. Text Classification (TF-IDF + Logistic Regression)

      A labeled text dataset (text, label) was vectorized with TF-IDF and split 80/20 with stratification. Logistic Regression served as the classifier under all three tracking conditions (none, MLflow, W&B).
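
      A compact sketch of this experiment under the MLflow condition is shown below; the dataset path, TF-IDF settings, and random seed are illustrative assumptions, and the baseline and W&B variants differ only in the logging calls.

```python
# text_experiment.py -- TF-IDF + Logistic Regression with MLflow tracking
# (sketch; dataset path and vectorizer settings are assumptions).
import time

import mlflow
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/text.csv")          # assumed path; columns: text, label
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

vec = TfidfVectorizer(max_features=20000)
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

with mlflow.start_run():
    clf = LogisticRegression(max_iter=1000)
    t0 = time.perf_counter()
    clf.fit(Xtr, y_train)
    train_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    pred = clf.predict(Xte)
    predict_time = time.perf_counter() - t0

    mlflow.log_params({"vectorizer": "tfidf", "max_features": 20000})
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, pred),
        "f1_macro": f1_score(y_test, pred, average="macro"),
        "f1_weighted": f1_score(y_test, pred, average="weighted"),
        "train_time_s": train_time,
        "predict_time_s": predict_time,
    })
```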

      Table 2. Text model performance

      Tracking | Accuracy | F1-macro | F1-weighted
      None | 0.855 | 0.853 | 0.854
      MLflow | 0.855 | 0.853 | 0.854
      W&B | 0.855 | 0.853 | 0.854

      Table 3. Text model timing (train/predict)

      Tracking | Train Time (s) | Predict Time (s)
      None | ~0.51 | ~0.014
      MLflow | ~0.64 | ~0.015
      W&B | ~0.51 | ~0.020

      Table 4. Text model total runtime

      Tracking | Total Runtime (code) | Runtime (tool UI)
      None | ~0.55 s | n/a
      MLflow | ~11.5 s | ~10.6 s
      W&B | ~26.8 s | ~42 s

    2. Image Classification (CNN on CIFAR-10)

      CIFAR-10 (60k images, 10 classes) was normalized to [0,1] and stored as NPZ for faster loads. A small CNN was trained under three tracking conditions.
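
      A sketch of this experiment under the W&B condition is shown below, assuming a Keras implementation, an NPZ layout with x_train/y_train/x_test/y_test arrays of integer labels, and illustrative architecture and epoch settings.

```python
# cifar10_experiment.py -- small CNN on the prepared CIFAR-10 NPZ, logged to W&B
# (sketch; file name, architecture and training settings are assumptions).
import numpy as np
import tensorflow as tf
import wandb

data = np.load("data/cifar10.npz")                    # assumed output of prepare_cifar10
x_train, y_train = data["x_train"], data["y_train"]   # images already scaled to [0, 1]
x_test, y_test = data["x_test"], data["y_test"]       # labels assumed as integer class ids

run = wandb.init(project="mlops-vision", config={"epochs": 10, "batch_size": 64})

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train,
          epochs=run.config.epochs,
          batch_size=run.config.batch_size,
          validation_split=0.1,
          verbose=2)

test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
run.log({"test_accuracy": test_acc, "test_loss": test_loss})
run.finish()
```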

      Table 5. Vision model performance

      Tracking | Accuracy | Test Loss
      None | 0.6767 | 0.9366
      MLflow | 0.6804 | 0.9288
      W&B | 0.6704 | 0.9528

      Table 6. Vision model timing (train/predict)

      Tracking | Train Time (s) | Predict Time (s)
      None | ~124 | ~2.62
      MLflow | ~133 | ~2.82
      W&B | ~146 | ~2.77

      Table 7. Vision model total runtime

      Tracking | Total Runtime (code) | Runtime (tool UI)
      None | ~127 s | n/a
      MLflow | ~144 s | ≈ 2.7 min (162 s)
      W&B | ~149 s | ≈ 3 min 36 s (216 s)

    3. Summary

      Predictive metrics were identical across tracking tools for the text task and closely comparable for the vision task, indicating that tracking does not materially affect model quality when the model and data are fixed. MLflow was faster locally with disk-based artifacts; W&B introduced upload latency yet enabled rich, collaborative dashboards.

  5. CONCLUSION AND FUTURE WORK
    1. Summary of Findings

      This study delivered a reproducible, automated MLOps pipeline using DVC, MLflow, W&B, Docker, FastAPI, and GitHub Actions. Outcomes include reproducible runs, reduced operational burden, collaborative sharing, and CI/CD-integrated deployment. On text and vision tasks, statistical metrics matched across tracking tools; operational trade-offs favored MLflow locally and W&B for collaboration.

    2. Future Research Directions

      • Post-deployment drift and bias detection (Fairlearn, AIF360).
      • Explainability integration (SHAP, LIME) linked to tracking runs.
      • Unified tracking middleware (merge MLflow & W&B metadata).
      • Self-adaptive MLOps (auto re-training on monitored degradation).
      • Temporal performance auditing (standardize runtime/latency logs).
      • Multi-modal pipelines (text+image) with dual-tracking assessment.
    3. Concluding Remark

Well-structured MLOps practices measurably improve reliability, transparency, and deployment velocity. Choosing the right tool mix early—balancing speed (MLflow) and collaboration (W&B)—is decisive for long-term success.

REFERENCES

  1. J. Díaz-de-Arcaya, A. I. Torre-Bastida, G. Zárate, R. Miñón, and A. Almeida, “A Joint Study of the Challenges, Opportunities, and Roadmap of MLOps and AIOps: A Systematic Survey,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–30, 2023.

  2. A. Singla, “Machine Learning Operations (MLOps): Challenges and Strategies,” Journal of Knowledge Learning and Science Technology, vol. 2, no. 3, pp. 334–342, 2023.

  3. D. Kreuzberger, N. Kühl, and S. Hirschl, “Machine Learning Operations (MLOps): Overview, Definition, and Architecture,” IEEE Access, vol. 11, pp. 31866–31879, 2023.

  4. S. Wazir, G. S. Kashyap, and P. Saxena, “MLOps: A Review,” arXiv preprint arXiv:2308.10908, 2023. [Online]. Available: https://arxiv.org/abs/2308.10908

  5. S. J. Warnett and U. Zdun, “On the Understandability of MLOps System Architectures,” IEEE Trans. Software Eng., vol. 50, no. 5, pp. 1015–1039, May 2024.

  6. D. O. Hanchuk and S. O. Semerikov, “Automating Machine Learning: A Meta-Synthesis of MLOps Tools, Frameworks and Architectures,” in Proc. 7th Workshop for Young Scientists in Computer Science & Software Engineering (CS&SE@SW 2024), CEUR Workshop Proc., vol. 3917, pp. 362–414, 2025.

  7. B. Matthew, “Model Versioning and Reproducibility Challenges in Large-Scale ML Projects,” Advances in Engineering Software, 2025.

  8. P. Liang, B. Song, X. Zhan, Z. Chen, and J. Yuan, “Automating the Training and Deployment of Models in MLOps by Integrating Systems with Machine Learning,” Applied and Computational Engineering, vol. 76, pp. 1–7, 2024.

  9. M. I. Zarour, H. Alzabut, and K. Alsarayrah, “MLOps Best Practices, Challenges and Maturity Models: A Systematic Literature Review,” Information and Software Technology, vol. 183, Article 107733, 2025.

  10. T. Budras, M. Blanck, T. Berger, and A. Schmidt, “An In-depth Comparison of Experiment Tracking Tools for Machine Learning Applications,” International Journal on Advances in Software, vol. 15, no. 3&4, 2022.

  11. H. Safri, G. Papadimitriou, and E. Deelman, “Dynamic Tracking, MLOps, and Workflow Integration: Enabling Transparent Reproducibility in Machine Learning,” in Proc. 20th IEEE Int. Conf. on e-Science, pp. 1–10, 2024.

  12. D. A. Tamburri, “Sustainable MLOps: Trends and Challenges,” in Proc. 22nd Int. Symp. on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2020), 2020, pp. 17–23.

  13. M. Steidl, M. Felderer, and R. Ramler, “The Pipeline for the Continuous Development of Artificial Intelligence Models – Current State of Research and Practice,” Journal of Systems and Software, vol. 192, p. 111615, 2023.

  14. L. Faubel and K. Schmid, “A Systematic Analysis of MLOps Features and Platforms,” in Proc. 2024 Int. Conf. on Data Science and Software Engineering Applications (DSD/SEAA – Works in Progress Session), Aug. 2024.

  15. L. Berberi et al., “Machine Learning Operations Landscape: Platforms and Tools,” Artificial Intelligence Review, vol. 58, Article 167, 2025.

  16. A. Rajagopal et al., “Machine Learning Operations in Health Care: A Scoping Review,” Mayo Clinic Proceedings: Digital Health, vol. 2, no. 3, pp. 421–437, 2024.

  17. A. M. Burgueño-Romero, C. Barba-González, and J. F. Aldana-Montes, “Big Data-driven MLOps Workflow for Annual High-Resolution Land Cover Classification Models,” Future Generation Computer Systems, vol. 163, Article 107499, 2025.

  18. R. C. Godwin and R. L. Melvin, “Toward Efficient Data Science: A Comprehensive MLOps Template for Collaborative Code Development and Automation,” SoftwareX, vol. 26, Article 101723, 2024.