Predicting Heart Disease using Machine Learning Algorithms

Asst. Prof. Poonam Pramod Shilwant

doi:10.17577/IJERTCONV14IS020155

NCRTCS - 2026 (Volume 14 – Issue 02)

Open Access
Authors : Asst. Prof. Poonam Pramod Shilwant
Paper ID : IJERTCONV14IS020155
Volume & Issue : Volume 14, Issue 02, NCRTCS – 2026
Published (First Online) : 21-04-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Computer Science Department

Dr. D. Y. Patil Arts, Commerce and Science College Akurdi, Pune 411044, India

Abstract

Heart disease remains one of the leading causes of mortality worldwide, including in India. Early prediction and timely diagnosis can significantly reduce fatal outcomes. This research paper proposes a machine learning-based predictive model for detecting heart disease using clinical and demographic attributes. Various supervised learning algorithms, namely Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM), were implemented and compared. The study uses publicly available heart disease datasets and evaluates performance using accuracy, precision, recall, and F1-score metrics. Experimental results indicate that ensemble-based approaches such as Random Forest provide higher predictive accuracy than traditional classifiers. The findings demonstrate the potential of machine learning systems to assist healthcare professionals in early heart disease risk assessment in the Indian healthcare context.

Keywords:

Heart Disease Prediction, Machine Learning, Random Forest, Healthcare Analytics, Supervised Learning.

Introduction

Cardiovascular diseases (CVDs) encompass disorders of the heart and blood vessels including coronary heart disease, stroke, and peripheral arterial disease. CVDs are globally responsible for millions of fatalities annually. In India, the burden of heart disease has significantly increased over recent decades, with estimates indicating that around 11% of the adult population has some form of CVD.

According to the World Health Organization, cardiovascular conditions account for approximately one-third of all deaths in India. The prevalence is higher in urban areas compared to rural regions. Additionally, lifestyle risk factors such as tobacco use, high blood pressure, diabetes, and high dietary salt intake contribute to rising disease incidence.

Machine learning offers powerful tools to analyze complex healthcare datasets and support early disease detection. This paper compares multiple supervised learning algorithms to identify the most promising model for heart disease prediction.

Literature Review

Cardiovascular diseases (CVDs), particularly heart disease, have emerged as one of the most significant public health challenges in India over the last few decades. The confluence of demographic transition, lifestyle changes, genetic predisposition, and low awareness of risk factors has contributed to a rapidly increasing disease burden (Prabhakaran, Jeemon, & Roy, 2016). India's situation is particularly grave, as it faces the dual burden of high prevalence and premature mortality associated with CVDs compared to many developed countries (ICMR, PHFI, & IHME, 2017; India State-Level Disease Burden Initiative CVD Collaborators, 2018).

The growing trend of heart disease in India is evident from epidemiological studies and national surveys. For instance, national data from the National Family Health Survey-5 (NFHS-5) indicate that the prevalence of hypertension, a significant risk factor for heart disease, has increased among adults across states, with higher prevalence in urban areas (Ministry of Health and Family Welfare, 2021). This pattern is reinforced by the Health in India report from the National Statistical Office (2023), which documents increasing rates of obesity, diabetes, and hypertension, all critical contributors to cardiovascular conditions. The NITI Aayog Health Index Report (2021) further highlights the inter-state variations in disease burden and preventive healthcare infrastructure, suggesting that states with lower health system performance also show higher instances of non-communicable diseases such as heart disease (NITI Aayog, 2021).

Epidemiological evidence suggests that Indians may develop cardiovascular risk factors at younger ages compared to Western populations, possibly due to genetic predispositions combined with rapid urbanization and lifestyle shifts (Gupta, Mohan, & Narula, 2016). Earlier research among urban South Indian populations reported a coronary artery disease prevalence rate of up to 7.4% among adults aged 25 to 64 years (Mohan et al., 2001). Such findings underscore the need for robust predictive tools that can help clinicians identify individuals at high risk for heart disease early in the disease trajectory.

Traditional methods of cardiovascular risk assessment have relied on clinical scores such as the Framingham Risk Score. However, these tools often demonstrate limited predictive power when applied to Indian populations due to underlying differences in genetic makeup, lifestyle factors, and disease patterns (Prabhakaran et al., 2016). With the availability of large healthcare datasets and advancements in computational power, machine learning (ML) has been proposed as a promising alternative for prediction-based disease analytics.

Machine learning refers to a set of computational algorithms that learn patterns from data and use those patterns to make predictions or decisions without explicit programming for each task. In the context of disease prediction, supervised learning models such as Logistic Regression, Decision Trees, Support Vector Machines (SVM), Random Forest, and K-Nearest Neighbors (KNN) have been commonly applied (Singh & Singh, 2018; Sharma & Sharma, 2019). These algorithms can effectively handle complex interactions between multiple predictor variables, including age, sex, blood pressure, cholesterol levels, and lifestyle attributes, which often define cardiovascular risk profiles.
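As an illustration of the supervised learning workflow described above, the following minimal Python sketch trains a K-Nearest Neighbors classifier on a handful of records and scores it with accuracy, precision, recall, and F1-score. The toy records, the two pre-scaled features, and all function names are hypothetical and chosen only for illustration; they are not data or code from this study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    # Pair each training label with its Euclidean distance to the query.
    neighbours = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical records: (scaled age, scaled cholesterol); label 1 = disease.
train_X = [(0.2, 0.3), (0.3, 0.2), (0.25, 0.35),
           (0.8, 0.7), (0.7, 0.9), (0.9, 0.8)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(0.22, 0.28), (0.85, 0.75), (0.6, 0.6), (0.35, 0.4)]
test_y = [0, 1, 1, 1]

y_pred = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
metrics = classification_metrics(test_y, y_pred)
print(y_pred, metrics)
```

On this toy split the last test record is a positive case that sits close to the negative cluster, so it is missed; recall drops while precision stays perfect, which is why the study reports several metrics and not accuracy alone.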

Several Indian studies have explored ML models for heart disease prediction. Soni, Ansari, Sharma, and Soni (2011) conducted one of the early explorations of predictive data mining in the medical domain, demonstrating that tree-based algorithms could classify heart disease risk with reasonable accuracy. This study laid the groundwork for subsequent ML research in healthcare prediction within the Indian context. Building upon these efforts, Dinesh Kumar and Arumugam (2012) compared multiple algorithms, including Decision Trees and Artificial Neural Networks (ANN), and reported promising prediction accuracy, suggesting that data-driven models could augment clinical decision-making.

Subsequent Indian research continues to validate the role of ML in improving heart disease detection and prediction. Sharma and Sharma (2019) implemented several ML algorithms, including Logistic Regression and Random Forest, on Indian datasets and achieved an accuracy exceeding that of traditional statistical methods. Their findings reinforce that ensemble approaches, particularly Random Forest, often outperform simpler classifiers due to their ability to reduce variance and handle the heterogeneous data distributions common in medical records. Similarly, Singh and Singh (2018) applied multiple machine learning models and demonstrated that advanced algorithms such as SVM and Random Forest provided significantly higher classification performance than basic predictive techniques.
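The variance-reduction idea behind ensembles can be sketched with bagging: many weak learners are each trained on a bootstrap resample and their votes are combined. The Python sketch below uses one-split decision stumps as the weak learners. It is a simplified stand-in for Random Forest (which additionally randomizes the features considered at each split), and the toy records and function names are hypothetical, not taken from the cited studies.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Find the single-feature threshold split with the fewest training errors."""
    best = None  # (errors, feature index, threshold, left label, right label)
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left label.
            right_label = (Counter(right).most_common(1)[0][0]
                           if right else left_label)
            errors = (sum(l != left_label for l in left)
                      + sum(l != right_label for l in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def stump_predict(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def fit_bagged_stumps(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps


def ensemble_predict(stumps, row):
    """Majority vote across all stumps."""
    votes = Counter(stump_predict(s, row) for s in stumps)
    return votes.most_common(1)[0][0]


# Hypothetical records: (scaled age, scaled cholesterol); label 1 = disease.
X = [(0.2, 0.3), (0.3, 0.2), (0.25, 0.35), (0.8, 0.7), (0.7, 0.9), (0.9, 0.8)]
y = [0, 0, 0, 1, 1, 1]

stumps = fit_bagged_stumps(X, y)
print(ensemble_predict(stumps, (0.1, 0.1)), ensemble_predict(stumps, (0.95, 0.95)))
```

Any single stump depends heavily on which records its resample happened to contain; the majority vote smooths out those individual quirks, which is the sense in which an ensemble reduces variance.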

While these studies highlight the potential of machine learning, they also emphasize certain limitations. Many Indian research efforts are constrained by small sample sizes, limited diversity of predictor variables, and reliance on a single dataset (often the UCI Heart Disease dataset). This limits generalizability, especially when translating predictive models to real-world Indian clinical environments. Moreover, most studies do not account for socioeconomic, environmental, and regional factors, which can influence heart disease risk among diverse Indian populations. These gaps indicate that future work should focus on larger, nationally representative datasets that capture the breadth of Indian demographic, clinical, and lifestyle profiles.

Beyond predictive accuracy, there is also growing concern regarding model interpretability and clinical applicability. Clinicians are often reluctant to adopt "black box" models that provide little insight into how predictions are generated. Explainable AI (XAI) techniques and interpretable ML models can address these concerns by highlighting the influence of specific risk factors and decision paths, thereby improving clinician trust and practical utility (Karthikeyan & Pais, 2020). Research in this direction is nascent in India but holds promise for making machine learning solutions more acceptable in clinical settings.

The national policy framework also underscores the importance of early detection and prevention. The National Programme for Prevention and Control of Cancer, Diabetes, Cardiovascular Diseases and Stroke (NPCDCS) operational guidelines (2022) outline strategic priorities for strengthening non-communicable disease screening and management services across India (National Health Systems Resource Centre, 2022). Integrating ML-based decision support into such programs could enhance screening efficiency, optimize resource allocation, and allow health workers to identify high-risk individuals earlier.

Government health surveys reveal important patterns useful for model refinement. NFHS-5 data demonstrate significant urban-rural disparities in hypertension and diabetes prevalence, both key risk factors for heart disease. For example, hypertension prevalence is higher in urban regions than in rural ones, likely due to differences in diet, physical activity, and stress levels (Ministry of Health and Family Welfare, 2021). These insights can inform additional features in ML models, such as urban lifestyle indicators, socioeconomic status, education, and access to healthcare, enhancing model sensitivity and specificity for Indian populations.

Finally, global evidence also supports machine learning's relevance in heart disease prediction, but Indian research uniquely highlights the need for localized models. A globally optimized model may not perform adequately in the Indian context due to demographic, clinical, and environmental heterogeneity. Hence, a tailored, India-centric ML approach integrating regional and population-specific risk factors is essential to achieve optimal predictive performance and healthcare outcomes.

In summary, the literature demonstrates the increasing burden of cardiovascular disease in India, the limitations of traditional risk prediction methods, and the promise of machine learning techniques for enhanced prediction accuracy. Indian research has validated the utility of ML models, particularly ensemble and nonlinear classifiers, in detecting heart disease, but challenges remain regarding dataset diversity, model interpretability, and integration into clinical workflows. Addressing these gaps through larger datasets, feature enrichment, and explainable models will be critical to realizing machine learning's full potential in improving heart disease outcomes in India.

Summary of Indian Research on Heart Disease Prediction Using Machine Learning

Sr. No.	Authors & Year	Dataset Used	Machine Learning Methods	Sample Size	Reported Accuracy	Key Findings
1	Soni et al., 2011	UCI Heart Disease Dataset	Decision Tree, Naïve Bayes	303 records	~85% (Decision Tree)	Decision Tree performed better for structured clinical data.
2	Dinesh Kumar & Arumugam, 2012	UCI Dataset	ANN, Decision Tree	270-300 records	~83-86%	ANN showed slightly better performance than basic classifiers.
3	Sharma & Sharma, 2019	UCI + Indian Hospital Data	Logistic Regression, Random Forest, SVM	300+ records	~88% (Random Forest)	Ensemble methods improved prediction reliability.
4	Singh & Singh, 2018	UCI Dataset	SVM, KNN, Random Forest	303 records	~87% (SVM)	SVM performed well with nonlinear feature boundaries.
5	Karthikeyan & Pais, 2020	Indian Clinical Dataset	Random Forest, Logistic Regression	500+ records	~89%	Emphasized explainability and risk factor importance.
6	India State-Level CVD Study (2018)	National Survey Data	Statistical + ML modeling	Large national dataset	Not accuracy-based	Highlighted increasing burden & need for predictive systems.
7	ICMR Report (2017)	National Epidemiological Data	Statistical Modeling	Multi-state dataset	Policy-oriented	Emphasized early detection importance in India.

Figure: Accuracy Comparison of Indian ML Heart Disease Studies

Hypothesis

H0 (Null Hypothesis): Machine learning algorithms do not significantly improve the accuracy of heart disease prediction compared to traditional statistical methods.

H1 (Alternative Hypothesis): Machine learning algorithms significantly improve the accuracy of heart disease prediction compared to traditional statistical methods.
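
One way such a hypothesis could be tested is with McNemar's exact test on paired predictions, where both models are scored on the same test records. The text here does not state which significance test is applied, so the choice of test, the function names, and the example counts below are all assumptions made for illustration.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b):
    """Count records where exactly one of the two models is correct."""
    triples = list(zip(y_true, pred_a, pred_b))
    only_a = sum(a == t and b != t for t, a, b in triples)
    only_b = sum(a != t and b == t for t, a, b in triples)
    return only_a, only_b


def mcnemar_exact_p(only_a, only_b):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    k = min(only_a, only_b)
    # Under H0 each discordant record favours either model with probability 0.5,
    # so the smaller count follows a Binomial(n, 0.5) distribution.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired predictions: model A = statistical baseline, model B = ML model.
y_true = [1, 1, 0, 0, 1]
pred_a = [1, 0, 0, 1, 1]
pred_b = [1, 1, 0, 0, 0]
counts = discordant_counts(y_true, pred_a, pred_b)

# Hypothetical larger study: baseline alone correct on 2 records, ML alone on 10.
p_value = mcnemar_exact_p(2, 10)
print(counts, round(p_value, 4))
```

With these assumed counts the p-value is 158/4096, roughly 0.039, so H0 would be rejected at the 5% level; with only a few discordant records, as in the five-record example, the test has little power to reject H0.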

Methodology

Dataset