A Study on Heart Disease Prediction using Different Classification Models based on Cross Validation Method

Anirban Ghosh; Sushovon Jana

doi:10.5281/zenodo.18439934

Volume 11, Issue 06 (June 2022)


Open Access
Authors : Anirban Ghosh , Sushovon Jana
Paper ID : IJERTV11IS060029
Volume & Issue : Volume 11, Issue 06 (June 2022)
Published (First Online): 09-06-2022
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Anirban Ghosh Department of Statistics University of Kalyani West Bengal, India

Sushovon Jana

Department of Applied Statistics

Maulana Abul Kalam Azad University of Technology West Bengal, India

Abstract: Heart disease causes the greatest number of deaths in the world, and many people do not recognize it at an early stage. In this study, our goal is to find a good model for predicting heart disease. The dataset consists of 918 observations, of which 508 have heart disease and 410 are normal. To find the best model, we compare five classification models, namely Logistic Regression, Support Vector Machine, Random Forest, Naïve Bayes classifier and Linear Discriminant Analysis, for predicting whether or not a person has heart disease. We compare the models using 10-fold cross-validation with three repetitions. The study proposes the Random Forest model as the most appropriate predictor of heart disease, with a mean accuracy of 86.93%, which is the highest among all models. The slope of the peak exercise ST segment is the most important variable for predicting heart disease. Old peak, type of chest pain and maximum heart rate achieved are also important predictors.

Keywords: Heart disease; Logistic Regression; Support Vector Machine; Random Forest Model; Naïve Bayes Classifier; Linear Discriminant Analysis; Cross Validation

INTRODUCTION

According to the World Health Organization (WHO), heart disease is the number one cause of death in the world. It is responsible for 16% of total deaths worldwide [1]. Since 2000, the largest increase in deaths has been for heart disease, rising by more than 2 million to 8.9 million deaths in 2019 [1]. In India, too, heart disease is the leading cause of death. According to the Global Burden of Disease study, 24.8% of all deaths in India are due to heart disease [2]. Heart disease may occur for various reasons. The most common heart disease is coronary artery disease, which is caused by the build-up of fatty plaques in the arteries (atherosclerosis). Heart disease can show various symptoms, such as chest pain, suffocation and weakness, depending on the type of heart disease. It can be prevented by maintaining a proper diet, following a healthy lifestyle, exercising regularly and so on. Although a great deal of statistical and scientific research is being done, heart disease continues to be the largest killer in the world. With early detection and proper treatment, the chance of survival of a heart disease patient can be increased.

We have analyzed a dataset of 918 observations containing 11 independent variables and a target variable indicating whether heart disease is present or not. Through VIF calculation and Principal Component Analysis, we have found that no significant multicollinearity exists among the variables. So, we have fitted several classification models to predict heart disease in a person and compared the accuracy of the different models. We have used the R programming language as the tool for these purposes.
RELATED WORKS

In this study, a comparative analysis has been carried out among various Machine Learning classification algorithms. Random Forest emerges as the best model among them.

Reference [4] used a Random Forest model for prediction of heart disease. The authors obtained an accuracy of 86.9%, with a sensitivity of 90.6% and a specificity of 82.7%.

The authors of [5] proposed a Logistic Regression model for diabetes prediction by integrating PCA and K-means techniques. This model showed a high accuracy of 97.40%.

Reference [6] identified the slope of the peak exercise ST segment, old peak, chest pain type and others as significant predictors of heart disease. That study made a significant contribution in computing strength scores with significant predictors for heart disease prediction.

Reference [7] compared four classification algorithms, namely Naïve Bayes, Random Forest, Linear Regression and Decision Tree, for predicting heart disease. Among these algorithms, Random Forest gave the best accuracy, 90.16%.

The authors of [8] proposed a hybrid Random Forest model for prediction of cardiovascular disease (CVD). The study showed an accuracy of 88.7% for prediction of CVD using the proposed model.

METHODOLOGY

This section includes the dataset description, data pre-processing techniques and classification algorithms. The R programming language is used for analyzing the data.

Dataset Description

The heart disease dataset collected from [3] is used for this study. The dataset has 918 observations and contains 11 independent variables and a categorical target variable indicating whether heart disease is present or not. The variables of the dataset are:
- Age: Age of the patient [Years]
- Sex: Sex of the patient [M: Male, F: Female]
- ChestPainType: Chest Pain Type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: Resting Blood Pressure [mm Hg]
- Cholesterol: Serum Cholesterol [mg/dl]
- FastingBS: Fasting Blood Sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: Resting Electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: Maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: Exercise-induced Angina [Y: Yes, N: No]
- Oldpeak: Oldpeak = ST [Numeric value measured in depression]
- ST_Slope: The slope of the peak exercise ST segment [Up: up sloping, Flat: flat, Down: down sloping]
- HeartDisease: Output class [1: Heart disease, 0: Normal]
Data Preprocessing

Some pre-processing is required to make the data usable. We need to clean the data and code the attributes to numbers to fit classification models.

Firstly, we can see that there are no missing values in our data.

Secondly, there are some 0 values in the columns RestingBP and Cholesterol. But Resting Blood Pressure and Serum Cholesterol of a person can never be 0. So, these are bad values. These zeros are replaced with median values of the corresponding columns. Also, there are some negative values in the column Oldpeak. These negative values are converted to positive.

It is found from the summary of the raw data that about 77% of the values in the column FastingBS are 0. So, this column is not expected to have a great impact on classification, and the FastingBS column is dropped.

Values of some columns are categorical variables. So, we code them into numbers. The changes are shown in Table III.

Now the dataset is ready for analysis.
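The pre-processing steps described above can be sketched as follows. This is an illustrative Python sketch only (the study itself was carried out in R and its code is not given). The function names and the toy rows are hypothetical, and taking the median over the non-zero entries is an assumption, since the text does not say whether the zeros were included when the median was computed. The category codes are those listed in Table III.

```python
from statistics import median

def replace_zeros_with_median(values):
    """Replace zero readings by the median of the non-zero entries.

    Assumption: the median is taken over the non-zero values only.
    """
    valid = [v for v in values if v != 0]
    fill = median(valid)
    return [fill if v == 0 else v for v in values]

def make_oldpeak_positive(values):
    """Convert negative Oldpeak readings to positive values."""
    return [abs(v) for v in values]

# Coding of categorical levels into numbers, as listed in Table III.
CODES = {
    "Sex": {"M": 1, "F": 2},
    "ChestPainType": {"ATA": 1, "NAP": 2, "ASY": 3, "TA": 4},
    "RestingECG": {"Normal": 1, "ST": 2, "LVH": 3},
    "ExerciseAngina": {"Y": 1, "N": 0},
    "ST_Slope": {"Down": -1, "Flat": 0, "Up": 1},
}

def encode(column, values):
    """Map the levels of one categorical column to their numeric codes."""
    return [CODES[column][v] for v in values]

# Toy values (hypothetical, not taken from the dataset).
resting_bp = replace_zeros_with_median([120, 0, 140, 130])
oldpeak = make_oldpeak_positive([1.5, -2.6, 0.0])
st_slope = encode("ST_Slope", ["Up", "Flat", "Down"])
```

Dropping the FastingBS column is not shown, since it amounts to omitting that column from the model inputs.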

Multicollinearity

Multicollinearity refers to the presence of inter-correlations among the independent variables of the data. For classification, the absence of multicollinearity is desirable. If multicollinearity exists, skewed or misleading results can be obtained when we study the power of each variable independently to predict or interpret the dependent variable using a statistical model. In the presence of multicollinearity, the estimated effects of the independent variables in a statistical model can have wider confidence intervals and less reliable probabilities. One might not be able to trust the p-values to identify independent variables that are statistically significant.

Statistic	Age	RestingBP	Cholesterol	MaxHR	Oldpeak
Count	918	918	918	918	918
Min.	28.00	0.0	0.0	60.00	-2.6000
1st Qu.	47.00	120.0	173.2	120.0	0.0000
Median	54.00	130.0	223.0	138.0	0.6000
Mean	53.51	132.4	198.8	136.8	0.8874
3rd Qu.	60.00	140.0	267.0	156.0	1.5000
Max.	77.0	200.0	603.0	202.0	6.2000

TABLE I. SUMMARY OF CONTINUOUS VARIABLES OF RAW DATA

Variable	Counts by level
Sex	F: 193, M: 725
ChestPainType	ASY: 496, ATA: 173, NAP: 203, TA: 46
FastingBS	0: 704, 1: 214
RestingECG	LVH: 188, Normal: 552, ST: 178
ExerciseAngina	N: 547, Y: 371
ST_Slope	Down: 63, Flat: 460, Up: 395
HeartDisease	0: 410, 1: 508

TABLE II. SUMMARY OF CATEGORICAL VARIABLES OF RAW DATA

Column Name	Actual Value	Coded Value
Sex	M	1
Sex	F	2
ChestPainType	ATA	1
ChestPainType	NAP	2
ChestPainType	ASY	3
ChestPainType	TA	4
RestingECG	Normal	1
RestingECG	ST	2
RestingECG	LVH	3
ExerciseAngina	Y	1
ExerciseAngina	N	0
ST_Slope	Down	-1
ST_Slope	Flat	0
ST_Slope	Up	1

TABLE III. LIST OF CODING INTO NUMERIC VALUES

Variance Inflation Factor (VIF)

Variance Inflation Factor (VIF) is a measure of the amount of multicollinearity in a dataset. Mathematically, the VIF of the jth independent variable is

$\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2}$, (1)

where $R_j$ is the multiple correlation coefficient between the jth and the other independent variables. A large value of VIF indicates strong multicollinearity among the variables. Generally, multicollinearity is considered to exist if VIF is greater than 5 or 10, depending on the situation. A VIF value of less than 5 is generally taken to indicate the absence of multicollinearity.
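To make the thresholds concrete, the following Python sketch uses the usual definition VIF_j = 1 / (1 - R_j^2) and its inverse. The function names are hypothetical. The value 1.622914 is the largest VIF reported in Table IV (ST_Slope), and the sketch shows how little of that variable's variance is explained by the other predictors.

```python
def vif_from_r_squared(r_squared):
    """VIF of a variable whose regression on the other predictors has the given R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)

def r_squared_from_vif(vif):
    """Invert the VIF formula to recover R^2."""
    return 1.0 - 1.0 / vif

# A VIF of 5 corresponds to R^2 = 0.8, and a VIF of 10 to R^2 = 0.9.
threshold_r2 = r_squared_from_vif(5.0)

# Largest VIF in Table IV (ST_Slope): the implied R^2 is roughly 0.38.
implied_r2 = r_squared_from_vif(1.622914)
```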
Principal Component Analysis (PCA)

PCA is a dimension reduction technique for data with a large number of predictor variables. PCA is mainly used for two purposes: dimensionality reduction and checking for the existence of multicollinearity. If the number of variables in the dataset can be reduced using PCA, i.e., if the variance of many variables can be explained by a few principal components, then it can be concluded that multicollinearity exists.
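The check can be illustrated with the standard deviations of the principal components reported in Table V. The Python sketch below (function names are hypothetical) converts them to proportions of variance and counts how many components are needed to reach a chosen share of the total variance. The 95% target is the one used later in the Discussion.

```python
# Standard deviations of the principal components, copied from Table V.
SDEV = [1.829433, 1.14266, 1.023245, 0.9901791, 0.9213231, 0.908279,
        0.8358615, 0.7840512, 0.7166974, 0.6671866, 0.6115684]

def cumulative_proportion(sdev):
    """Cumulative share of total variance explained by the first k components."""
    variances = [s * s for s in sdev]
    total = sum(variances)
    running, shares = 0.0, []
    for v in variances:
        running += v
        shares.append(running / total)
    return shares

def components_needed(sdev, target=0.95):
    """Smallest number of leading components whose cumulative share reaches the target."""
    for k, share in enumerate(cumulative_proportion(sdev), start=1):
        if share >= target:
            return k
    return len(sdev)

cumulative = cumulative_proportion(SDEV)
n_needed = components_needed(SDEV)
```

With these values the first component explains about 30% of the variance, and ten components are needed to pass 95%, so almost no reduction in dimension is available.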
Logistic Regression

Logistic regression is used when the objective is to classify the target variable into two or more categories. In binary Logistic regression model, the target variable is classified into two classes i.e., 0 and 1, which in our case refers to negative or positive respectively for heart disease. For fitting Logistic regression, the Sigmoid function is used to estimate the probability of the data point belonging to the positive class.
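As a minimal illustration of the role of the sigmoid function, the Python sketch below turns a linear score into a probability and then into a class label. The weights, intercept and the 0.5 threshold are hypothetical values chosen for illustration; they are not coefficients fitted in this study.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(features, weights, intercept, threshold=0.5):
    """Return the estimated probability of the positive class and the predicted label."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical coefficients and one hypothetical observation.
p, label = predict([1.0, 0.0], weights=[1.5, -0.5], intercept=-0.5)
```

Here the linear score is 1.0, so the estimated probability is about 0.73 and the observation is assigned to class 1.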
Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm that classifies cases by finding a separator. SVM first maps the data to a high-dimensional feature space so that data points can be categorized even when they are not linearly separable in the original space. Mapping data into a higher-dimensional space is called kernelling, and the mathematical function used for the mapping is called a kernel function. There are various types of kernel functions. In this study, the radial kernel is used to classify heart disease.
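For reference, the radial (Gaussian) kernel is commonly written as K(x, y) = exp(-gamma * ||x - y||^2). The Python sketch below evaluates it for two points. The value of gamma is a hypothetical choice, since the kernel parameters used in the study are not reported.

```python
import math

def rbf_kernel(x, y, gamma=0.1):
    """Radial kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points give 1.0
far_apart = rbf_kernel([0.0, 0.0], [3.0, 4.0])    # squared distance 25, so exp(-2.5)
```

The kernel value decreases from 1 towards 0 as the two points move apart, which is how it acts as a similarity measure in the transformed space.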
Random Forest Model

The Random Forest method is a supervised machine learning algorithm for classification. It constructs multiple decision trees, and the final decision is made by majority vote over the decisions of the individual trees. The advantage of a random forest over a single decision tree is that it reduces the high variance to which an individual decision tree is prone, and thereby mitigates the overfitting problem of a decision tree. Another advantage of using the Random Forest model as a classifier is that it provides variable importance as an output, which helps to understand the impact of the various variables on classification.

Fig. 1. Simplification of Random Forest Model
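The voting step can be sketched in a few lines of Python. The list of votes is hypothetical, and the sketch covers only how the individual tree decisions are combined, not how the trees are grown.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    In case of a tie, the class that appears first in the list wins.
    """
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]

# Hypothetical decisions of five trees for one patient (1: heart disease, 0: normal).
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)
```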
Naïve Bayes Classifier

Naïve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. This model is used for binary classification, text classification, spam filtering, sentiment analysis, recommendation systems, etc. Bayes' theorem is

$P(y \mid X = x) = \dfrac{P(X = x \mid y)\, P(y)}{P(X = x)}$. (2)
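A small worked example may help. Bayes' theorem gives the posterior P(y | X = x) as P(X = x | y) P(y) divided by P(X = x), and under the naïve independence assumption P(X = x | y) is the product of the per-feature likelihoods. In the Python sketch below, the priors are the class shares of the dataset (508 and 410 out of 918), while the per-feature likelihoods are hypothetical numbers chosen only for illustration.

```python
def naive_bayes_posterior(priors, likelihoods):
    """Posterior class probabilities under the naive independence assumption.

    priors:      {class: P(y)}
    likelihoods: {class: [P(x_1 | y), P(x_2 | y), ...]}
    """
    joint = {}
    for cls, prior in priors.items():
        prob = prior
        for likelihood in likelihoods[cls]:
            prob *= likelihood          # independence: multiply per-feature terms
        joint[cls] = prob
    evidence = sum(joint.values())      # P(X = x), the normalising constant
    return {cls: value / evidence for cls, value in joint.items()}

priors = {1: 508 / 918, 0: 410 / 918}           # class shares in the dataset
likelihoods = {1: [0.7, 0.6], 0: [0.2, 0.5]}    # hypothetical feature likelihoods
posterior = naive_bayes_posterior(priors, likelihoods)
```

With these illustrative numbers the posterior probability of heart disease is about 0.84, so the observation would be assigned to class 1.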
Linear Discriminant Analysis (LDA)

Linear discriminant analysis is used as a tool for classification, dimension reduction and data visualisation. Despite its simplicity, LDA often produces powerful, reasonable and interpretable classification results. To perform classification by LDA, we consider a random variable X that comes from one of K classes, with class k having density $f_k(x)$ on $\mathbb{R}^p$. A discriminant rule tries to divide the data space into K disjoint regions $\mathbb{R}_1, \mathbb{R}_2, \ldots, \mathbb{R}_K$ that represent all the classes. An observation x is allocated to class j if x lies in region $\mathbb{R}_j$, following the Bayesian rule when class prior probabilities are assumed and the Maximum Likelihood rule when they are not.
Cross Validation

Cross-validation is a resampling method that uses different partitions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. First, the shuffled dataset is split into k groups. Then, in each iteration, one of the k groups is held out as the test set and the model is trained on the remaining k-1 groups, so that every group serves as the test set exactly once. Finally, the results are summarized.
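The splitting scheme can be sketched in Python as follows. The function name, the seed and the way the folds are formed are illustrative assumptions; the study's own resampling was done in R. With k = 10 and three repetitions on 918 observations, the sketch produces 30 train/test pairs, each with a test fold of 91 or 92 observations.

```python
import random

def repeated_kfold(n, k=10, repeats=3, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)                       # reshuffle for every repetition
        folds = [indices[i::k] for i in range(k)]  # k nearly equal groups
        for i in range(k):
            test = folds[i]
            train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
            yield train, test

splits = list(repeated_kfold(918))
n_resamples = len(splits)
```

A model would be fitted on each training part and scored on the matching test part; the 30 accuracies are then summarized by their mean, median and quartiles.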
Performance Metrics

TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.

Precision = TP / (TP + FP) (3)

Recall = TP / (TP + FN) (4)

F-Score = (2 * Precision * Recall) / (Precision + Recall) (5)
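These metrics, together with accuracy, computed as (TP + TN) / (TP + TN + FP + FN), can be calculated from the four confusion-matrix counts as in the Python sketch below. The counts used are hypothetical: the paper does not report its confusion matrices, and these values were chosen only because, for a test set of 184 observations, they round to the Random Forest row of Table VI.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts.

    Assumes at least one predicted positive and at least one actual positive,
    so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# Hypothetical counts for a test set of 184 observations (not reported in the paper).
metrics = classification_metrics(tp=67, tn=92, fp=11, fn=14)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```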

EXPERIMENTAL RESULTS

Calculation of VIF

The Variance Inflation Factors of the independent variables of our dataset are given in Table IV.
Analyzing Principal Components

Analyzing PCA, obtained results are shown in Table V.
Train-Test Splitting

We split our dataset into train set and test set. 80% of total data is used to train the model and remaining rows are used to test the performance of the model. There are 734 and 184 observations respectively in the train and test set.
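The sizes of the two sets follow from taking 80% of 918 and rounding down. A minimal Python sketch of such a split is given below. The function name and seed are hypothetical, and a plain random shuffle is assumed because the paper does not state whether the split was stratified by class.

```python
import random

def train_test_split_indices(n, train_fraction=0.8, seed=0):
    """Shuffle the row indices and cut them into a training and a test part."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_fraction)   # 80% of 918, rounded down to 734
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split_indices(918)
```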


Performance of models using Test Data

The five proposed models, i.e., Logistic Regression, SVM, Random Forest model, Naïve Bayes Classifier and LDA, are fitted to the train set. Then the observations of the test set are predicted using these models. The performance metrics thus obtained for each model are shown in Table VI.

Accuracy Obtained from Cross Validation

We compared the models using a resampling method. 10-fold cross-validation is used here, repeated 3 times, so the total number of resamples is 30.

Accuracy results obtained from cross-validation are shown in Table VII.
Variable Importance

From the Random Forest model, variable importance is obtained in terms of the Mean Decrease in Gini coefficient. It is produced simultaneously during training of the model. The Mean Decrease in Gini coefficient for each predictor variable, obtained from the Random Forest model fitted to the training data, is shown in Table VIII.

Variables	VIF
Age	1.361663
Sex	1.092017
ChestPainType	1.258605
RestingBP	1.100360
Cholesterol	1.038561
RestingECG	1.090604
MaxHR	1.428407
ExerciseAngina	1.455541
Oldpeak	1.539348
ST_Slope	1.622914

TABLE IV. VARIANCE INFLATION FACTORS OF INDEPENDENT VARIABLES

Component	Standard Deviation	Proportion of Variance	Cumulative Proportion
PC1	1.829433	0.304260	0.304260
PC2	1.14266	0.11870	0.42295
PC3	1.023245	0.095180	0.518140
PC4	0.9901791	0.0891300	0.6072700
PC5	0.9213231	0.0771700	0.6844400
PC6	0.908279	0.075000	0.759440
PC7	0.8358615	0.0635100	0.8229500
PC8	0.7840512	0.0558900	0.8788400
PC9	0.7166974	0.0467000	0.9255300
PC10	0.6671866	0.0404700	0.9660000
PC11	0.6115684	0.0340000	1.0000000

TABLE V. INFORMATION OF PRINCIPAL COMPONENTS

Model	Accuracy	Precision	Recall	F-Score
Logistic Regression	0.832	0.821	0.790	0.805
Support Vector Machine	0.832	0.821	0.790	0.805
Random Forest Model	0.864	0.859	0.827	0.843
Naïve Bayes Classifier	0.832	0.798	0.827	0.812
Linear Discriminant Analysis	0.832	0.812	0.802	0.807

TABLE VI. PERFORMANCE METRICS PREDICTING OBSERVATIONS OF TEST SET

Model	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
LR	0.7826	0.8179	0.8470	0.8479	0.8767	0.9239
SVM	0.7935	0.8352	0.8478	0.8522	0.8767	0.9130
RF	0.7717	0.8478	0.8688	0.8693	0.8913	0.9239
NB	0.7609	0.8261	0.8478	0.8417	0.8587	0.9022
LDA	0.7826	0.8175	0.8525	0.8486	0.8696	0.9348

TABLE VII. ACCURACY FROM CROSS VALIDATION

Variables	Mean Decrease in Gini
ST_Slope	88.109127
Oldpeak	45.819191
ChestPainType	44.594254
MaxHR	41.805309
Age	32.061294
Cholesterol	31.044686
ExerciseAngina	27.399798
RestingBP	26.154423
Sex	13.510075
RestingECG	8.977059

TABLE VIII. VARIABLE IMPORTANCE
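The quantity behind the Mean Decrease in Gini is the drop in Gini impurity produced by a split. For two classes, the impurity of a node with a proportion p of positive cases is 2p(1 - p). In the usual formulation, the decreases from all splits that use a given variable are accumulated and averaged over the trees of the forest. The Python sketch below computes the decrease for a single split. The root counts are the class counts of the full dataset, while the split itself is hypothetical.

```python
def gini_impurity(n_pos, n_neg):
    """Gini impurity 2p(1 - p) of a node with the given class counts."""
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    p = n_pos / n
    return 2.0 * p * (1.0 - p)

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its two children.

    Each argument is a (n_pos, n_neg) tuple of class counts.
    """
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini_impurity(*left) \
        + (sum(right) / n) * gini_impurity(*right)
    return gini_impurity(*parent) - weighted_children

root = (508, 410)     # class counts of the full dataset
left = (400, 100)     # hypothetical child node
right = (108, 310)    # hypothetical child node
decrease = gini_decrease(root, left, right)
```

A variable whose splits separate the classes well produces large decreases of this kind, which is why a high value in Table VIII marks an important predictor.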

Fig. 2. Boxplot for Results of Cross Validation

DISCUSSION

From Table IV, it is seen that the VIFs are very close to 1. So, there is no significant multicollinearity in the data. Also, Table V shows that 10 out of 11 principal components are required to explain 95% of the variance. So, no significant dimension reduction is possible. This also indicates the absence of multicollinearity, which supports the information obtained from the VIF values. So, it was appropriate to fit various classification models to the dataset to predict heart disease.

We obtained an accuracy of 86.4% when the Random Forest model was used to predict the test data. Resampling gives mean and median accuracies of 86.93% and 86.88%, respectively, when 10-fold cross-validation is used with 3 repetitions.

From the Random Forest model, we get the importance of the various variables in making predictions. The mean decrease in Gini coefficient of ST_Slope is 88.109, which is the highest among all independent variables. This value is lowest for RestingECG, at 8.977.
CONCLUSION

From Table VI, it can be concluded that the Random Forest model gives the best prediction of the existence of heart disease when predictions are made on the test data. From Table VII and Fig. 2, it is seen that resampling also supports the Random Forest being the best of the models we experimented with. The rest of the models give more or less similar performance.

From the variable importance (Table VIII) obtained from the Random Forest model, it can be noted that ST_Slope is the most important factor for prediction of heart disease. Oldpeak, ChestPainType and MaxHR are the next most important variables, in that order, with close importance values. RestingECG and Sex are the least important variables.

ACKNOWLEDGMENT

First of all, the authors would like to thank the Almighty for being able to complete this work successfully. The authors would like to express their sincere gratitude to the respected professors of the Department of Statistics, University of Kalyani and the Department of Applied Statistics, MAKAUT WB for their cooperation in this project.

The authors would like to extend their gratitude to the Kaggle website and the creator of the dataset for keeping the dataset openly accessible. The dataset is the spine of the study.

The authors would like to thank their parents, family members and friends for their continuous support in various aspects of the study.

REFERENCES

[1] https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death

[2] https://www.downtoearth.org.in/blog/health/india-s-burden-of-heart-diseases-study-says-elderly-women-more-at-risk-74993

[3] https://www.kaggle.com/fedesoriano/heart-failure-prediction

[4] M. Pal and S. Parija, Prediction of Heart Diseases using Random Forest, Journal of Physics: Conference Series 1817 012009, 2021

[5] C. Sh. Zhu, C. U. Idemudia and W. F. Feng, Improved Logistic Regression model for Diabetes prediction by integrating PCA and K-means techniques, Informatics in Medicine Unlocked 17 (2019) 100179

[6] A. Yazdani, K. D. Varathan, Y. K. Chiam, A. W. Malik and W. A. W. Ahmad, A novel approach for Heart Disease prediction using Strength Scores with significant predictors, BMC Medical Informatics and Decision Making 21 (2021)

[7] A. Rajdhan, A. Agarwal, M. Sai, D. Ravi and P. Ghuli, Heart Disease prediction using Machine Learning, International Journal of Engineering Research and Technology 09 (2020)

[8] A. Rairikar, V. Kulkarni, V. Sabale, H. Kale and A. Lamgunde, Heart Disease prediction using Data Mining techniques, International Conference on Intelligent Computing and Control (I2C2), 2017
