# Heart Disease Prediction using Data Mining Techniques


Pratiksha Shetgaonkar

SRIEIT-Goa

Dr. Shailendra Aswale

SRIEIT-Goa

Abstract: The heart is one of the most critical organs of the human body, and life depends on its efficient functioning. Heart disease is one of the major causes of mortality in today's world and is said to be the leading cause of death globally. It is often difficult for medical professionals to predict heart disease in time. The health sector now holds a large amount of valuable hidden information that could prove very helpful in making predictive decisions, especially in the field of medicine. Data mining is a technique used to analyze vast datasets and derive significant and useful results with the help of AI-based techniques. This article applies three such methods, namely Decision Tree, Naïve Bayes, and Neural Network, to forecasting cardiovascular (heart) disease. Each method is evaluated under different parameter settings and optimized for better accuracy, and the accuracies of the methods are then compared. The most accurate technique is then implemented to predict whether or not a man or a woman will have coronary heart disease. Medical practitioners can use this technique for early prediction of the disease so that the patient can receive timely care.

Keywords: Data Mining, Artificial Intelligence, Heart, Disease, Prediction

1. INTRODUCTION

Cardiovascular disease has become one of the most widespread diseases in the world. It is estimated to have caused around 17.9 million deaths in 2017, which constitutes about 15% of all natural deaths [13]. Cardiovascular disease is a chronic heart disease and can be detected at the initial stages by measuring health parameters such as blood pressure, cholesterol level, heart rate, and glucose level [13]. It affects not only human health but also the economies and expenditure of countries [14]. Several data mining and machine learning algorithms are being developed and researched for predicting different types of diseases [28]. Similarly, many research articles show that numerous data mining, machine learning, and hybrid algorithms are being studied, developed, and investigated to help detect and predict heart disease at an early stage [22-26]. Heart disease diagnosis is the process of detecting or predicting heart disease from a patient's records. Doctors may not be able to diagnose a patient properly in a short time, especially when the patient suffers from more than one disease [10]. The authors in [18] surveyed numerous research papers from different years on the prediction of heart disease and concluded that data mining techniques are better at predicting heart disease.

Classification techniques are widely used in healthcare because of their ability to process very large datasets. The techniques commonly used in healthcare are Naïve Bayes, support vector machine, nearest neighbor, decision tree, fuzzy logic, fuzzy-based neural network, artificial neural network, and genetic algorithms [1].

2. RELATED WORK

Several researchers have studied, experimented with, and analyzed numerous techniques for heart disease prediction, including techniques for classification and feature selection.

The authors of [1] proposed the hybrid HRFLM approach, which combines the characteristics of Random Forest (RF) and the Linear Model (LM), and obtained a prediction accuracy of 88.4% [1].

In a 2019 study, the authors mainly tried to increase prediction accuracy by using various feature selection techniques. Different data mining techniques, i.e., Decision Tree, Logistic Regression, Logistic Regression SVM, Naïve Bayes, and Random Forest, were applied individually in RapidMiner to a UCI heart disease dataset, and the results were compared with past research. The study concluded that Logistic Regression, which obtained an accuracy of 84.85%, performed best for predicting heart disease [2].

In 2018, the researchers built prediction models using different combinations of features and seven classification techniques: k-NN, DT, NB, LR, SVM, NN, and VOTE (a hybrid technique combining Naïve Bayes and Logistic Regression). Their experimental results showed that the best-performing data mining technique, VOTE with NB and LR, achieved an accuracy of 87.4% in heart disease prediction [3]. The 10-fold cross-validation technique was used to validate the performance of the models [3].
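The 10-fold cross-validation mentioned above can be sketched in plain Python as follows. This is a minimal illustration and not the procedure of [3]: the function names, the shuffling seed, and the `fit`/`predict` callables are assumptions introduced here.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Distribute the shuffled records as evenly as possible over k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_val_accuracy(records, labels, fit, predict, k=10):
    """Average test accuracy over k folds.

    `fit(X, y)` returns a model and `predict(model, x)` returns a label;
    both are hypothetical callables supplied by the caller.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(records), k):
        model = fit([records[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        hits = sum(predict(model, records[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

Every record is used for testing exactly once and for training in the remaining folds, which is why the averaged score is less sensitive to one lucky or unlucky split.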

In 2019, the authors developed an automated diagnostic system based on a χ² statistical model and a deep neural network (χ²-DNN model) for improved diagnosis of heart disease. Their method targeted the two main problems of underfitting and overfitting: the proposed diagnostic system neither underfits nor overfits the training data, and it achieved a testing accuracy of 93.33% [4].

The authors in [5] proposed a hybrid model that combined the decision tree technique, i.e., the C4.5 algorithm, with ANN, and named it hybrid DT. When this model was compared with the C4.5 algorithm and ANN on the same dataset, it proved to be more accurate, with an accuracy of 78.14% [5].

In 2019, the researchers implemented a hybrid approach combining various techniques, which used the Fast Correlation-Based Feature Selection (FCBF) method to filter out redundant features and improve the quality of heart disease classification. This method achieved an accuracy of more than 90% [6].

The authors in [7] used an ensemble of classifiers. The ensemble algorithms bagging, boosting, stacking, and majority voting were employed in the experiments. Proper feature selection helped to improve the accuracy of the ensemble algorithms. The highest accuracy was obtained with majority voting on the feature set FS2 [7].

In 2020, the authors used seven different intelligent techniques to predict coronary heart disease using the Statlog and Cleveland heart disease datasets. In their comparative study, the deep neural network performed best on the Statlog dataset with an accuracy of 98.15%, while on the Cleveland dataset SVM achieved an accuracy of 97.36% [8].

In 2019, in a study on heart disease diagnosis, the authors used the heart disease dataset from the UCI machine learning repository and proposed a Multi-Layer Pi-Sigma Neuron Model (MLPSNM). The model is based on the Pi-Sigma model, whose architecture and calculations are, according to the authors, less complex than those of previously proposed models. The network was trained with the backpropagation (BP) algorithm using a bipolar sigmoid activation function, and PCA and LDA preprocessing methods were used to reduce the dimensionality of the dataset. In the SVM-LDA method, the attributes closer to the hyperplane are selected. The k-fold validation method was used to validate the network, which converged after 50 iterations. The proposed model achieved 94.53% classification accuracy for heart disease diagnosis using PCA [9].
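For readers unfamiliar with the bipolar sigmoid named above, a minimal sketch of its usual textbook definition and derivative follows. It is assumed here that [9] uses this standard form; the function names are illustrative.

```python
import math

def bipolar_sigmoid(x):
    """Bipolar sigmoid: maps any real input into the open range (-1, 1)."""
    # tanh(x / 2) is algebraically equal to (1 - e^-x) / (1 + e^-x).
    return math.tanh(x / 2.0)

def bipolar_sigmoid_derivative(x):
    """Derivative needed by backpropagation: 0.5 * (1 + f(x)) * (1 - f(x))."""
    fx = bipolar_sigmoid(x)
    return 0.5 * (1.0 + fx) * (1.0 - fx)
```

Unlike the ordinary (binary) sigmoid, whose outputs lie in (0, 1), this function is centered on zero, so a weighted sum of zero produces an output of zero.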

The authors in [11] compared several supervised machine learning (ML) algorithms for predicting clinical events in terms of their internal validity and accuracy. The results, obtained using two statistical software platforms, RStudio and RapidMiner, were then compared and showed that the decision tree algorithm gave better results.

The authors in [15] performed a comparative study of heart disease diagnosis systems using the top ten data mining classification algorithms [27]. The algorithms discussed were C4.5, SVM, AdaBoost, kNN, Naïve Bayes, CART, Random Forest, Bagging, Logistic Regression, and Multilayer Perceptron (MLP). In terms of accuracy, the top three algorithms in their experimental study were Random Forest with 78.0%, kNN with 71.6%, and MLP with 63.8%; the top three in terms of speed were AdaBoost, kNN, and Naïve Bayes.

The authors in [16] implemented prediction algorithms and concluded that the accuracy of machine learning algorithms depends on the dataset used for training and testing.

In 2011, the researchers in [19] used classification algorithms such as RIPPER (Repeated Incremental Pruning to Produce Error Reduction), proposed by William W. Cohen, Decision Tree, ANN, and Support Vector Machine; their experimental results showed that the Support Vector Machine achieved the highest prediction accuracy [19]. The authors Sellappan & Palaniappan in [20] proposed an Intelligent Heart Disease Prediction System (IHDPS) using three data mining techniques (Naïve Bayes, decision tree, and neural network).

K. Srinivas, B. Kavita Rani, and A. Govardhan presented the use of numerous data mining techniques to predict heart attacks, using methods such as Decision Tree, Naïve Bayes, and ANN [21]. Data mining tools such as TANAGRA were used for the statistical learning algorithms.

3. PROPOSED METHODOLOGY

Based on our literature review, we concluded that the three techniques listed below are efficient at classification and prediction in terms of accuracy. We therefore experimented with these three techniques:

1. Neural Network

2. Decision Tree

3. Naïve Bayes

4. EXPERIMENTATION AND PERFORMANCE ANALYSIS

1. DATASET

We used the Heart Disease dataset from the UCI repository, available at https://archive.ics.uci.edu/ml/datasets/Heart+Disease.

We also consulted a local doctor, who helped us add more data to our database.

Our dataset consisted of 14 attributes and 668 records; details are given in Table 1 below.
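As an illustration of how such records might be read, the sketch below parses one comma-separated line into named fields. It assumes the column order and the `?` missing-value marker of the processed Cleveland file in the UCI repository; the name `target` for the diagnosis field and both function names are hypothetical, and the sample line is illustrative only. Whether the additional records collected for this study follow the same layout is not stated.

```python
# Assumed column order (UCI processed Cleveland file); "target" is a
# hypothetical name for the final diagnosis field.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

def parse_record(line):
    """Turn one comma-separated line into a dict of floats (None if missing)."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return {name: (None if v == "?" else float(v))
            for name, v in zip(COLUMNS, values)}

def binarize_target(record):
    """Collapse the diagnosis value into 0 (disease absent) or 1 (present)."""
    return 0 if record["target"] == 0 else 1

# Illustrative line in the assumed layout.
sample = parse_record("63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0")
```

Keeping missing values as `None` rather than silently dropping them leaves the choice of imputation or record removal to a later, explicit step.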

Table 1. Attributes of Heart Disease Dataset

| Sr. no | Attribute | Description | Values |
|---|---|---|---|
| 1 | Age | Age in years | Continuous |
| 2 | Sex | Male or female | 1 = male; 0 = female |
| 3 | Cp | Chest pain type | 1 = typical type 1; 2 = typical type angina; 3 = non-angina pain; 4 = asymptomatic |
| 4 | trestbps | Resting blood pressure | Continuous value in mm Hg |
| 5 | chol | Serum cholesterol | Continuous value in mg/dl |
| 6 | Restecg | Resting electrocardiographic results | 0 = normal; 1 = having ST-T wave abnormality; 2 = left ventricular hypertrophy |
| 7 | FBS | Fasting blood sugar | 1 = > 120 mg/dl; 0 = ≤ 120 mg/dl |
| 8 | thalach | Maximum heart rate achieved | Continuous value |
| 9 | exang | Exercise-induced angina | 0 = no; 1 = yes |
| 10 | oldpeak | ST depression induced by exercise relative to rest | Continuous value |
| 11 | slope | The slope of the peak exercise ST segment | 1 = upsloping; 2 = flat; 3 = downsloping |
| 12 | Ca | Number of major vessels colored by fluoroscopy | 0-3 value |
| 13 | thal | Defect type | 3 = normal; 6 = fixed; 7 = reversible defect |

2. DECISION TREE

Table 2 and Figure 1 below show the Decision Tree tested with different amounts of testing data, and how the accuracy can be improved by removing some of the attributes and testing again.

After manipulating the dataset, i.e., increasing and decreasing the amount of training and testing data, we obtained the following results for the three data mining techniques for heart disease prediction, shown in the tables and graphs below.
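Varying the share of training and testing data, as described above, amounts to repeating a holdout split with different fractions and measuring accuracy each time. The following is a minimal sketch of that idea; the function names, the default fraction, and the seed are assumptions and do not reflect the tool actually used for the experiments.

```python
import random

def train_test_split(records, labels, test_fraction=0.3, seed=0):
    """Shuffle the data and hold out `test_fraction` of it for testing."""
    order = list(range(len(records)))
    random.Random(seed).shuffle(order)
    n_test = max(1, round(len(records) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(records, train_idx), pick(labels, train_idx),
            pick(records, test_idx), pick(labels, test_idx))

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Calling `train_test_split` with, for example, `test_fraction=0.2` and then `0.4` reproduces the kind of configuration change the result tables summarize.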

Table 2. Decision Tree Data (Snapshot with few configuration changes)

Figure 1. Attribute v/s Accuracy graph for Table 2

3. NAÏVE BAYES

Table 3 below shows Naïve Bayes tested with different amounts of testing data, and how the accuracy can be improved by removing some of the attributes and testing again.

Table 3: Naïve Bayes Data (Snapshot with few configurations)

Figure 2. Attribute v/s Accuracy graph for Table 3

4. NEURAL NETWORK

Table 4 and Figure 3 below show the accuracy obtained for neural networks tested with different numbers of hidden layers, epochs, and folds, different learning rates, and different activation functions. To improve the accuracy, we removed some of the attributes.
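To make the tuned parameters concrete, the sketch below trains a very small feed-forward network in which the number of hidden units, the learning rate, and the number of epochs are explicit arguments. It is an illustration under stated assumptions, not the network used in the experiments: it has a single hidden layer, sigmoid activations, per-record gradient updates, and hypothetical function names.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(model, x):
    """Return (hidden activations, output probability) for one record."""
    w1, w2 = model
    # Each weight row stores its bias as the last element.
    h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in w1]
    out = sigmoid(sum(w * v for w, v in zip(w2, h)) + w2[-1])
    return h, out

def train_mlp(X, y, hidden=2, learning_rate=0.5, epochs=500, seed=0):
    """Train a one-hidden-layer network with per-record gradient descent."""
    rng = random.Random(seed)
    n_in = len(X[0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    for _ in range(epochs):
        for x, target in zip(X, y):
            h, out = forward((w1, w2), x)
            d_out = out - target  # gradient of cross-entropy w.r.t. output input
            d_hid = [d_out * w2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                w2[j] -= learning_rate * d_out * h[j]
                for i in range(n_in):
                    w1[j][i] -= learning_rate * d_hid[j] * x[i]
                w1[j][-1] -= learning_rate * d_hid[j]
            w2[-1] -= learning_rate * d_out
    return w1, w2

def predict(model, x):
    """Classify a record as 1 if the output probability is at least 0.5."""
    return 1 if forward(model, x)[1] >= 0.5 else 0
```

Each extra hidden unit adds one weight row to update per record and each extra epoch repeats the whole pass, which is where the additional computation time comes from.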

Table 4: Neural Network Data (Snapshot with few configuration changes)

Figure 3. Attribute v/s Accuracy graph for Table 4

5. MODEL COMPARISON

Figure 5 shows the graph for the accuracy of the three data mining techniques.

Fig 5: Accuracy levels for three data mining techniques

5. CONCLUSION AND FUTURE SCOPE

From the above graphs obtained from our implementation, we can conclude that when we increase the number of hidden layers, the result becomes less accurate and also takes more time, i.e., it is not efficient. Decreasing the learning rate also decreased the accuracy. With the neural network, we obtained the highest accuracy, 81.08%, when we used a smaller number of hidden layers with an increased learning rate and a larger training dataset.

Changing the attributes also changed the results. Removing the chest pain and cholesterol attributes decreased the accuracy of the decision tree, since both are important attributes for heart disease prediction. After removing the sex attribute, however, the accuracy remained unchanged, which led us to conclude that this attribute does not play an important role in disease prediction.
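One way to see why removing an attribute may or may not affect a decision tree is to compute its information gain, the split criterion used by ID3 and C4.5. The study does not state which criterion its tree used, so this is an assumption, and the four records below are hypothetical toy data rather than values from the dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy records: "cp" separates the classes, "sex" does not.
rows = [{"cp": 4, "sex": 1}, {"cp": 4, "sex": 0},
        {"cp": 1, "sex": 1}, {"cp": 1, "sex": 0}]
labels = [1, 1, 0, 0]
gain_cp = information_gain(rows, labels, "cp")
gain_sex = information_gain(rows, labels, "sex")
```

In this toy case `gain_cp` is 1.0 and `gain_sex` is 0.0, so a tree built on information gain would never split on `sex`, and dropping that column would leave its predictions unchanged.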

We also checked the accuracy of Naïve Bayes after removing some attributes, but the results did not change much, because the Naïve Bayes algorithm treats each attribute as independent of the others.
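The independence assumption can be made explicit with a small categorical Naïve Bayes sketch: the score of a class is the log prior plus one separate term per attribute, so removing an attribute simply removes one term. The function names, the Laplace smoothing, and the dict-based record layout are assumptions made for illustration, not the configuration used in the experiments.

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[(label, attribute)][value] += 1
            values_seen[attribute].add(value)
    return class_counts, value_counts, values_seen

def predict_naive_bayes(model, row):
    """Return the class with the highest log posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for attribute, value in row.items():
            # One independent factor per attribute, with Laplace smoothing.
            seen = value_counts[(label, attribute)][value]
            score += math.log((seen + 1) / (count + len(values_seen[attribute])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

When several attributes point to the same class, dropping one of them usually lowers every class score by a similar amount, which is consistent with the small change in accuracy reported above.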

For the neural network, we tried different numbers of hidden layers, learning rates, and sets of attributes to find better accuracy. When we increased the number of hidden layers, it gave better accuracy but its computation time increased, which was not good for prediction; when we reduced the number of hidden layers, it gave us better results with a much shorter calculation time, which was reliable. After analyzing the above graphs, we concluded that the decision tree gave the most accurate results, with 98.54%, compared to the other methods, which gave 85.01% (Naïve Bayes) and 81.83% (neural network), as can be seen from the accuracy graph.

We can make this system more efficient and reliable by using a larger number of training records and evaluating more datasets. We can also increase the number of features, adding factors such as junk food consumption, exercise, and tobacco use, to be more precise.

There is also scope to improve this system by integrating these approaches into a hybrid model that can deliver better outcomes than the individual methods.
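One simple way such a hybrid could be formed is majority voting over the three classifiers, the scheme reported for ensembles in the related work. The sketch below is only an illustration of that idea; the function names are hypothetical, each model is assumed to be a callable that maps a record to a label, and no such hybrid was built or evaluated in this study.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine one predicted label per model into a single label."""
    counts = Counter(predictions)
    # With an odd number of binary classifiers there can be no tie.
    return counts.most_common(1)[0][0]

def hybrid_predict(models, record):
    """`models` is a list of callables, each mapping a record to a label."""
    return majority_vote([model(record) for model in models])
```

For example, if the decision tree and Naïve Bayes predict 1 and the neural network predicts 0 for the same patient record, the hybrid returns 1.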

ACKNOWLEDGEMENTS

The authors of this study wish to express their gratitude to their team members, namely Amogh Power, Varsha Pawar, Seema Shilvant and Visheh Parab, for their support in completing the implementation of this study.

REFERENCES

1. Mohan, S., Thirumalai, C., & Srivastava, G. (2019). Effective heart disease prediction using hybrid machine learning techniques. IEEE Access, 7, 81542-81554.

2. Bashir, S., Khan, Z. S., Khan, F. H., Anjum, A., & Bashir, K. (2019, January). Improving heart disease prediction using feature selection approaches. In 2019 16th International Bhurban Conference on Applied Sciences and Technology (IBCAST) (pp. 619-623). IEEE.

3. Amin, M. S., Chiam, Y. K., & Varathan, K. D. (2019). Identification of significant features and data mining techniques in predicting heart disease. Telematics and Informatics, 36, 82-93.

4. Ali, L., Rahman, A., Khan, A., Zhou, M., Javeed, A., & Khan, J. A. (2019). An Automated Diagnostic System for Heart Disease Prediction Based on ${\chi^{2}}$ Statistical Model and Optimally Configured Deep Neural Network. IEEE Access, 7, 34938-34945.

5. Maji, S., & Arora, S. (2019). Decision tree algorithms for prediction of heart disease. In Information and Communication Technology for Competitive Strategies (pp. 447-454). Springer, Singapore.

6. Khourdifi, Y., & Bahaj, M. (2019). Heart disease prediction and classification using machine learning algorithms optimized by particle swarm optimization and ant colony optimization. Int. J. Intell. Eng. Syst., 12(1), 242-252.

7. Latha, C. B. C., & Jeeva, S. C. (2019). Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques. Informatics in Medicine Unlocked, 16, 100203.

8. Ayon, S. I., Islam, M. M., & Hossain, M. R. (2020). Coronary artery heart disease prediction: a comparative study of computational intelligence techniques. IETE Journal of Research, 1-20.

9. Burse, K., Kirar, V. P. S., Burse, A., & Burse, R. (2019). Various preprocessing methods for neural network-based heart disease prediction. In Smart innovations in communication and computational sciences (pp. 55-65). Springer, Singapore.

10. Tarawneh, M., & Embarak, O. (2019, February). Hybrid approach for heart disease prediction using data mining techniques. In International Conference on Emerging Internetworking, Data & Web Technologies (pp. 447-454). Springer, Cham.

11. Beunza, J. J., Puertas, E., García-Ovejero, E., Villalba, G., Condes, E., Koleva, G., … & Landecho, M. F. (2019). Comparison of machine learning algorithms for clinical event prediction (risk of coronary heart disease). Journal of biomedical informatics, 97, 103257.

12. Gonsalves, A. H., Thabtah, F., Mohammad, R. M. A., & Singh, G. (2019, July). Prediction of coronary heart disease using machine learning: an experimental analysis. In Proceedings of the 2019 3rd International Conference on Deep Learning Technologies (pp. 51-56).

13. Nalluri, S., Saraswathi, R. V., Ramasubbareddy, S., Govinda, K., & Swetha, E. (2020). Chronic Heart Disease Prediction Using Data Mining Techniques. In Data Engineering and Communication Technology (pp. 903-912). Springer, Singapore.

14. Gokulnath, C. B., & Shantharajah, S. P. (2019). An optimized feature selection based on genetic approach and support vector machine for heart disease. Cluster Computing, 22(6), 14777-14787.

15. Enriko, I. K. A. (2019, June). Comparative study of heart disease diagnosis using top ten data mining classification algorithms. In Proceedings of the 5th International Conference on Frontiers of Educational Technologies (pp. 159-164).

16. Singh, A., & Kumar, R. (2020, February). Heart Disease Prediction Using Machine Learning Algorithms. In 2020 International Conference on Electrical and Electronics Engineering (ICE3) (pp. 452-457). IEEE.

17. Barik, S., Mohanty, S., Rout, D., Mohanty, S., Patra, A. K., & Mishra, A. K. (2020). Heart Disease Prediction Using Machine Learning Techniques. In Advances in Electrical Control and Signal Systems (pp. 879-888). Springer, Singapore.

18. A. Powar, S. Shilvant, V. Pawar, V. Parab, P. Shetgaonkar and S. Aswale, "Data Mining & Artificial Intelligence Techniques for Prediction of Heart Disorders: A Survey," 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN), Vellore, India, 2019, pp. 1-7, doi: 10.1109/ViTECoN.2019.8899547.

19. Kumari, M., & Godara, S. (2011). Comparative study of data mining classification methods in cardiovascular disease prediction.

20. Dangare, C. S., & Apte, S. S. (2012). Improved study of heart disease prediction system using data mining classification techniques. International Journal of Computer Applications, 47(10), 44-48.

21. Ashish C, Lakhan A, Sahil A and Prof Y.K.Sharma, P.(2016). Heart Disease Prediction Using Data Mining Techniques. International Journal Of Research in Advent Technology.

22. Das, Resul, Turkoglu, Ibrahim, et al.: Effective diagnosis of heart disease through neural networks ensembles. J. Expert Syst. Appl. 36, 7675-7680 (2009)

23. Das, Resul, Turkoglu, Ibrahim, et al.: Diagnosis of valvular heart disease through neural networks ensembles. J. Comput. Methods Progr. Biomed. 93, 185-191 (2009)

24. Gokulnath, C., Priyan, M. K., Balan, E. V., Prabha, K. R., Jeyanthi, R.: Preservation of privacy in data mining by using PCA based perturbation technique. In: Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), 2015 International Conference on (pp. 202-206). IEEE (2015)

25. Babaoglu, et al.: Assessment of exercise stress testing with artificial neural network in determining coronary artery disease and predicting lesion localization. J. Expert Syst. Appl. 36, 2562-2566 (2009)

26. Rajeswari, K., et al.: Feature selection in ischemic heart disease identification using feed forward neural networks. Int. Symp. Robot. Intell. Sens. 41, 1818-1823 (2012)

27. Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., & Steinberg, D. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1-37 (2008).

28. Tilve, A., Nayak, S., Vernekar, S., Turi, D., Shetgaonkar, P. R., & Aswale, S. (2020, February). Pneumonia Detection Using Deep Learning Approaches. In 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE) (pp. 1-8). IEEE.