Developing A Web based System for Breast Cancer Prediction using XGboost Classifier

Nayan Kumar Sinha; Menuka Khulal; Manzil Gurung; Arvind Lal

doi:10.17577/IJERTV9IS060612

Volume 09, Issue 06 (June 2020)


Open Access
Paper ID : IJERTV9IS060612
Published (First Online): 26-06-2020
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Nayan Kumar Sinha, Menuka Khulal, Manzil Gurung, Arvind Lal

Department of Computer Science and Technology

Centre for Computers and Communication Technology, Chisopani, Sikkim, India

Abstract- Cancer is among the most common diseases today and a leading cause of death. It is not a single disease but a group of more than 100 distinct diseases, and it can arise in any tissue of the body and take many different forms. Breast cancer is a grave disease and one of the most widespread cancers among women worldwide. Because manual diagnosis takes long hours and diagnostic systems are not widely available, there is a need for an automatic diagnosis system for early detection. In this project we develop a web based diagnosis system, for which we carried out a comparative study of supervised machine learning classifiers to determine which gives the best accuracy. We use the Wisconsin breast cancer database (WBCD), a benchmark database for comparing results across algorithms. The classifiers compared are Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forest (RF), Adaboost Classifier and XGboost Classifier; each learns from past data to classify a tumor as benign or malignant and can then predict the category of a new input.

Keywords- WBCD, Support Vector Machine, K-Nearest Neighbor, Random Forest, Adaboost Classifier, XGboost Classifier.

INTRODUCTION

Breast cancer has become one of the most common fatal diseases among women. It can be diagnosed by classifying tumors, which are of two types: malignant and benign. Doctors need a reliable diagnostic procedure to distinguish between them, but this is generally difficult even for experts, so an automated diagnostic system is needed. As the most prevalent cancer in women, breast cancer has long had high incidence and mortality rates. According to the latest cancer statistics, breast cancer alone is expected to account for 25% of all new cancer diagnoses and 15% of all cancer deaths among women worldwide. When a sign or symptom appears, people usually visit a doctor, who may refer them to an oncologist if required. The oncologist can diagnose breast cancer by taking a thorough medical history, examining both breasts, and checking for swelling or hardening of any lymph nodes in the armpit. In this project we use the Wisconsin Breast Cancer Dataset (WBCD), obtained by fine needle aspiration biopsy, and apply machine learning algorithms to it to predict whether a patient has breast cancer. This paper compares the performance of five classification algorithms and their combination using an ensemble approach, chosen because their results are directly interpretable. We compare the XGboost classifier against the other four classification algorithms and analyse each classifier's accuracy to find the best fit for the prediction of breast cancer.
PROBLEM STATEMENT

To identify which machine learning classifier gives the best accuracy. To count the number of patients with benign and with malignant tumors, and to identify the type of tumor for a given patient.
PROPOSED METHODOLOGY

We acquired the Wisconsin Breast Cancer diagnosis dataset and used Jupyter Notebook and Anaconda Spyder as the coding platforms; the prediction UI (user interface) is served by Flask on a local server. Our methodology uses supervised learning classification techniques, namely Support Vector Classifier, KNN, Random Forest, Adaboost and XGboost Classifier, together with a dimensionality reduction technique.
Generally, a dataset contains features that vary widely in magnitude, unit and range, so all features need to be brought to the same level of magnitude. This can be achieved by scaling.
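As an illustration of the scaling step, the sketch below standardizes each feature column to zero mean and unit variance (a z-score). The paper does not say which scaler was used; the "Standard Scale" wording in its results table suggests standardization, so that is assumed here. The function name `standardize` and the sample rows are hypothetical.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Scale each column to zero mean and unit variance (z-score)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation; a constant column falls back to 1.0
    # so that we never divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical rows: two features on very different scales.
samples = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = standardize(samples)
```

After scaling, both columns hold the same values even though the raw magnitudes differed by a factor of 1000, which is what distance-based and margin-based classifiers such as KNN and SVM need.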

3.5 Model Selection

This is the most important phase, in which the machine learning algorithm for the system is selected. Data scientists use various types of machine learning algorithms, which can be classified as supervised learning and unsupervised learning. For this breast cancer prediction system, we need only supervised learning.

3.5.1 Supervised Learning

A supervised learning algorithm learns from labeled training data, which lets it predict outcomes for unseen data. It optimizes performance criteria using experience and helps solve various types of real-world computation problems. The most commonly used classifiers of this kind are briefly explained below.

3.5.1. (I) Support Vector Machine (SVM)

It is one of the most popular supervised learning algorithms and can be used for classification as well as regression problems, although in machine learning it is used primarily for classification. The aim of the SVM algorithm is to find the best decision boundary that separates n-dimensional space into classes, so that a new data point can easily be placed in the correct category. This best decision boundary is called the hyperplane of the SVM.
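To make the role of the hyperplane concrete, the sketch below shows only the decision rule of an already trained linear SVM: a point is classified by the side of the hyperplane w·x + b = 0 on which it falls. Training is not shown, and the weights, bias and feature values are hypothetical, not taken from the paper.

```python
def svm_predict(weights, bias, x):
    """Classify x by which side of the hyperplane w.x + b = 0 it lies on."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return "malignant" if score >= 0 else "benign"

# Hypothetical hyperplane for two features: x1 + x2 - 10 = 0.
weights, bias = [1.0, 1.0], -10.0

label_a = svm_predict(weights, bias, [8.0, 7.0])  # score is 5.0
label_b = svm_predict(weights, bias, [2.0, 3.0])  # score is -5.0
```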

3.5.1. (III) Random Forest Classifier

Random Forest classifier is an ensemble learning method that operates by constructing multiple decision trees; the final decision is the class chosen by the majority of the trees. A decision tree is a tree-shaped diagram used to determine a course of action, in which each branch represents a possible decision, instance, or reaction. One of the main advantages of the Random Forest algorithm is that it reduces the risk of overfitting and the required training time. It also offers a high level of accuracy.

It runs efficiently on large databases and can still produce accurate predictions by estimating missing data.

Fig 7: Random Forest Classifier
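The majority-vote step of a random forest can be sketched as follows. Each "tree" here is reduced to a one-rule stump for brevity; the feature names follow the WBCD attributes, but the thresholds are hypothetical and a real forest would learn full trees from bootstrapped samples.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Return the class predicted by the majority of the trees."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical one-rule trees (stumps); thresholds are illustrative only.
trees = [
    lambda s: "malignant" if s["clump_thickness"] > 5 else "benign",
    lambda s: "malignant" if s["cell_size"] > 3 else "benign",
    lambda s: "malignant" if s["bare_nuclei"] > 4 else "benign",
]

patient = {"clump_thickness": 8, "cell_size": 2, "bare_nuclei": 7}
prediction = forest_predict(trees, patient)  # two of three trees vote malignant
```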


Fig 5: Support Vector Machine

3.5.1. (II) K – Nearest Neighbor (K-NN)

It is one of the simplest machine learning algorithms based on the supervised learning technique. K-NN assumes similarity between the new case and the available cases and puts the new case into the category it is most similar to. It stores all the available data and classifies a new data point based on its similarity to the stored points, so new data can easily be assigned to a well-suited category.

Fig 6: K – Nearest Neighbor
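A minimal K-NN sketch, assuming Euclidean distance as the similarity measure and k = 3 (the paper states neither choice). The training points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; return the majority
    label among the k stored points closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical stored cases with two features each.
train = [
    ([1, 1], "benign"), ([2, 1], "benign"), ([1, 2], "benign"),
    ([8, 9], "malignant"), ([9, 8], "malignant"), ([9, 9], "malignant"),
]

near_malignant = knn_predict(train, [8, 8])
near_benign = knn_predict(train, [2, 2])
```

Because the prediction depends directly on distances, K-NN is sensitive to features with different ranges, which is why scaling is usually applied first.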

3.5.1. (IV) Adaboost Classifier

AdaBoost, or Adaptive Boosting, is an iterative ensemble boosting classifier. It builds a strong classifier by combining multiple poorly performing (weak) classifiers to achieve high accuracy. The concept behind AdaBoost is to set the weights of the classifiers and to reweight the training data in each iteration so that unusual (hard-to-classify) observations are predicted accurately. AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form

$$F_T(x) = \sum_{t=1}^{T} f_t(x)$$

where each f_t is a weak learner that takes an object x as input and returns a value indicating the class of the object.
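The prediction step of a boosted classifier can be sketched as a weighted vote. Here each weak learner's contribution is written as f_t(x) = alpha_t * h_t(x), where h_t returns +1 or -1 and alpha_t is the weight assigned to that learner during training; the sign of the sum gives the class. The weights and thresholds below are hypothetical, and the training loop that would produce them is not shown.

```python
def boosted_predict(weak_learners, x):
    """Sum the weighted votes alpha_t * h_t(x) and classify by the sign."""
    total = sum(alpha * h(x) for alpha, h in weak_learners)
    return 1 if total >= 0 else -1

# Hypothetical (alpha_t, h_t) pairs; +1 stands for malignant, -1 for benign.
weak_learners = [
    (0.9, lambda x: 1 if x[0] > 5 else -1),
    (0.4, lambda x: 1 if x[1] > 3 else -1),
    (0.3, lambda x: 1 if x[2] > 4 else -1),
]

vote_a = boosted_predict(weak_learners, [8, 2, 1])  # 0.9 - 0.4 - 0.3 > 0
vote_b = boosted_predict(weak_learners, [2, 4, 6])  # -0.9 + 0.4 + 0.3 < 0
```

In the first call the single most reliable learner outvotes two weaker ones, which is the difference between a weighted vote and a plain majority vote.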
CONFUSION MATRIX

A confusion matrix is a summary of prediction results on a classification problem: the numbers of correct and incorrect predictions are given as counts and broken down by class. It shows the ways in which a classification model is confused when it makes predictions, giving insight not only into the errors made by a classifier but, more importantly, into the types of errors being made.

Classification Rate/Accuracy:

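Classification Rate or Accuracy is given by the relation:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.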

The model gives a 0% type II error rate (no false negatives), which is the best possible result for this measure.
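The quantities above can be computed from paired lists of actual and predicted labels, as in the sketch below. It assumes "malignant" is the positive class, so a type II error is a malignant tumor predicted as benign. The labels are hypothetical and are not the paper's results.

```python
def confusion_counts(actual, predicted, positive="malignant"):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy_percent(tp, fp, tn, fn):
    """Share of all predictions that are correct, in percent."""
    return (tp + tn) / (tp + fp + tn + fn) * 100

def type_ii_error_percent(tp, fn):
    """Share of actual positives predicted as negative, in percent."""
    return fn / (tp + fn) * 100

# Hypothetical labels for five patients.
actual = ["malignant", "malignant", "benign", "benign", "benign"]
predicted = ["malignant", "malignant", "benign", "malignant", "benign"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = accuracy_percent(tp, fp, tn, fn)
type_ii = type_ii_error_percent(tp, fn)
```

In this example one benign case is flagged as malignant (a type I error), so accuracy is below 100% while the type II error rate is still zero.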
PROPOSED SYSTEM ARCHITECTURE

As shown in the diagram below, we first collected the dataset from the Wisconsin Breast Cancer Dataset (WBCD). Collecting appropriate data is essential before applying machine learning models. After collection, the data must be cleaned to remove unwanted observations and to delete duplicate or irrelevant values. The models mentioned above, which are used in this project to predict the chances of breast cancer, were then studied comparatively.
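The cleaning step might look like the following sketch, which drops duplicate rows and rows that contain a missing value. The paper does not describe its cleaning code; the `"?"` marker for missing values is an assumption based on how the publicly distributed WBCD file encodes them, and the rows are hypothetical.

```python
def clean_rows(rows, missing="?"):
    """Drop rows with a missing value and keep only the first copy of
    any duplicated row, preserving the original order."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row)
        if missing in row or key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

# Hypothetical raw rows: the last value in each row is the class label.
raw = [
    ["5", "1", "1", "benign"],
    ["5", "1", "1", "benign"],     # exact duplicate
    ["8", "?", "7", "malignant"],  # missing value
    ["8", "10", "7", "malignant"],
]
cleaned = clean_rows(raw)
```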

Fig 11: Work Flow
CONCLUSION AND FUTURE SCOPE

Various data mining and machine learning methods are available for analysing medical data. An important challenge in data mining and machine learning is to build accurate and computationally efficient classifiers for medical applications. In this project, we applied machine learning classifier algorithms to the Wisconsin Breast Cancer (original) dataset and compared their efficiency and effectiveness to find the best classification accuracy; the XGBOOST classifier gave the maximum accuracy.

As future scope, new deep learning algorithms should be implemented to detect the different stages and categories of breast cancer simultaneously.


Techniques	Accuracy without Standard Scaling	Accuracy with Standard Scaling
SVM	57%	96%
KNN	93%	57%
RF	97%	75%
Adaboost	94%	94%
XGboost	98%	98%
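The comparison in the table can be reduced to a simple selection of the classifier with the highest accuracy in each column. The sketch below copies the table's values as given; the helper name `best_classifier` is hypothetical.

```python
# Accuracies (%) copied from the table: (without scaling, with scaling).
results = {
    "SVM": (57, 96),
    "KNN": (93, 57),
    "RF": (97, 75),
    "Adaboost": (94, 94),
    "XGboost": (98, 98),
}

def best_classifier(results, column):
    """Return the technique with the highest accuracy in the given column
    (0 = without scaling, 1 = with scaling)."""
    return max(results, key=lambda name: results[name][column])

best_unscaled = best_classifier(results, 0)
best_scaled = best_classifier(results, 1)
```

Both selections return XGboost, in line with the paper's conclusion.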
