Detection of Breast Cancer using Machine Learning Techniques

Ragul T N; Karthikeyan I; Vaidhyanathan K; Dr. G R Hemalakshmi

doi:10.17577/IJERTV12IS040052

Volume 12, Issue 04 (April 2023)


Open Access
Paper ID : IJERTV12IS040052
Published (First Online): 24-04-2023
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Ragul T N1, Karthikeyan I1, Vaidyanathan K1, and Dr. G R Hemalakshmi2

1Department of Computer Science and Engineering, National Engineering College, Kovilpatti, Tamil Nadu.

2Assistant Professor (Senior Grade), Department of Computer Science, National Engineering College, Kovilpatti, Tamil Nadu, India.

Abstract: Breast cancer is caused by the abnormal growth of breast cells. These cells divide faster than healthy cells and continue to accumulate, forming a lump or mass, and they can spread through the breast to the lymph nodes or other parts of the body. Fortunately, the cancer is curable in its early stages. Breast cancer is among the 20 leading causes of death worldwide, affecting approximately 10% of the world's female population. As the number of people with breast cancer increases, effective predictive measures for early diagnosis can improve the prognosis and survival of patients. This study supports research into preventive measures against breast cancer through early diagnosis using machine learning techniques. In this project, supervised learning is used to analyze the features and determine whether a patient has a benign or malignant tumour. The evaluation is performed on patient datasets that contain features such as radius, texture, perimeter, area, and smoothness. Supervised learning is a method in which a machine is trained on data whose inputs and outputs are labelled; the model learns from the training data and then predicts outcomes for future data. Machine learning techniques are therefore of great importance in the early detection of breast cancer, supporting professionals and doctors in detecting the disease early and preventing its progression.

Keywords: Supervised Learning, predictive analysis

INTRODUCTION

Breast cancer is a type of cancer that occurs in the breast tissue. It develops when cells in the tissue grow uncontrollably and form a lump or tumour. Breast cancer is the most common cancer in women worldwide. Although it can occur in both men and women, it is far more common in women. It is estimated that 1 in 8 women in the United States will develop breast cancer in their lifetime.

There are many types of breast cancer, and treatment varies by type, stage, and other factors. The most common types of breast cancer include ductal carcinoma in situ, invasive ductal carcinoma, and invasive lobular carcinoma. Breast cancer can be detected early with screening tests such as mammography, which can help improve treatment options. Early diagnosis is essential because it improves prognosis and increases the chances of survival.

Breast cancer is caused by the abnormal growth of breast cells, which divide more rapidly than healthy cells and accumulate to form a lump or mass. The cancer cells may spread to other parts of the body or lymph nodes. Although breast cancer is among the top 20 causes of death worldwide and affects around 10% of women globally, it can be cured in its early stages.

To improve early diagnosis and increase patient survival rates, experts are exploring preventive measures for breast cancer using machine learning techniques. In this project, supervised learning is used to analyze features such as radius, texture, perimeter, area, and smoothness and to determine whether a patient has a benign or malignant tumour. Supervised learning is a method in which the machine is trained on labelled input and output data. The model learns from the training data to predict outcomes for future data. Machine learning techniques are therefore critical for early breast cancer detection and support doctors and experts in limiting the progression of the disease.
RELATED WORK

Tianyu Shen et al. proposed a hierarchical fused model based on deep learning and fuzzy learning for breast cancer diagnosis, covering lesion segmentation and disease grading. The key aim is to alleviate the drawbacks of deep learning in terms of interpretability, generalization ability, and few-shot learning. The proposed model consists of a pixel-wise segmentation unit based on ResU-segNet, a feature extraction unit based on domain knowledge, and a severity grading classification unit based on an IT2PFCM-fused feedforward neural network. Feature representation and rule-based learning integrated with domain knowledge ensure the interpretability of the system. Both segmentation and disease grading performance improve in a few-shot learning setting, and cross-dataset experiments demonstrate the improvement in generalization ability.

Ravi K. Samala et al. demonstrate that multi-stage transfer learning can utilize the knowledge gained through source tasks from unrelated and related domains. They show that limited data availability in a target domain can be alleviated by pre-training the convolutional neural network with data from similar auxiliary domains. They also show that the gain in convolutional neural network performance from the additional stage of fine-tuning with the auxiliary data depends on the relative sizes of the available training samples in the target and auxiliary domains and on the proper selection of the transfer learning strategy. Furthermore, when the training sample size is small, the variance in the performance of the trained network is significant, and reporting the best performance found through exhaustive searches on a "test" set can be overly optimistic. It is therefore essential to validate the generalizability of the trained convolutional neural network on unseen independent cases.

Yongjin Zhou et al. proposed a convolutional neural network-based radiomics technique on shear-wave elastography for breast cancer diagnosis. It is described as the first attempt to use CNN-based radiomics to automatically extract high-throughput features from shear-wave elastography in order to classify malignant and benign breast tumours. Another significant contribution is that segmenting the tumour is unnecessary, which reduces the amount of work and improves the performance of the classification model. Because the method needs neither prior segmentation nor manual feature extraction, it has excellent potential for use in computer-aided diagnosis systems. The classification model is extendable and flexible, and it can be retrained to generalize to a new dataset.

Jun Xu et al. proposed a Stacked Sparse Autoencoder framework for automated nuclei detection in breast cancer histopathology images. The Stacked Sparse Autoencoder model captures high-level feature representations of pixel intensity in an unsupervised manner. These features are learnt by training encoder and decoder networks to reduce the discrepancy between input and reconstruction as much as possible, which yields a set of weights and biases.

Michiel Kallenberg et al. proposed a technique that builds a feature hierarchy from raw data. Breast density segmentation and the scoring of mammographic texture are two distinct tasks that can both be addressed when the learnt features are used as the input to a simple classifier. The model learns features at multiple scales. A novel sparsity regularizer that considers both lifetime and population sparsity is introduced to control the model's capacity. The approach was evaluated on three different clinical datasets. The results show that the learnt breast density scores correlate with manual ones and that the learnt texture scores are predictive of breast cancer. The model is straightforward to use and generalizes to a wide range of other segmentation and scoring problems.

Bolei Xu et al. present a deep hybrid attention network for classifying breast cancer histological images. The network's attention mechanism automatically identifies the informative regions of the images in the BreakHis dataset; as a result, the network does not need to resize the images, which avoids information loss. Compared with the previous selection mechanism based on a partially observable Markov decision process, the proposed mechanism cuts training time in half. The method was tested on a publicly available dataset, where it achieves about 98% accuracy at four different magnifications.

Jingxin Liu et al. proposed a method that mimics the decision-making process of pathologists. It employs one fully convolutional network to extract every nucleus region (tumour and non-tumour) and a second fully convolutional network to extract the tumour nuclei region. A multi-column convolutional neural network then takes the outputs of the first two fully convolutional networks, together with an image describing the staining intensity, as input and serves as the high-level decision-making mechanism that directly outputs the H-Score of the input tissue microarray image. This is described as the first end-to-end system that takes a tissue microarray image as input and directly produces a clinical score. The experimental findings show that the H-Scores predicted by the model have a robust and statistically significant correlation with the scores of experienced pathologists, and that the discrepancy between the H-Scores of the algorithm and those of the pathologists is comparable to the inter-subject discrepancy between the pathologists.

Mandeep Rana et al. observed that the performance of each algorithm varies with the dataset and the parameter choices. The K-Nearest Neighbor method produced the best outcomes overall, and Naive Bayes and logistic regression also showed promising results in diagnosing breast cancer. Support Vector Machine is a powerful technique for predictive analysis, and the authors conclude that a Support Vector Machine with a Gaussian kernel is the best technique for predicting breast cancer recurrence.

Vikas Chaurasia et al. applied three breast cancer prediction models to two classes of patients: those with benign and those with malignant cancer. They employed the Naive Bayes, Radial Basis Function (RBF) Network, and J48 data mining techniques on a dataset obtained from the University of California Irvine Machine Learning Repository. Data selection, preprocessing, and transformation were used to create the prediction models. In this study, survivability was represented by a binary categorical survival variable computed from the variables in the raw dataset, with benign and malignant each represented by a distinct value. A 10-fold cross-validation process was used to obtain an unbiased assessment of the three methods' prediction performance: the dataset was divided into 10 mutually exclusive partitions using stratified sampling, and the procedure was repeated for each of the three prediction models. This gave a less biased way to compare the models. The findings showed that J48 came third with a classification accuracy of 93.41%, the RBF Network came second with 96.77%, and Naive Bayes performed best with 97.36%. To better understand the relative contributions of the independent variables to predicting survivability, the authors also performed sensitivity and specificity analysis on Naive Bayes, RBF Network, and J48. According to the sensitivity analysis, the prognosis factor Class is the most significant predictor.

Vanlalhmangaihsanga et al. reported that, in contrast to other models, K-Nearest Neighbor and Logistic Regression have notably poor accuracy during the training process. Compared to the Decision Tree and Random Forest classifiers, which have an accuracy of 75%, the Support Vector Machine also performed significantly better in terms of classification errors, counting both correctly and erroneously categorized instances. After the algorithms had been trained and their effectiveness assessed on the training dataset, the models were applied to the testing dataset. Surprisingly, the accuracy of the top-performing classifiers was 97% for Logistic Regression and Random Forest, compared to 57% for the Support Vector Machine. Logistic Regression and the Random Forest classifier are on an equal footing when looking at accuracy alone, but when efficiency, sensitivity, and specificity are considered, the Random Forest classifier performs significantly better.
METHODOLOGY

The data consists of 569 patients and 32 characteristics, which form the 32 columns of the dataset. The features are Id, Diagnosis, Radius_mean, Texture_mean, Perimeter_mean, Area_mean, Smoothness_mean, Concavity_mean, Compactness_mean, Concavepoints_mean, Symmetry_mean, fractal_dimension_mean, Radius_se, Texture_se, Perimeter_se, Area_se, Smoothness_se, Concavity_se, Compactness_se, Concavepoints_se, Symmetry_se, fractal_dimension_se, Radius_worst, Texture_worst, Perimeter_worst, Area_worst, Smoothness_worst, Concavity_worst, Compactness_worst, Concavepoints_worst, Symmetry_worst, fractal_dimension_worst.

The target is the diagnosis class, which is either benign or malignant breast cancer. The data needs cleaning because it contains null values, and the numeric features must be cast to float. Null entries were to be removed before training. First, the dataset is loaded with the pandas library and stored in a pandas DataFrame. The rows and columns are then counted: there are 569 rows and 32 columns. Next, the columns that contain null values are counted and dropped, because the model cannot process null values.
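The cleaning steps described above can be sketched as follows. This is a minimal illustration on a small made-up frame, not the authors' code: the values and the all-null `unnamed_extra` column are assumptions standing in for the real 569 x 32 dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the values are illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": ["17.99", "13.54", "13.08", "20.57"],  # numbers stored as text
    "texture_mean": [10.38, 14.36, 15.71, 17.77],
    "unnamed_extra": [np.nan, np.nan, np.nan, np.nan],    # column holding only nulls
})

# Count rows and columns.
n_rows, n_cols = df.shape

# Count null values per column and list the affected columns.
null_counts = df.isna().sum()
null_columns = null_counts[null_counts > 0].index.tolist()

# Drop every column that contains at least one null value.
clean = df.dropna(axis=1)

# Force the numeric feature columns to float.
feature_cols = [c for c in clean.columns if c not in ("id", "diagnosis")]
clean = clean.astype({c: float for c in feature_cols})

print(n_rows, n_cols, null_columns, clean.dtypes.to_dict())
```

Dropping with `axis=1` removes whole columns, which matches the description of dropping the columns that contain nulls; `dropna()` without that argument would remove rows instead.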

FIG 3.1: WORKFLOW DIAGRAM

The data are retrieved from the input dataset using the pandas library. Pandas provides the DataFrame.loc[] indexer to retrieve rows from a DataFrame: it selects by index label and returns a row or a DataFrame when the label exists in the calling DataFrame.
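A short sketch of label-based retrieval with `DataFrame.loc[]` is given here. The frame, its values, and the patient ids used as index labels are hypothetical and serve only to illustrate the behaviour.

```python
import pandas as pd

df = pd.DataFrame(
    {"diagnosis": ["M", "B", "B"], "radius_mean": [17.99, 13.54, 13.08]},
    index=[101, 102, 103],  # hypothetical patient ids used as index labels
)

row = df.loc[102]            # single label: one row, returned as a Series
subset = df.loc[[101, 103]]  # list of labels: a DataFrame with those rows

# A label that is not in the index raises KeyError.
try:
    df.loc[999]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(row["diagnosis"], subset.shape, missing_label_raises)
```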

Figure 3.2: Data set of the patient

Figure 3.2 contains the data and attributes taken into consideration for breast cancer detection.

Data visualization is the discipline of understanding data by placing it into visual form, so that patterns, trends, and correlations that might otherwise go undetected in large datasets become visible. It de-emphasizes noise in the data and highlights valuable information. Because visualization makes it easier to detect patterns, trends, and outliers, it is used in this paper in the form of a count plot, a pair plot, and a heat map. The visualization is done with the seaborn library.
The Naive Bayes algorithm is a supervised learning technique founded on Bayes' theorem and used to solve classification problems. It is mainly used for text categorization with high-dimensional training datasets. The Naive Bayes model is one of the simplest and most effective classifiers, and it helps build machine learning models that make predictions quickly. It is a probabilistic classifier, which means that it predicts on the basis of the probability that an item belongs to a class. Spam filtering, sentiment analysis, and content classification are a few well-known applications of the Naive Bayes algorithm. In this work, the Naive Bayes (NB) algorithm was applied to breast cancer detection and achieved an accuracy of about 93%.
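A Naive Bayes classifier of this kind can be sketched with scikit-learn as shown next. Several points are assumptions: the copy of the Wisconsin diagnostic dataset bundled with scikit-learn (569 samples, 30 features) is used as a stand-in for the paper's data, the Gaussian variant is chosen because the features are continuous, and the 75/25 split and random seed are arbitrary. The resulting accuracy is therefore not expected to reproduce the reported figure exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Note: in scikit-learn's copy the target is 0 = malignant, 1 = benign,
# which is the reverse of the M -> 1, B -> 0 encoding described in this paper.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a quarter of the samples for testing, keeping the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the classifier on the labelled training data, then predict unseen cases.
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```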
RESULT AND DISCUSSION

As expected, the model predicts whether the patient's tumour is benign or malignant.

Accuracy Check

Figure 4.1 Accuracy Check of Algorithm

Figure 4.1 shows the model's accuracy measured against the actual breast cancer diagnoses. Checking accuracy is crucial to ensuring that a breast cancer detection system is reliable and effective. Accuracy is measured by comparing the algorithm's results with the actual diagnoses of a set of patients, a process commonly known as validation or testing. The check also involves evaluating the algorithm's sensitivity and specificity. Sensitivity is the algorithm's ability to correctly identify patients with breast cancer, while specificity is its ability to correctly identify patients without breast cancer. High sensitivity together with high specificity indicates a more reliable system, which is essential for early diagnosis and improved patient outcomes.
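The three measures can be computed from the counts of correct and incorrect predictions, as in the following sketch. The label lists are made up for illustration, and treating 1 (malignant) as the positive class is an assumption consistent with the encoding used in this paper.

```python
# Hypothetical labels: 1 = malignant (positive), 0 = benign (negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancer correctly detected
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancer missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy correctly cleared
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm

accuracy = (tp + tn) / len(pairs)
sensitivity = tp / (tp + fn)  # share of actual cancer cases that were found
specificity = tn / (tn + fp)  # share of cancer-free cases that were cleared

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f}")
```

Reporting sensitivity and specificity alongside accuracy matters here because a missed malignant case and a false alarm have very different consequences for the patient.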

Data Encoding

Figure 4.2 Encoding of data

Figure 4.2 shows that the categorical data in the 'diagnosis' column is encoded from M and B to 1 and 0 using LabelEncoder from sklearn.preprocessing.
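A minimal sketch of this encoding step is shown below; the four labels are a made-up sample. `LabelEncoder` assigns integer codes to the classes in sorted order, which is why B maps to 0 and M maps to 1.

```python
from sklearn.preprocessing import LabelEncoder

diagnosis = ["M", "B", "B", "M"]  # hypothetical sample of the 'diagnosis' column

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

print(list(encoder.classes_))  # classes in sorted order
print(encoded.tolist())        # integer code for each label

# The original letters can be recovered from the codes.
decoded = encoder.inverse_transform(encoded)
```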

Data Classification:

Figure 4.3 Count Plot of Benign and Malignant Tumor

Figure 4.3 shows the class distribution of patients diagnosed as malignant and benign. There are 212 patients with a malignant diagnosis, around 37% of the data, and 357 patients with a benign diagnosis, around 63%.
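The class shares follow directly from the two counts, as this small calculation shows. The majority-class baseline at the end is an added illustration, not a result from the paper: it is the accuracy a trivial model would reach by always predicting "benign".

```python
counts = {"malignant": 212, "benign": 357}
total = sum(counts.values())

# Percentage share of each class, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(counts.values()) / total

print(total, shares, round(baseline_accuracy, 3))
```

Any useful classifier on this dataset should therefore score clearly above roughly 63% accuracy.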

Pair Plot:

Figure 4.4 Creating Pair Plot

Figure 4.4 shows the pair plot of all the columns, with the points coloured by diagnosis. The orange points correspond to class 1 and the blue points to class 0. The pair plot shows the distribution of the numeric features as pairwise scatter plots.

Heat Map

Figure 4.5 Function of Heat Map

In Figure 4.5, the light and dark areas of the heat map indicate the strength of the correlation between pairs of features.
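Each cell of such a heat map holds a correlation coefficient between two columns. The paper does not state which coefficient is used; the sketch below assumes the Pearson coefficient, which is the default of pandas `DataFrame.corr`, and uses made-up values to show the two extremes of the scale.

```python
import numpy as np

radius = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # hypothetical values
perimeter = 2 * np.pi * radius                      # rises exactly with radius
inverse = -radius                                   # falls exactly with radius

# np.corrcoef returns a 2 x 2 matrix; the off-diagonal entry is the coefficient.
r_pos = np.corrcoef(radius, perimeter)[0, 1]
r_neg = np.corrcoef(radius, inverse)[0, 1]

print(round(r_pos, 6), round(r_neg, 6))
```

Values near +1 or -1 indicate strongly related features, while values near 0 indicate little linear relationship.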
CONCLUSION

This study analyzes supervised machine learning algorithms with the aim of selecting the most accurate model for breast cancer detection. The work focused on developing predictive models in Python to achieve better accuracy in predicting correct outcomes. The analysis of the results indicates that the combination of data integration, feature scaling, and different classification methods provides successful tools for prediction. It was also observed that the model misdiagnosed a few patients as having cancer when they did not, and vice versa. Although the model is accurate, errors carry a high cost when people's lives are at stake, so further research is needed to build a more accurate and precise model and to bring the accuracy as close to 100% as possible. Tuning each of the models is therefore necessary to build a more reliable system.
REFERENCES

[1] Paweł Filipczuk, Thomas Stevens, Adam Krzyżak and Roman Monczak, "Hierarchical Fused Model With Deep Learning and Type-2 Fuzzy Learning for Breast Cancer Diagnosis," IEEE Transactions on Fuzzy Systems.

[2] Ravi K. Samala, Heang-Ping Chan, Lubomir Hadjiiski, Mark A. Helvie, Caleb D. Richter, and Kenny H. Cha, "Breast Cancer Diagnosis in Digital Breast Tomosynthesis: Effects of Training Sample Size on Multi-Stage Transfer Learning Using Deep Neural Nets," IEEE Transactions on Medical Imaging, vol. 38, no. 3, March 2019.

[3] Yongjin Zhou, Jingxu Xu, Qiegen Liu, Cheng Li, Zaiyi Liu, Meiyun Wang, Hairong Zheng, and Shanshan Wang, "A Radiomics Approach With CNN for Shear-Wave Elastography Breast Tumor Classification," IEEE Transactions on Biomedical Engineering, vol. 65, no. 9, September 2018.

[4] Jun Xu, Lei Xiang, Qingshan Liu, Hannah Gilmore, Jianzhong Wu, Jinghai Tang, and Anant Madabhushi, "Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images," IEEE Transactions on Medical Imaging, vol. 35, no. 1, January 2016.

[5] Michiel Kallenberg, Kersten Petersen, Mads Nielsen, Andrew Y. Ng, Celine M. Vachon, Katharina Holland, Rikke Rass Winkel, Nico Karssemeijer, and Martin Lillholm, "Unsupervised Deep Learning Applied to Breast Density Segmentation and Mammographic Risk Scoring," IEEE Transactions on Medical Imaging, vol. 35, no. 5, May 2016.

[6] Bolei Xu, Jingxin Liu, Xianxu Hou, Bozhi Liu, Jon Garibaldi, Ian O. Ellis, Andy Green, Linlin Shen, and Guoping Qiu, "A Deep Selective Attention Approach to Breast Cancer Classification," IEEE Transactions on Medical Imaging, vol. 39, no. 6, June 2020.

[7] Jingxin Liu, Bolei Xu, Chi Zheng, Yuanhao Gong, Jon Garibaldi, Daniele Soria, Andrew Green, Ian O. Ellis, Wenbin Zou, and Guoping Qiu, "An End-to-End Deep Learning Histochemical Scoring System for Breast Cancer TMA," IEEE Transactions on Medical Imaging, vol. 38, no. 2, February 2019.

[8] Mandeep Rana, Pooja Chandorkar and Alishiba Dsouza, "Breast cancer diagnosis and recurrence prediction using machine learning techniques," International Journal of Research in Engineering and Technology, vol. 04, no. 04, April 2015.

[9] Vikas Chaurasia, BB Tiwari and Saurabh Pal, "Prediction of benign and malignant breast cancer using data mining techniques," Journal of Algorithms and Computational Technology, vol. 12(2), 2018.

[10] D. Dubey, S. Kharya and S. Soni, "Breast cancer detection using machine learning," International Journal of Computer Science and Information Technologies, vol. 8, no. 6, June 2021.