Detection of Breast Cancer using Machine Learning Techniques

DOI : 10.17577/IJERTV12IS040052


Ragul T N1, Karthikeyan I1, Vaidyanathan K1, and Dr. G R Hemalakshmi2 1Department of Computer Science and Engineering, National Engineering College, Kovilpatti, Tamil Nadu.

2Assistant Professor (Senior Grade), Department of Computer Science, National Engineering College, Kovilpatti, Tamil Nadu, India.

Abstract: Abnormal growth of breast cells causes breast cancer. These cells divide faster than healthy cells and continue to accumulate, forming a lump or mass, and can spread through the breast to the lymph nodes or other body parts. Fortunately, breast cancer is curable in its early stages. It is among the 20 leading causes of death worldwide, affecting approximately 10% of the world's female population. As the number of people with breast cancer increases, effective predictive measures for early diagnosis improve the prognosis and survival of patients. This study helps experts research preventive measures against breast cancer through early diagnosis using machine learning techniques. In this project, supervised learning is used to analyze all features to determine whether a patient has a benign or malignant tumour. The evaluation is performed on a patient dataset containing features such as radius, texture, perimeter, area, and smoothness. Supervised learning is a method in which a machine is trained on data in which inputs and outputs are labelled; a model learns from the training data and processes future data to predict outcomes. Machine learning techniques are therefore of great importance in the early detection of breast cancer, supporting professionals and doctors in detecting the disease early enough to prevent its development.

Keywords: Supervised Learning, predictive analysis


    Breast cancer is a type of cancer that occurs in the breast tissue. It occurs when cells in the tissue grow uncontrollably and form a lump or tumour. Breast cancer is the most common cancer in women and the second most common cancer worldwide. Although breast cancer can occur in both men and women, it is far more common in women. It is estimated that 1 in 8 women in the United States will develop breast cancer in her lifetime.

    There are many types of breast cancer, and treatment varies by type, stage, and other factors. The most common types of breast cancer include ductal carcinoma in situ, invasive ductal carcinoma, and invasive lobular carcinoma. Breast cancer can be detected early with screening tests such as mammography, which can help improve treatment options. Early diagnosis is essential because it improves prognosis and increases the chances of survival.

    Breast cancer is caused by the abnormal growth of breast cells, which divide more rapidly than healthy cells and accumulate to form a lump or mass. The cancer cells may spread to other parts of the body or lymph nodes. Although breast cancer is among the top 20 causes of death worldwide and affects around 10% of women globally, it can be cured in its early stages.

    To improve early diagnosis and increase patient survival rates, experts are exploring preventive measures for breast cancer using machine learning techniques. In this project, supervised learning is used to analyze features such as radius, texture, perimeter, area, and smoothness and determine whether a patient has a benign or malignant tumour. Supervised learning is a method where the machine is trained on labelled input and output data; the model learns from the training data to predict outcomes for future data. Machine learning techniques are therefore critical for early breast cancer detection and support doctors and experts in preventing the disease.


    Tianyu Shen et al. proposed a hierarchical fused model based on deep learning and fuzzy learning for breast cancer diagnosis through lesion segmentation and disease grading. The critical point is to alleviate the drawbacks of deep learning in terms of interpretability, generalization ability, and few-shot learning. The proposed model consists of a pixel-wise segmentation unit based on ResU-segNet, a feature extraction unit based on domain knowledge, and a severity grading classification unit based on the IT2PFCM-fused feedforward neural network. Feature representation and rule-based learning integrated with domain knowledge ensure the interpretability of the system. Both segmentation and disease grading performance in a few-shot learning setting are improved, and cross-dataset experiments demonstrate the improvement in generalization ability.

    Ravi K. Samala et al. demonstrate that multi-stage transfer learning can utilize knowledge gained through source tasks from both related and unrelated domains. They show that limited data availability in a target domain can be alleviated by pre-training the Convolutional Neural Network with data from similar auxiliary domains, and that the gain in Convolutional Neural Network performance from the additional stage of fine-tuning with auxiliary data depends on the relative sizes of the available training samples in the target and auxiliary domains and on the proper selection of the transfer learning strategy. Furthermore, when the training sample size is small, the variance in the performance of the trained Convolutional Neural Network is significant, and reporting the best performance found through exhaustive searches over a "test" set can be overly optimistic. It is therefore essential to validate the generalizability of the trained Convolutional Neural Network on unseen independent cases.

    Yongjin Zhou et al. proposed that the most significant contribution of their work is a Convolutional Neural Network-based radiomics technique on shear-wave elastography for breast cancer diagnosis. It is the first attempt to use Convolutional Neural Network-based radiomics to automatically extract high-throughput features from shear-wave elastography to classify malignant and benign breast tumours. Another significant contribution is that segmenting the tumour is unnecessary, which saves a great deal of work and improves the classification model's performance. Because this method needs neither prior segmentation nor manual feature extraction, it has excellent potential for application in Computer-Aided Diagnosis systems. The classification model is extendable and flexible, and can be retrained to generalize to new datasets.

    Jun Xu et al. proposed a Stacked Sparse Autoencoder framework for automated nuclei detection in breast cancer histopathology. The Stacked Sparse Autoencoder model can capture high-level feature representations of pixel intensity in an unsupervised manner. These high-level features minimize the discrepancy between the input and its reconstruction by learning encoder and decoder networks, which yields a set of weights and biases.

    Michiel Kallenberg et al. proposed a technique that builds a feature hierarchy from raw data. Breast density segmentation and grading of mammographic texture are two distinct tasks that can be addressed when the learnt features are applied as the input to a straightforward classifier. The suggested model picks up features at various scales, and a novel sparsity regularizer that considers both lifetime and population sparsity is introduced to manage the model's capacity. The approach was additionally examined on three different clinical datasets. The results demonstrate that the learnt breast density scores correlate with manual ones and that the learnt texture scores are predictive of breast cancer. The model is straightforward to use and generalizes to a wide range of other segmentation and scoring problems.

    Bolei Xu et al. present a brand-new deep hybrid attention network to classify breast cancer histological images. The network's attention mechanism can automatically identify the valuable parts of the images in the BreakHis dataset; as a result, the network does not need to resize the image, preventing information loss. Compared with the previous partially observable Markov decision process-based approach, the framework's selection mechanism cuts training time in half. The methodology is tested on a publicly available dataset, where it achieves about 98% accuracy at four different magnifications.

    Jingxin Liu et al. proposed a method that mimics the decision-making process of pathologists. A first fully convolutional network extracts every nucleus region (tumour and non-tumour), and a second fully convolutional network extracts the tumour nuclei region. A multi-column convolutional neural network then takes the outputs of the first two networks, together with an image describing the staining intensity, as input, and serves as a high-level decision-making mechanism that directly outputs the H-Score of the input tissue microarray image. This is the first end-to-end system that takes a tissue microarray image as input and directly produces a clinical score. Experimental findings show that the H-Scores predicted by the model have a robust and statistically significant correlation with the scores of experienced pathologists, and that the discrepancy between the algorithm's H-Scores and the pathologists' is comparable to the inter-subject discrepancy between the pathologists themselves.

    Mandeep Rana et al. proposed that the performance of each algorithm varies based on the dataset and parameter choices. The K-Nearest Neighbor methodology produced the best outcomes overall, and Naive Bayes and logistic regression also showed promising results in diagnosing breast cancer. As previously stated, the Support Vector Machine is a powerful technique for predictive analysis. In light of these findings, they conclude that a Support Vector Machine with a Gaussian kernel is the best technique for predicting whether breast cancer will recur.

    Vikas Chaurasia et al. applied three breast cancer survival prediction models to two classes of patients: those with benign and those with malignant cancer. The Naive Bayes, Radial Basis Function (RBF) Network, and J48 data mining techniques were employed, on a dataset obtained from the University of California Irvine Machine Learning Repository. Data selection, preprocessing, and transformation were used to create the prediction models. In this study, survivability was represented by a binary categorical survival variable computed from the variables in the raw dataset, with one value for benign and another for malignant. A 10-fold cross-validation process was employed to assess the three methods' unbiased prediction performance: the dataset was divided into 10 mutually exclusive partitions using a stratified sampling technique, and this procedure was repeated for each of the three prediction models, giving a less biased way to compare their prediction performance. The results showed that J48 came in third with a classification accuracy of 93.41%, the RBF Network came second with 96.77%, and Naive Bayes did best with 97.36%. To better understand the relative contributions of the independent variables to predicting survivability, sensitivity and specificity analyses were also performed on Naive Bayes, the RBF Network, and J48 in addition to the prediction models. According to the sensitivity data, the prognostic factor Class is the most significant predictor.

    Vanlalhmangaihsanga et al. observed that, in contrast to the other models, the K-Nearest Neighbor and Logistic Regression classifiers had comparatively poor accuracy during training. The Support Vector Machine performed significantly better on classification errors, including correctly and incorrectly classified instances, than the Decision Tree and Random Forest classifiers, which had an accuracy of 75%. After training and assessing the algorithms on the training dataset, the model was deployed on the test dataset. Surprisingly, the top-performing classifiers on the test set were Logistic Regression and Random Forest at 97% accuracy, compared to 57% for the Support Vector Machine. Logistic Regression and the Random Forest classifier are on an equal footing when looking at accuracy alone, but when efficiency, sensitivity, and specificity are considered, the Random Forest classifier performs significantly better.


    The data consists of 569 patients and 32 characteristics. These characteristics form 32 columns in the dataset. The features are Id, Diagnosis, Radius_mean, Texture_mean, Perimeter_mean, Area_mean, Smoothness_mean, Concavity_mean, Compactness_mean, Concavepoints_mean, Symmetry_mean, Fractal_dimension_mean, Radius_se, Texture_se, Perimeter_se, Area_se, Smoothness_se, Concavity_se, Compactness_se, Concavepoints_se, Symmetry_se, Fractal_dimension_se, Radius_worst, Texture_worst, Perimeter_worst, Area_worst, Smoothness_worst, Concavity_worst, Compactness_worst, Concavepoints_worst, Symmetry_worst, Fractal_dimension_worst.

    The target is the classification, which is either benign or malignant breast cancer. The data needs cleaning because it contains null values, and the numeric features must be cast to float; all rows containing nulls are removed. First, the dataset is fetched using the pandas library and stored in a pandas DataFrame. Counting the rows and columns gives 569 rows and 32 columns. Because the model cannot process null values, the columns containing null values are counted and then dropped.
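The loading and cleaning steps above can be sketched as follows. The paper reads its data from a CSV file whose name is not given, so this sketch uses scikit-learn's bundled copy of the same Wisconsin Diagnostic Breast Cancer (WDBC) data; the 'Unnamed: 32' column mentioned in the comment is an assumption about the common CSV form of this dataset.

```python
from sklearn.datasets import load_breast_cancer

# The paper loads its CSV with pandas; the exact filename is not given,
# so this sketch uses scikit-learn's bundled copy of the same WDBC data.
data = load_breast_cancer(as_frame=True)
df = data.frame  # 569 rows, 30 numeric feature columns plus 'target'

# Drop any all-null columns (the CSV form of this dataset often carries
# an empty trailing 'Unnamed: 32' column), then any rows with nulls.
df = df.dropna(axis=1, how="all").dropna(axis=0)

print(df.shape)  # (569, 31)
```

Since the bundled copy has no null values, the shape is unchanged here; on the raw CSV, the `dropna(axis=1, how="all")` call is what removes the empty trailing column.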


    The data are retrieved from the input dataset using the pandas library. Pandas provides a method, DataFrame.loc[], that takes index labels and returns a row or DataFrame if the index label exists in the caller DataFrame.

    Figure 3.2 Data set of the patient

    Figure 3.2 contains the data and attributes taken into consideration for the detection of breast cancer. Data visualization is the discipline of understanding data by placing it into visual form to interactively and efficiently convey insights, so that the patterns, trends, and correlations in large data sets that might not otherwise be detected can be visualized. It removes the noise from the data and highlights valuable information. As visualization makes it easier to detect patterns, trends, and outliers and provides precise, better, and more reliable results, it is implemented in this paper by creating a count plot, a pair plot, and a heat map. In this work, data visualization is done with the help of the seaborn library.

      1. Support Vector Machine (SVM)

        Support vector machine is a modern, high-speed machine learning algorithm for solving multiclass classification problems on large datasets, based on a simple iterative approach. The SVM model is created in time linear in the dataset's size, and SVM can be used on high-dimensional datasets in both sparse and dense formats. A Support Vector Machine is a supervised classification algorithm: it uses the kernel trick to solve classification problems, and based on these transformations, the optimal boundary between the possible outputs is found. SVM supports nonlinear kernels such as RBF, while for linear problems a linear kernel is the appropriate choice. This algorithm gave an accuracy of about 95.12%.
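A minimal sketch of an RBF-kernel SVM on the WDBC data, in the spirit of the paragraph above. The paper does not state its train/test split or hyperparameters, so the split ratio, random seed, and `C` value here are assumptions, and the resulting accuracy will differ slightly from the 95.12% reported.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the WDBC features and benign/malignant labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# SVMs are sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", C=1.0)
clf.fit(scaler.transform(X_train), y_train)

acc = accuracy_score(y_test, clf.predict(scaler.transform(X_test)))
print(f"SVM accuracy: {acc:.4f}")
```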

      2. Recurrent Neural Network (RNN)

        A recurrent neural network (RNN) is an artificial neural network that works on sequential or time series data. Well-known applications like Siri, voice search, and Google Translate incorporate these deep learning algorithms. They are frequently employed for ordinal or temporal problems in language translation, natural language processing (NLP), speech recognition, and image captioning. Recurrent neural networks learn from training data just as feedforward and convolutional neural networks (CNNs) do. They stand out because of their "memory", which allows information from previous inputs to influence the current input and output. While typical deep neural networks assume that inputs and outputs are independent, a recurrent neural network's output depends on the preceding elements of the sequence. Unidirectional recurrent neural networks cannot take future events into account in their predictions, even though those events would also contribute to the output for a particular sequence.

      3. Convolutional Neural Networks (CNNs)

        CNN is an artificial neural network widely used for image/object recognition and classification. Deep Learning thus recognizes objects in an image by using a CNN. CNNs play a significant role in diverse tasks/functions like image processing problems, computer vision tasks like localization and segmentation, video analysis, recognizing obstacles in self-driving cars, and speech recognition in natural language processing. As CNNs play a significant role in these fast-growing and emerging areas, they are trendy in Deep Learning.

        CNN is a different kind of neural network that can find important information in time series and visual data. It is valuable for image-related tasks like image identification, object categorization, and pattern recognition. A CNN uses concepts from linear algebra, such as matrix multiplication, to find patterns in a picture. CNNs may also categorize signal and audio data. The structure of a CNN is comparable to the connection structure of the human brain. Like the brain has billions of neurons, CNNs also have neurons, but structured differently: the neurons in a CNN are designed to resemble the part of the brain that processes visual inputs.

        This configuration ensures that the entire visual field is covered, eliminating the piecemeal image processing of standard neural networks. Compared to older networks, a CNN performs better with image, speech, or audio signal inputs. A convolutional layer, a pooling layer, and a fully connected (FC) layer are the three layers that make up a CNN. The first layer is the convolutional layer, while the final layer is the FC layer, and the complexity of the CNN grows from the convolutional layer to the FC layer. This increasing complexity allows the CNN to recognize progressively larger portions and more intricate features of an image until it finally identifies the object.

      4. Naïve Bayes (NB)

    The Bayes theorem is the foundation of the Naive Bayes algorithm, a supervised learning technique employed to solve classification problems. It is mainly utilized for text categorization on high-dimensional training datasets. The Naive Bayes model is one of the simplest and most effective classifiers, and it helps build fast machine-learning models with rapid prediction capabilities. It is a probabilistic classifier, which means it makes its predictions based on the probability that an object belongs to a class. Spam filtering, sentiment analysis, and article classification are a few well-known applications of the Naive Bayes algorithm. In this work, the Naive Bayes (NB) algorithm demonstrated an accuracy of 93% for breast cancer detection.
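A minimal Gaussian Naive Bayes sketch on the same WDBC data. The 93% figure above depends on the exact split used, which the paper does not state, so the split ratio and seed here are assumptions and the measured accuracy will vary around that figure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the WDBC features and benign/malignant labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# GaussianNB models each feature as class-conditionally Gaussian and
# applies Bayes' theorem with a conditional-independence assumption.
nb = GaussianNB().fit(X_train, y_train)
acc = accuracy_score(y_test, nb.predict(X_test))
print(f"Naive Bayes accuracy: {acc:.4f}")
```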


    As expected, the model predicts whether the patient has a benign or malignant tumour.

    Accuracy Check

    Figure 4.1 Accuracy Check of Algorithm

    The above Figure 4.1 shows the model's accuracy analyzed against the actual values of the breast cancer diagnoses. The accuracy check of an algorithm that detects breast cancer is crucial in ensuring the reliability and effectiveness of the system. The algorithm's accuracy is measured by comparing its results with the actual diagnoses of a set of patients; this process is commonly known as validation or testing. The accuracy check of a breast cancer detection algorithm involves evaluating the algorithm's sensitivity and specificity. Sensitivity refers to the algorithm's ability to correctly identify patients with breast cancer, while specificity refers to its ability to correctly identify patients without breast cancer. High sensitivity and specificity indicate a more reliable system for detecting breast cancer accurately, so an accurate algorithm with both is essential for early diagnosis and improved patient outcomes.
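The sensitivity and specificity described above can be computed directly from a confusion matrix. The labels below are hypothetical, for illustration only (1 = malignant, 0 = benign).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions; 1 = malignant, 0 = benign.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # true-positive rate: cancers caught
specificity = tn / (tn + fp)  # true-negative rate: healthy cleared
print(sensitivity, specificity)  # 0.75 0.75
```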

    Data Encoding

    Figure 4.2 Encoding of data

    Figure 4.2 shows that the categorical data in the 'diagnosis' column is encoded/transformed from M and B to 1 and 0 using LabelEncoder from sklearn.preprocessing.
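The encoding step can be sketched as below; LabelEncoder sorts the classes alphabetically, so B maps to 0 and M to 1, matching the transformation described for Figure 4.2. The sample label list is illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# A few sample diagnosis labels (illustrative only).
labels = ["M", "B", "B", "M", "B"]

# LabelEncoder assigns integers in sorted class order: B -> 0, M -> 1.
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(list(encoder.classes_))  # ['B', 'M']
print(list(encoded))           # [1, 0, 0, 1, 0]
```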

    Data Classification:

    Figure 4.3 Count Plot of Benign and Malignant Tumor

    The above Figure 4.3 shows the plot representing the class distribution of diagnosed malignant and benign patients: 212 patients diagnosed with a malignant tumour, i.e., around 38% of the data, and 357 patients diagnosed with a benign tumour, i.e., 62%.
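The class counts behind Figure 4.3 can be reproduced from the dataset itself (the paper draws the actual plot with seaborn's countplot). Note that scikit-learn's bundled copy of this dataset encodes 0 = malignant and 1 = benign.

```python
from sklearn.datasets import load_breast_cancer

# Load the WDBC data as a DataFrame.
frame = load_breast_cancer(as_frame=True).frame

# In scikit-learn's encoding of this dataset, 0 = malignant, 1 = benign.
counts = frame["target"].value_counts()
malignant, benign = int(counts.loc[0]), int(counts.loc[1])

print(malignant, benign)                              # 212 357
print(round(100 * malignant / (malignant + benign)))  # 38
```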

    Pair Plot:

    Figure 4.4 Creating Pair Plot

    The above Figure 4.4 represents the pair plot of all the columns, highlighting the diagnosis points: the orange points are for 1, and the blue points are for 0. The pair plot is used to show the numeric distributions in scatter plots.

    Heat Map

    Figure 4.5 Function of Heat Map

    In the above Figure 4.5, the focus is on the light and dark areas, which show the strength of the correlation between features.
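The heat map in Figure 4.5 visualizes the feature correlation matrix, which can be computed with pandas; seaborn's heatmap() would render `corr` directly. The specific feature pair printed below is chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer

# Load the WDBC data and compute the pairwise correlation matrix that
# the heat map renders; light/dark cells mark strong correlations.
df = load_breast_cancer(as_frame=True).frame
corr = df.corr()

# For example, the mean radius and mean perimeter of a tumour are
# almost perfectly correlated, so their cell sits at one extreme.
print(round(corr.loc["mean radius", "mean perimeter"], 3))
```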


      This study attempts to analyze various supervised machine-learning algorithms and select the most accurate model for breast cancer detection. The work focused on advancing predictive models with the help of Python to achieve better accuracy in predicting correct outcomes. The analysis of the results signifies that the integration of data, feature scaling, and different classification methods provides markedly successful tools for prediction. It has also been observed that the model misdiagnosed a few patients as having cancer when they did not, and vice versa. Since the model deals with people's lives, further research into building the most accurate and precise model must be carried out to improve the performance of the classification techniques and to bring the accuracy as close to 100% as possible. Thus, tuning each of the models is necessary to build a more reliable model.


[1] Paweł Filipczuk, Thomas Stevens, Adam Krzyżak and Roman Monczak "Hierarchical Fused Model With Deep Learning and Type-2 Fuzzy Learning for Breast Cancer Diagnosis" IEEE Transactions on Fuzzy Systems,

[2] Ravi K. Samala, Heang-Ping Chan, Lubomir Hadjiiski, Mark A. Helvie, Caleb D. Richter, and Kenny H. Cha "Breast Cancer Diagnosis in Digital Breast Tomosynthesis: Effects of Training Sample Size on Multi-Stage Transfer Learning Using Deep Neural Nets" IEEE Transactions on Medical Imaging, vol. 38, no. 3, March 2019.

[3] Yongjin Zhou, Jingxu Xu, Qiegen Liu, Cheng Li, Zaiyi Liu, Meiyun Wang, Hairong Zheng, and Shanshan Wang "A Radiomics Approach With CNN for Shear-Wave Elastography Breast Tumor Classification" IEEE Transactions on Biomedical Engineering, vol. 65, no. 9, September 2018.

[4] Jun Xu, Lei Xiang, Qingshan Liu, Hannah Gilmore, Jianzhong Wu, Jinghai Tang, and Anant Madabhushi "Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images" IEEE Transactions on Medical Imaging, vol. 35, no. 1, January 2016.

[5] Michiel Kallenberg, Kersten Petersen, Mads Nielsen, Andrew Y. Ng, Celine M. Vachon, Katharina Holland, Rikke Rass Winkel, Nico Karssemeijer, and Martin Lillholm "Unsupervised Deep Learning Applied to Breast Density Segmentation and Mammographic Risk Scoring" IEEE Transactions on Medical Imaging, vol. 35, no. 5, May 2016.

[6] Bolei Xu, Jingxin Liu, Xianxu Hou, Bozhi Liu, Jon Garibaldi, Ian O. Ellis, Andy Green, Linlin Shen, and Guoping Qiu "A Deep Selective Attention Approach to Breast Cancer Classification" IEEE Transactions on Medical Imaging, vol. 39, no. 6, June 2020.

[7] Jingxin Liu, Bolei Xu, Chi Zheng, Yuanhao Gong, Jon Garibaldi, Daniele Soria, Andrew Green, Ian O. Ellis, Wenbin Zou, and Guoping Qiu "An End-to-End Deep Learning Histochemical Scoring System for Breast Cancer TMA" IEEE Transactions on Medical Imaging, vol. 38, no. 2, February 2019.

[8] Mandeep Rana, Pooja Chandorkar and Alishiba Dsouza "Breast cancer diagnosis and recurrence prediction using machine learning techniques" International Journal of Research in Engineering and Technology Volume: 04 Issue: 04 | Apr-2015.

[9] Vikas Chaurasia, BB Tiwari and Saurabh Pal "Prediction of benign and malignant breast cancer using data mining techniques" Journal of Algorithms and Computational Technology, vol. 12(2), 2018.

[10] D. Dubey, S. Kharya and S. Soni "Breast cancer detection using machine learning" International Journal of Computer Science and Information Technologies, vol. 8, no. 6, June 2021.