Early Detection of Lung Cancer using Data Mining Techniques: A Survey

G.Vijaya; Research Scholar; Dr.A.Suhasini

doi:10.17577/IJERTCONV1IS06013

ICSEM - 2013 (Volume 1 - Issue 06)


Open Access
Authors : G.Vijaya, Research Scholar, Dr.A.Suhasini
Paper ID : IJERTCONV1IS06013
Volume & Issue : ICSEM – 2013 (Volume 1 – Issue 06)
Published (First Online): 30-07-2018
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


G. Vijaya, Research Scholar (Ph.D.), Annamalai University, Annamalai Nagar.

Dr. A. Suhasini, Associate Prof., Annamalai University, Annamalai Nagar.

Abstract:

Lung cancer is the leading cause of cancer death worldwide for both men and women. Early detection of lung cancer can help in curing the disease completely. In general, early-stage lung cancer diagnosis mainly relies on chest X-ray films, CT scans, MRI scans, biopsy, etc. The present article surveys the role of different data mining paradigms, such as Decision Trees (DT), Artificial Neural Networks (ANN), Association Rule Mining (ARM) and the Bayesian Classifier. The pros and cons of each data mining technique are indicated, and an extensive bibliography is included.

Keywords: Decision Trees, ANN, ARM, Bayesian Classifier.
1. INTRODUCTION
  
  Lung cancer remains a major cause of morbidity and mortality worldwide, accounting for more deaths than any other cancer. According to [1], the estimated new lung cancer cases for 2012 were 116,470 among males and 109,690 among females, and the estimated deaths due to lung cancer were 87,750 among males and 72,590 among females. Tobacco use is responsible for nearly 1 in 5 deaths.
  1. Lung Cancer
    
    Lung cancer is a disease in which abnormal cells multiply and grow into a tumour. Cancer cells can be carried away from the lungs in the blood or in the lymph fluid that surrounds lung tissue.
    
    Metastasis occurs when a cancer leaves the site where it began and moves into a lymph node or to another part of the body through the bloodstream.
  2. Lung Cancer Types
    
    There are 3 types of Lung Cancer. They are:
    1. Non Small Cell Lung Cancer (NSCLC):
      
      According to the American Cancer Society, about 85% to 95% of lung cancers are of this type. It can be subdivided into:
      1. Squamous cell carcinoma: about 25% to 30% of all lung cancers are squamous cell carcinomas. They are often linked to a history of smoking and tend to be found in the middle of the lungs, near a bronchus.
      2. Adenocarcinoma: about 40% of lung cancers are adenocarcinomas. This is the most common type of lung cancer seen in non-smokers. It is more common in women than in men, and it is more likely to occur in younger people than other types of lung cancer. It is usually found in the outer region of the lung. People with this type of lung cancer tend to have a better outlook (prognosis) than those with other types of lung cancer.
      3. Large cell (undifferentiated) carcinoma: this accounts for about 10% to 15% of NSCLC cases. It may appear in any part of the lung. It tends to grow and spread quickly, which can make it harder to treat.
    2. Small Cell Lung Cancer (SCLC):
      
      This type accounts for about 10% to 15% of all lung cancers. It is very rare in people who have never smoked. SCLC often starts in the bronchi near the centre of the chest, and it tends to spread widely through the body.
      
    3. Other types of Lung Cancer:
      - Carcinoid tumours of the lung account for fewer than 5% of lung tumours. Most of them are slow growing and are generally cured by surgery.
      - Cancers that start in other organs (such as the breast, pancreas, kidney or skin) can sometimes spread to the lungs, but these are not lung cancers.
    1.3. Detection of Lung Cancer:
    
    The primary method for cancer detection is radiological imaging. Many technologies are used in the diagnosis of lung cancer, such as chest X-ray, lung CT scan and lung PET scan.
    
    The present article provides an overview of the available literature on the detection of lung cancer in the data mining framework. Section 2 describes the basic notions of Decision Trees and their application to lung cancer. Section 3 follows with a survey explaining the importance of ANN in the detection of lung cancer. The utility and applicability of Association Rule Mining and Bayesian classifiers are briefly discussed in Sections 4 and 5. Section 6 concludes the article. Some challenges in applying data mining to lung cancer are also indicated.
2. DECISION TREE
  
  A Decision Tree [2] is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The topmost node in a tree is the root node.
  
  The example decision tree represents whether patients with lung cancer have a smoking habit or not. The rounded rectangle represents the tuples (attributes), and the oval represents the leaf node (output).
  
  CART (Classification And Regression Tree) is a recursive, gradual-refinement algorithm for building a decision tree that predicts the class of new samples from known input variable values. The algorithm was introduced by Breiman et al. in 1984.
  
  Algorithm:
  
  Step 1: All the rows in a data set are passed to the root node.
  
  Step 2: Based on the values of the rows in the node considered, each predictor variable is split at all of its possible split points.
  
  Step 3: At each split point, the parent node is split into two child nodes by separating the rows with values lower than or equal to the split point from those with values higher than the split point, for the predictor variable considered.
  
  Step 4: The predictor variable and split point with the highest value are selected for the node.
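
The CART split search (Steps 1 to 4 of the algorithm in this survey) can be illustrated with a minimal sketch. The survey does not say which quantity is maximized in Step 4, so the reduction in Gini impurity is used here as an assumption. The function names and the toy rows are hypothetical and carry no clinical meaning.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, feature_index, split_point) for the best binary split.

    Every observed value of every predictor is tried as a split point (Step 2).
    Rows with value <= split point go left, the rest go right (Step 3).
    The split with the largest impurity reduction is kept (Step 4, assumed criterion).
    """
    parent_impurity = gini(labels)
    best = (0.0, None, None)
    for feature in range(len(rows[0])):
        for split_point in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= split_point]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > split_point]
            if not left or not right:
                continue  # a split must send rows to both children
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            gain = parent_impurity - weighted
            if gain > best[0]:
                best = (gain, feature, split_point)
    return best

# Hypothetical toy data: [age, pack_years_smoked] -> label (Step 1: all rows at the root).
rows = [[45, 0], [52, 5], [61, 30], [58, 40], [39, 2], [66, 35]]
labels = ["benign", "benign", "malignant", "malignant", "benign", "malignant"]
gain, feature, split_point = best_split(rows, labels)
print(f"split on feature {feature} at <= {split_point}, impurity reduction {gain:.2f}")
```

A full tree would apply `best_split` recursively to the left and right children until a stopping rule is met; that recursion is omitted to keep the sketch short.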
3. ARTIFICIAL NEURAL NETWORK (ANN)
  
  ANNs are networks of interconnected artificial neurons and are commonly used for non-linear statistical data modelling, to model complex relationships between inputs and outputs. The network includes a hidden layer of multiple artificial neurons connected to the inputs and outputs with different edge weights. The internal edge weights are learnt during the training process using techniques such as backpropagation. Several good descriptions of neural networks are available in [7] and [8].
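
As a rough illustration of how edge weights are learnt by backpropagation, the following sketch trains a network with one hidden layer on a toy problem. The layer size, learning rate, sigmoid activation, squared-error loss and the logical-OR data are all assumptions chosen for brevity; none of them comes from the surveyed systems.

```python
import math
import random

def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def init_weights(n_inputs, n_hidden, seed=0):
    """Small random edge weights; the last weight of each neuron is its bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden + 1)]
    return w_hidden, w_out

def forward(x, w_hidden, w_out):
    """Return (hidden activations, network output) for one input vector."""
    inputs = list(x) + [1.0]  # the constant 1.0 feeds the bias weight
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, inputs))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden + [1.0])))
    return hidden, output

def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, w_hidden, w_out, epochs=2000, lr=0.5):
    """Update the weights in place with per-sample backpropagation."""
    for _ in range(epochs):
        for x, target in data:
            hidden, output = forward(x, w_hidden, w_out)
            # Error term at the output neuron (squared error through the sigmoid).
            delta_out = (output - target) * output * (1.0 - output)
            # Error terms propagated back to each hidden neuron.
            delta_hidden = [delta_out * w_out[j] * h * (1.0 - h)
                            for j, h in enumerate(hidden)]
            for j, h in enumerate(hidden + [1.0]):
                w_out[j] -= lr * delta_out * h
            for j, ws in enumerate(w_hidden):
                for i, v in enumerate(list(x) + [1.0]):
                    ws[i] -= lr * delta_hidden[j] * v

# Hypothetical toy problem: logical OR of two binary inputs.
data = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 1.0)]
w_hidden, w_out = init_weights(n_inputs=2, n_hidden=3)
loss_before = mean_squared_error(data, w_hidden, w_out)
train(data, w_hidden, w_out)
loss_after = mean_squared_error(data, w_hidden, w_out)
print(f"mean squared error before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The hidden-layer error terms are computed before the output weights are updated, so each step uses the gradient of the same loss value.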
  Advantages of Neural Networks include:
  - High tolerance of noisy data.
  - Ability to classify patterns on which they have not been trained.
  - They are well suited for continuous-valued inputs and outputs.
  Challenges in Neural Networks:
  - May not be applicable in some circumstances when the output is not clearly known.
  - ANNs are more complex and robust compared with other regression techniques.
  How to overcome?
  - A reinforcement training model is planned to overcome this problem.
  - Reinforcement learning attempts to learn the input-output mapping through trial and error, with a view to maximizing a performance index.
4. ASSOCIATION RULE MINING
  
  Association Rule Mining (ARM) [15] is often stated as follows: Let I be a set of n binary attributes called items. Let T be a set of transactions. Each transaction in T contains a subset of the items in I. A rule is defined as an implication of the form $X \Rightarrow Y$, where $X, Y \subseteq I$ and $X \cap Y = \emptyset$.
  
  The sets of items X and Y are called the antecedent and the consequent of the rule, respectively. A commonly given example from market basket analysis is the rule $\{\text{pencil}\} \Rightarrow \{\text{eraser}\}$, meaning that a customer who buys a pencil also buys an eraser.
  
  Association rule mining is popularly done with flag attributes, indicating the presence/absence of the item in the transaction. However, even from nominal attributes (having multiple but finite possible values), and numeric attributes, it is possible to derive flag attributes for the purpose of association rule mining.
  
  Association rules are mined in a two-step process consisting of frequent itemset mining followed by rule generation:
  
  Step 1: Search for patterns of attribute-value pairs that occur repeatedly in a data set, where each attribute-value pair is considered an item. The resulting attribute-value pairs form frequent itemsets.
  
  Step 2: Analyze the frequent itemsets in order to generate the association rules. All association rules must satisfy certain criteria regarding their accuracy and the proportion of the data set that they actually represent.
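
The two steps can be sketched as follows. The criteria in Step 2 are taken here to be the usual confidence (accuracy) and support (proportion of the data set) thresholds, which is an assumption about what the survey intends. The frequent itemsets are enumerated by brute force rather than by an optimized algorithm such as Apriori, and the transactions and threshold values are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Step 1: enumerate itemsets whose support reaches min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = support(frozenset(candidate), transactions)
            if s >= min_support:
                frequent[frozenset(candidate)] = s
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return frequent

def generate_rules(frequent, min_confidence):
    """Step 2: derive rules X => Y from each frequent itemset."""
    rules = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent.
                confidence = s / frequent[x]
                if confidence >= min_confidence:
                    rules.append((x, itemset - x, s, confidence))
    return rules

# Hypothetical market-basket transactions.
transactions = [
    frozenset({"pencil", "eraser"}),
    frozenset({"pencil", "eraser", "ruler"}),
    frozenset({"pencil", "notebook"}),
    frozenset({"eraser", "ruler"}),
]
frequent = frequent_itemsets(transactions, min_support=0.5)
rules = generate_rules(frequent, min_confidence=0.6)
for x, y, s, c in rules:
    print(f"{sorted(x)} => {sorted(y)}  support={s:.2f}  confidence={c:.2f}")
```

For nominal or numeric attributes, each attribute-value pair (or value range) would first be turned into a flag item before the transactions are built.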
5. BAYESIAN CLASSIFIER
  
  Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
  
  The probability of each feature value is a member of a set of probabilities and is derived by calculating the frequency of that feature value within each class of the training data set. The training data set is a subset of the data, used to train a classifier algorithm by using known values to predict future or unknown values.
  
  5.1. Naïve Bayesian Classifier
  
  Naïve Bayesian classifiers [17] assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved and, in this sense, is considered naïve.
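
A minimal sketch of a frequency-based naïve Bayesian classifier for categorical attributes is given below. Under class conditional independence, the score of a class is its prior multiplied by the per-attribute probabilities, each estimated from counts in the training data. The add-one (Laplace) smoothing is an added assumption to avoid zero probabilities, and the attribute names, rows and labels are hypothetical and carry no clinical meaning.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class frequencies of each attribute value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    attribute_values = defaultdict(set)  # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            attribute_values[i].add(value)
    return class_counts, value_counts, attribute_values

def predict(row, model):
    """Return (most probable class, log score per class) for one tuple."""
    class_counts, value_counts, attribute_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        log_score = math.log(count / total)  # log prior of the class
        for i, value in enumerate(row):
            # Add-one smoothing (an assumption, not taken from the survey).
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(attribute_values[i])
            log_score += math.log(numerator / denominator)
        scores[label] = log_score
    return max(scores, key=scores.get), scores

# Hypothetical training tuples: (smoker, persistent_cough) -> class label.
rows = [("yes", "yes"), ("yes", "no"), ("no", "no"),
        ("no", "yes"), ("yes", "yes"), ("no", "no")]
labels = ["high", "high", "low", "low", "high", "low"]
model = train_naive_bayes(rows, labels)
prediction, scores = predict(("yes", "yes"), model)
print(prediction, {label: round(math.exp(s), 3) for label, s in scores.items()})
```

Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow when many attributes are involved.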
  
  The algorithm uses Bayes' theorem and assumes all attributes to be independent given the value of the class variable. This conditional independence assumption rarely holds true in real-world applications, hence the characterization as naïve, yet the algorithm tends to perform well and learn rapidly in various supervised classification problems.
  
  This naivety allows the algorithm to easily construct classifications out of large data sets without resorting to complicated iterative parameter estimation schemes.
  
  Issues in Bayesian Classifier
  - Despite its simplicity, the Naïve Bayesian Classifier is a robust method that shows good average performance in terms of classification accuracy, even when the independence assumption does not hold.
6. CONCLUSION

In this survey paper, different lung cancer prediction systems developed using data mining classification techniques are reviewed. The most effective model to predict patients with lung cancer appears to be Naïve Bayes, followed by Association Rule Mining, Decision Trees and Neural Networks. Decision Tree results are easy to read and interpret. Naïve Bayes performs far better than Decision Trees, as it could identify all the significant medical predictors. The relationship between attributes produced by a Neural Network is more difficult to understand.

References:

[1] American Cancer Society. Cancer Facts & Figures 2012.
[2] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
[3] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[4] J. Han and M. Kamber. Data Mining: Concepts and Techniques. San Francisco, Morgan Kaufmann Publishers, 2001.
[5] V. Krishnaiah, Dr. G. Narasimha, Dr. N. Subhash Chandra, Diagnosis of Lung Cancer Prediction System Using Data Mining Classification Techniques, International Journal of Computer Science and Information Technologies, Vol. 4(1), 2013, 39-45.
[6] B. Chandana, Dr. DSVGK Kaladhar, Data Mining, Inference and Prediction of Cancer datasets using Learning Algorithms, International Journal of Science and Advanced Technology, Vol. 1, No. 3, May 2011, 68-77.
[7] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[8] L. Fausett, Fundamentals of Neural Networks, New York, Prentice Hall, 1994.
[9] Neha Sharma, Comparing the Performance of Data Mining Techniques for Oral Cancer Prediction, ICCS 11, Feb 12-14, 2011, Rourkela, Odisha, India, 433-438.
[10] Penedo MG, Carreira MJ, Mosquera A, et al. Computer Aided Diagnosis: A Neural Network-based approach to Lung Nodule Detection, IEEE Trans Med Imag 1998; 17: 872-880.
[11] V.N. Vapnik. The Nature of Statistical Learning Theory, Springer, 1995.
[12] Ankit Agrawal, Sanchit Misra et al., A Lung Cancer Outcome Calculator Using Ensemble Data Mining on SEER Data, BIOKDD 2011, Aug. 2011, San Diego, CA, USA.
[13] M. Gomathi, Dr. P. Thangaraj, An Effective Classification of Benign and Malignant Nodules Using Support Vector Machine, Journal of Global Research in Computer Science, Vol. 3, No. 7, July 2012, 6-9.
[14] M. Gomathi, Dr. P. Thangaraj, A Computer Aided Diagnosis System for Lung Cancer Detection using Machine Learning Technique, European Journal of Scientific Research, Vol. 51, No. 2 (2011), 260-275.
[15] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '93, 1993.
[16] Ankit Agrawal and Alok Choudhary, Identifying HotSpots in Lung Cancer Data Using Association Rule Mining, 11th IEEE International Conference on Data Mining Workshops, 2011, 995-1002.
[17] George Dimitoglou, James A. Adams, and Carol M. Jim, Comparison of the C4.5 and a Naïve Bayes Classifier for the Prediction of Lung Cancer Survivability, 2010, Frederick, USA.
[18] S. Vijayarani & S. Sudha, Disease Prediction in Data Mining Technique A Survey, International Journal of Computer Applications and Information Technology, Vol. II, Issue I, Jan 2013, 17-21.
[19] Muhammad Shahbaz, et al., Cancer Diagnosis Using Data Mining Technology, Life Science Journal, 2012; 9(1), 308-313.
[20] Zakaria Suliman Zubi, Rema Asheibani Saad, Using Some Data Mining Techniques for Early Diagnosis of Lung Cancer, ACM DL, Recent Research in Artificial Intelligence, Knowledge Engineering and Data Bases, 2011, 32-37.
