Unconscious Oral Cancer Prediction using Supervised Learning

Aarti Nayak; Samir Pol; Swaraj Singh; Anagha Patil

doi:10.5281/zenodo.18629561

Volume 10, Issue 03 (March 2021)

Open Access
Paper ID : IJERTV10IS030297
Published (First Online): 06-04-2021
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Aarti Nayak

Dept. of Information Technology Vidyavardhinis College of Engineering and Technology Vasai, Mumbai

Samir Pol

Dept. of Information Technology Vidyavardhinis College of Engineering and Technology Vasai, Mumbai

Swaraaj Singh

Dept. of Information Technology Vidyavardhinis College of Engineering and Technology Vasai, Mumbai

Prof. Anagha Patil

Dept. of Information Technology Vidyavardhinis College of Engineering and Technology Vasai, Mumbai

Abstract: Cancer has long been regarded as an intimidating and often incurable disease. However, owing to tremendous advances in technology, it can be treated far more effectively, provided it is detected at the earliest possible stage. Oral cancer, specifically, is the uncontrolled increase in the number of cells in the oral cavity, which in turn damages the surrounding and neighbouring cells. In spite of the availability of highly advanced radiation therapy and chemotherapy, the prevailing death rate is high and increasing. An early prediction might help to curb this problem. To provide a substantial solution, we propose a comparative analysis of supervised machine learning techniques, using accuracy and time complexity as the criteria, to design an effective model on the considered patient dataset. The model helps to predict unconscious (as yet unnoticed) cancer in a user, so that he/she can pursue the appropriate line of treatment and make the suggested lifestyle changes. The aim of this paper is to act as a detailed guide for anyone developing a system along similar lines.

Index Terms: Supervised Learning, Machine Learning, Oral Cancer, Prediction Model, accuracy, time complexity.

INTRODUCTION

Cancer is one of the deadliest diseases, claiming millions of lives every year all around the globe. Oral cancer is one sub-type that is mostly triggered by a few careless day-to-day activities over which we human beings have control. Unfortunately, oral cancer accounts for approximately 200,000 deaths worldwide and 46,000 deaths in India every year. Unlike many other types, oral cancer is visible in its earlier stages on the surface of the mouth in the form of blisters, soft white spots, swollen and extremely red surfaces, difficulty in swallowing, excruciating pain in the throat and mouth region, and so on. Good dental or oral care is important for maintaining healthy teeth, gums and tongue. Oral problems, including bad breath, dry mouth, canker or cold sores, TMD, tooth decay, or thrush, are all treatable with proper diagnosis and care. Oral cancer can affect any area of the oral cavity, including the lips, gum tissues, tongue, cheek lining and the hard and soft palate.

Fig. 1. Affected areas in oral cancer

The following are the symptoms of oral cancer to which users need to pay attention:
1. A sore or blister in your mouth or on your lip that does not heal after two weeks.
2. Lesion on the tongue or tonsil.
3. White and red patches in the mouth or on the lips that do not heal.
4. Bleeding from the mouth that is unrelated to an injury.
5. Change in the way teeth fit together, including how dentures fit, or loose teeth because of jaw swelling or pain.
6. Difficulty swallowing, chewing, speaking, or moving the tongue.
If the user experiences any of these symptoms, which could easily be mistaken for ordinary discomfort, computer-trained models can help the user to verify the probability of actually having cancer and to begin an effective course of treatment at the earliest.

According to research performed by multiple health organizations, the factors contributing to the occurrence of oral cancer are as follows:
1. Consumption of alcohol
2. Consumption of tobacco
3. Gender
4. Age
5. Poor nutrition
6. Immunity deficiencies
7. Viral infections
Owing to the factors stated above, the user may have a larger probability of succumbing to oral cancer. It is therefore also important to apprise the user of the few small, basic changes he/she could implement in order to have better overall oral health.

Machine Learning: It is a methodology that involves designing a model that continues to learn and improve itself without any human intervention. There are 3 types of machine learning techniques:
1. Supervised Learning: Here, the dataset is structured and labelled, so that there exists a fixed set of outputs for a fixed set of inputs.
2. Unsupervised Learning: Here, the available data is not structured or labelled; instead, the output is discovered through patterns in the data.
3. Reinforcement Learning: A computer program interacts with a dynamic environment in which it must achieve a certain goal.
LITERATURE REVIEW

Arushi Tetarbe and Tanupriya Choudhary use WEKA Knowledge Explorer, a user-friendly GUI that harnesses the features of the WEKA software. They use two WEKA interfaces: the Explorer interface, which derives statistical knowledge and inference, and the Experimenter interface, which analyzes the data efficiently by using training and test sets. J48, Random Tree, Naive Bayes, and REP Tree are some of the algorithms used [1]. A survey of machine learning-based approaches was explored to understand the basic application of machine learning in biomedical research; it covers different algorithms and a few observations regarding cancer types and attributes [2].

Madhura V, Meghana Nagaraju and their co-authors survey different reports to study oral cancer detection using machine learning [3]. They then use classification rules for prediction and association rules for showing the co-dependence amongst the attributes. Their work uses the apriori algorithm to select the frequent itemsets and form the association rules using a bottom-up approach, i.e. a breadth-first search and a hash to count the items efficiently [4]. The deep learning survival prediction report shows different approaches using the Cox Proportional Hazard regression method and the Random Survival Forest method [5]. Sandhya N. Dhage primarily provides information regarding the various techniques present in the domain of machine learning, dividing the approaches into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [6]. The review also explains machine learning algorithms including the Decision Tree Algorithm, Artificial Neural Network Algorithms, Dimensionality Reduction Algorithms, etc. Machine learning applications are classified based on learning styles such as supervised and unsupervised learning. K. Lalithamani and A. Punitha use a Deep Neural Based Adaptive Fuzzy System (DNAFS) for accurate results in data mining techniques [7]. The process begins with data processing and clustering using Fuzzy C-Means. The DNAFS design is given, and evaluation metrics such as precision and accuracy are used to assess the results. Lavanya L and Dr. Chandra J conducted an oral cancer analysis using machine learning techniques and explored the prediction of the cancer stage of a person [8]. They found that the decision tree and random tree algorithms give higher accuracy. This analysis is very useful for finding out the effect of cancer on the human body. More analytical points for oral cancer are understood from another paper [9], which uses Updateable Naive Bayes (NB), Multilayer Perceptron (MLP), SMO-Polykernel (E-1.0) (SVM), and the K-Nearest Neighbors classifier (lazy.IBk) for analysis. Shikha Agrawal and Jitendra Agrawal explain more about neural networks in their survey [10].

Mrs. R. Vidhu and Mrs. S. Kiruthika use a combination of a genetic algorithm and the apriori algorithm as a new feature selection method for better results [11]. Their method uses association rule mining, which is applied to search for hidden relationships among the attributes. A soft computing technique was required for better prediction and understandability of oral cancer at an earlier stage. Zahraa Naser Shah Weli conducted a review of the last ten years to understand machine learning techniques and their success in the prediction of different types of cancer [12]. Konstantina Kourou, George Rigas and their co-authors present the idea of dynamic Bayesian networks for the prediction of oral cancer [13]. N. Anitha and K. Jamberi propose a more accurate algorithm for the diagnosis and prognosis of oral cancer using a classification algorithm [14]. Sharma and Om give a detailed account of prediction in terms of age, gender, and socioeconomic status [15]. However, our detailed study of the available material showed that these works fell short in evaluating the available methods against one another, and did not explain how one can build an entire model or system to perform such predictive analysis.
PROPOSED METHODOLOGY

The system we intend to build follows a very simple procedure: it takes a few simple inputs from the user based on his/her day-to-day activities, lifestyle, personal physical details and so on. All of the collected data shall then be evaluated by the trained model and the appropriate results generated. The result shall depict the degree of severity of the cancer the user is prone to, together with the lifestyle and personal physical attributes causing that severity, so that the user can make changes accordingly and pursue the correct line of treatment. A suitable machine learning algorithm shall be applied to the gathered attribute values to find the relevant patterns.

Fig. 2. Flowchart indicating the process model

The flow of the procedure to be followed in order to build such a prediction model is given above. According to the chart, after the input is taken, classification is performed using the most suitable algorithm, subsequently leading to feature selection. Furthermore, the data is separated into training and testing parts, wherein a suitable machine learning algorithm is applied to discover and retrieve useful patterns, eventually leading to the construction of a classification model to which we can supply the acquired inputs. [7]

Fig. 3. Usecase diagram of the application

Given above is the usecase diagram for the system approach we have designed. The usecase diagram tells us how the system shall work. The steps are as follows:

Step 1: If the user has already registered with the system, he/she can log in with valid credentials and then carry on with the relevant objectives.

Step 2: If the person is not already registered, he/she can create an account, sign in and then perform the various objectives.

Step 3: The user can view his/her profile, which shows different analytics of the oral health factors that might be the cause of the disease, as well as the actual chances of getting oral cancer.

RESULTS AND DISCUSSIONS

Given below are the steps that we have performed and executed as per the flow chart.

Existing Database

The data set obtained for this project consists of 20 attributes of two types: the day-to-day lifestyle-related attributes, such as food intake and other personal habits, and the body-related attributes. The combination of these attributes, and the pattern in which they vary, helps us to produce the analytics we intend to.

Here, the attribute Level is nominal in nature; it is the determining column, which contributes to deriving the probability of the patient acquiring cancer. The rest of the attributes are numeric in nature, with their values ranging over a particular domain according to their nature.

The detailed description of the dataset is as follows:
1. Patient ID: This attribute uniquely identifies the user and helps the application to retrieve the records of that particular user. It is an auto-generated feature.
2. Age: The age of the user has little or no direct role in determining the probability of cancer; however, age can have an indirect effect on the other values of the user. Therefore, age is taken as one of the input attributes.
3. Gender: The gender of the user is taken as input. We have included two choices: male or female.
4. Tobacco: This is a lifestyle-based input, taken on the basis of the user's tobacco consumption frequency. The scale ranges from 1 to 7, representing the number of days in a week on which the user consumes tobacco: 1 indicates consumption once a week and 7 indicates consumption on all 7 days of the week.
5. Alcohol-consumption: This is again a lifestyle-based input, taken on the basis of the user's alcohol consumption frequency. The scale ranges from 1 to 7, representing the number of days in a week on which the user consumes alcohol: 1 indicates consumption once a week and 7 indicates consumption on all 7 days of the week.
6. Viral-Infection: This attribute is health based and indicates how vulnerable the user is to viral infections. The user has 4 options to choose from, namely none, rare, frequent and extreme. These levels are decided by how frequently the user contracts viral infections.
7. Swollen-Tonsil: The user chooses between two options, yes or no, indicating whether or not he/she is experiencing any discomfort from swollen tonsils.
8. Genetic-Risk: The user evaluates the genetic risk by choosing from three options that indicate the history of cancer in the user's previous generations.
9. Bleeding-Mouth: This is a symptom attribute that indicates the severity of the bleeding experienced by the user. The three expected inputs are spotting, moderate and extreme.
10. Balanced-Diet: This is a lifestyle-based input, taken on the basis of how frequently the user consumes healthy, balanced food. The scale ranges from 1 to 7, representing the number of days in a week on which the user consumes a healthy diet: 1 indicates once a week and 7 indicates all 7 days of the week.
11. Obesity: This attribute is calculated on the basis of the user's BMI. Depending upon height and weight, users are divided into 7 categories, with 1 being the lowest BMI and 7 the highest.
12. Smoking: This is a lifestyle-based input, taken on the basis of the user's smoking frequency. The scale ranges from 1 to 7, representing the number of days in a week on which the user smokes: 1 indicates smoking once a week and 7 indicates smoking on all 7 days of the week.
13. Passive-Smoker: The user chooses between two options, yes or no, indicating whether or not he/she is exposed to any kind of passive smoking.
14. Red-Spots: This attribute is health based and indicates how dense the spotting faced by the user is. The user has 4 options to choose from, namely none, rare, frequent and extreme. These levels are decided by the density of the red spots.
15. Coughing-Blood: This is again a health-based attribute that indicates how frequently the user experiences blood in the cough. The user has 4 options to choose from, namely none, rare, frequent and extreme.
16. Fatigue: This attribute indicates the extent of fatigue the user has experienced. It can be none, regular, or fatal and unexplained.
17. Weight Loss: The user chooses between two options, yes or no, indicating whether or not he/she is experiencing any unusual weight loss.
18. Swallowing-Difficulty: This attribute is health based and indicates how difficult the user finds it to swallow. The user has 4 options to choose from, namely none, rare, frequent and extreme. These levels are decided by the difficulty of swallowing.
19. Dry-Cough: The user chooses between two options, yes or no, indicating whether or not he/she is experiencing any dry cough.
20. Level: This is the final column, on which the prediction model is built. It is not taken as an input from the user. It has 3 values, namely Low, Medium and High. These levels are decided on the basis of the pattern observed in the rest of the attributes mentioned above.
Data – Pre-processing

This step involves cleaning the available data of discrepancies, scaling it and making it ready for building the model. Some values in the available data may be missing or inherently incorrect. Such data disrupts the functioning of the model and reduces its accuracy, so eliminating it is essential, and that is what this step achieves. For the dataset described above, we carried out the following pre-processing steps:
1. Null and missing data: For a few attributes, some tuples had null or missing values. To deal with this, we filled each missing entry with the mean of that attribute's values in the other tuples.
2. Encoding the columns: To deal with the categorical data, we used a label encoder to convert each category into a corresponding numeric label, which facilitates easier training of the model.
3. Scaling the attributes: A few columns differed hugely in their range of values, which could cause an issue in training the model; therefore, we scaled these attributes so that their ranges are comparable to one another.
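
The three pre-processing steps above can be sketched as follows. This is a minimal, dependency-free illustration and not the code used in this work: the function names and the toy columns are hypothetical, and min-max scaling is an assumed choice, since the text does not name the scaling method.

```python
from statistics import mean


def fill_missing_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def label_encode(column):
    """Map each distinct category to an integer label (alphabetical order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column], mapping


def min_max_scale(column):
    """Rescale numeric values to the range [0, 1] (assumed scaling method)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical toy columns, not taken from the dataset.
age = [25, None, 45, 65]
gender = ["male", "female", "female", "male"]

age_filled = fill_missing_with_mean(age)      # missing age becomes 45
age_scaled = min_max_scale(age_filled)        # values now lie in [0, 1]
gender_encoded, gender_map = label_encode(gender)
```

Note that mean imputation is only meaningful for numeric columns; for a categorical column, the most frequent category would be a more natural fill value.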

Feature Selection

Feature selection is the process in which we select a subset of features from all the existing features depending upon their correlation with the final, deciding column. We carried out this step using two methods:

1. Plotted a Correlation Matrix.
2. Calculated the Covariance values.

Correlation Graph: It is a technique of plotting attributes against each other, wherein the correlation values let us decipher how the different attributes available in the dataset are related to each other and to what extent.

Fig. 4. A matrix depicting the correlation between different attributes

Attribute	Correlation Value
Tobacco	0.64
Alcohol	0.73
Viral-Infection	0.71
Swollen-Tonsil	0.67
Genetic-Risk	0.70
Bleeding-Mouth	0.61
Balanced-Diet	0.71
Obesity	0.83
Smoking	0.52
Passive-Smoking	0.62
Red-Spots	0.65
Coughing-Blood	0.78

TABLE I

ATTRIBUTES AND THEIR CORRELATION VALUES

The graph plotted above is an attempt to derive the relation between the 20 attributes we have covered in our dataset. The figure is a heat map, which allows us to identify the heated regions, the areas in orange, where the correlation is strongest.
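
As an illustration of how such a correlation matrix could be computed, the sketch below implements the Pearson correlation coefficient for every pair of columns. It assumes Pearson correlation, which the text does not state explicitly; the column names and values are hypothetical toy data with Level encoded as Low=0, Medium=1, High=2.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither column is constant (a constant column has zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy data, not taken from the dataset.
toy = {
    "tobacco": [1, 2, 3, 4, 5],
    "diet": [7, 6, 5, 4, 3],
    "level": [0, 0, 1, 2, 2],
}
matrix = correlation_matrix(toy)
```

Each entry lies between -1 and 1; a heat map such as Fig. 4 is simply a coloured rendering of this matrix.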

Covariance Values: Covariance values depict how the values of the target column vary with the values of the attributes that go into the making of the model, that is, the linear variation of the output variable with the intended features. The attributes with the highest covariance values in our dataset are given in the table in descending order. We used the Chi Square formula to obtain these values. After implementing both methods, the intersection of the attributes selected by the two outputs is retained.
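
The text names the Chi Square formula but does not give the exact computation, so the following is only a sketch of the classical chi-square statistic on a contingency table of one categorical feature against the target. The function name and toy columns are hypothetical, and library scorers may use a different variant that yields scores of a different magnitude.

```python
from collections import Counter


def chi_square_score(feature, target):
    """Chi-square statistic between a categorical feature and the target.

    Sums (observed - expected)^2 / expected over every (feature value,
    target value) cell, where expected assumes the two are independent.
    """
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    score = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            expected = f_count * t_count / n
            score += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return score


# Hypothetical toy columns, not taken from the dataset.
level = ["Low", "Low", "High", "High"]
features = {
    "tobacco": [0, 0, 1, 1],     # moves together with the level
    "dry_cough": [0, 1, 0, 1],   # independent of the level
}
scores = {name: chi_square_score(col, level) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A higher score indicates a stronger dependence between the feature and the target, so sorting in descending order gives a ranking of the kind listed in the table.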

Fig. 5. A barplot depicting the attributes and their Chi Sq. Covariance scores

Attribute	Covariance Value
Tobacco	818.668884
alcohol-use	781.909841
passive-smoker	752.959791
obesity	712.087562
smoking	671.006253
balanced-diet	588.933743
chest-pain	524.489521
fatigue	518.900446
air-pollution	518.631533
genetic-risk	488.649726
occupational-hazards	415.685654
dust-allergy	401.040883
shortness-of-breath	330.880709
chronic-lung-disease	302.396157
clubbing-of-finger-nails	257.907679
weight-loss	206.666563
wheezing	201.426189
frequent-cold	192.713276
dry-cough	152.029547
swallowing-difficulty	113.074249

TABLE II

ATTRIBUTES AND THEIR COVARIANCE VALUES

Training the model

In order to train the model to produce relevant predictions, we could opt for one of two approaches: a classification model or a regression model. We have chosen to build a classification model, since our deciding column has distinct categorical values; the model classifies the final column (Level) as High, Medium or Low. In building the classification model, we focused not only on achieving a solution but also on obtaining an optimized solution. To achieve an optimized result, it was essential to use the best-fitting algorithm, for which we carried out a comparative study of the algorithms available for supervised learning. The supervised learning technique deals with developing a model where, for a particular set of inputs, there exists a fixed set of corresponding outputs. To this end, we studied four supervised algorithms, namely K Nearest Neighbours, Decision Trees, Naive Bayes and Random Forest. [8]
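
To make the idea of a supervised classifier concrete, here is a minimal sketch of one of the four algorithms, K Nearest Neighbours, written without any library. It is an illustration only: the function name, the value k=3, the Euclidean distance and the toy points are all assumptions, not details taken from this work.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest points."""
    # Sort the labelled training points by Euclidean distance to the query.
    neighbours = sorted(
        zip(train_X, train_y), key=lambda pair: dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: two numeric features per user.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["Low", "Low", "Low", "High", "High", "High"]

low_prediction = knn_predict(train_X, train_y, (1.5, 1.5))
high_prediction = knn_predict(train_X, train_y, (6.5, 6.5))
```

KNN has no training phase; all of the work happens at prediction time, when the query is compared with every stored point.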

In order to train the model for supervised learning, we divided the dataset in an 80:20 ratio: 80 percent of the data (800 tuples) is used to train the model, and the remaining 20 percent (200 tuples) is used to test it. In order to select the algorithm that best suits the selected dataset and provides the most accurate result, we compared the accuracy scores obtained from the Confusion Matrix. A Confusion Matrix is a matrix plot that tells us the number of instances correctly and incorrectly classified by a particular algorithm.
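
The split and the accuracy calculation described above can be sketched as follows. The helper names are hypothetical and the seeded shuffle is an assumption; the predicted labels are invented solely so that 178 of 200 test instances are correct, and they are not the authors' actual predictions.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly and split them into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of instances on the diagonal (actual label == predicted label)."""
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / sum(matrix.values())


# 1000 placeholder tuples split 80:20 into 800 training and 200 test tuples.
train, test = train_test_split(list(range(1000)))

# Invented labels for illustration: 89 + 89 correct, 11 + 11 incorrect.
actual = ["Low"] * 100 + ["High"] * 100
predicted = ["Low"] * 89 + ["High"] * 11 + ["High"] * 89 + ["Low"] * 11
matrix = confusion_matrix(actual, predicted)
score = accuracy(matrix)
```

With 178 correct predictions out of 200, the accuracy evaluates to 0.89, i.e. 89 percent.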

The confusion matrix for the Naive Bayes algorithm is as follows:

Fig. 6. Confusion Matrix for Naive Bayes Algorithm

For the Naive Bayes algorithm:
True Positives: 178/200
False Positives: 22/200
Percentage Accuracy: 89 percent

The confusion matrix for the KNN algorithm is as follows:

Fig. 7. Confusion Matrix for KNN Algorithm

For the KNN algorithm:
True Positives: 200/200
False Positives: 0/200
Percentage Accuracy: 100 percent

The confusion matrix for the Decision Tree algorithm is as follows:

Fig. 8. Confusion Matrix for Decision Tree Algorithm

For the Decision Tree algorithm:
True Positives: 200/200
False Positives: 0/200
Percentage Accuracy: 100 percent

The confusion matrix for the Random Forest Classifier algorithm is as follows:

Fig. 9. Confusion Matrix for Random Forest Classifier Algorithm

For the Random Forest Classifier algorithm:
True Positives: 200/200
False Positives: 0/200
Percentage Accuracy: 100 percent

From our analysis we understood that, except for the Naive Bayes algorithm, all the other algorithms gave us 100 percent accuracy. Therefore, in order to select an algorithm from the other three, we took into consideration the time complexities of the algorithms, to get an idea of the amount of time each requires to render the output. The complexities of the algorithms are listed in the table:

Algorithms	Training	Prediction
Decision Tree	O(n²p)	O(p)
Random Forest	O(n²p·n_trees)	O(p·n_trees)
KNN	–	O(np)

TABLE III

ALGORITHMS AND THEIR TRAINING AND PREDICTION COMPLEXITIES

From the table it is evident that the Decision Tree algorithm needs the least time to render and produce the output. Therefore, based on its accuracy (100 percent) and time complexity (O(n²p) for training, O(p) for prediction), we decided to use Decision Tree to train our model.

And using these steps, we successfully trained our model.

LIMITATIONS

The limitations of our approach are as follows:

CONCLUSION

The detailed description of the procedure given above is based on the supervised learning methodology. We selected this method because it was best suited to our dataset, as the data was labelled. As we worked on this project, we observed the following advantages and disadvantages of this approach:

Advantages:
1. The model provides fast and efficient outputs for data that is labelled and has a consistent set of outputs for corresponding inputs.
2. In supervised learning, we have an exact idea about the classes of the objects.

Disadvantages:
1. It is not suitable for handling complex tasks.
2. The training requires a lot of computation time.
3. It might not provide accurate answers for inputs that deviate a lot from the training dataset.

REFERENCES

[1] Arushi Tetarbe, Tanupriya Choudhary, Teoh Yiek Toe, Seema Rawat, "Oral Cancer Detection using data mining tool," IEEE, 2017.
[2] Ajay Kumar, Rama Sushil, Arvind Kumar Tiwari, "Machine Learning based Approaches for Cancer Prediction: A Survey," 2019.
[3] Madhura V, Meghana Nagaraju, Namana J, Varshini S, Rakshitha R, "Survey Paper on Oral Cancer Detection using Machine Learning," 2019.
[4] Madhura V, Meghana Nagaraju, Namana J, Varshini S, Rakshitha R, "Oral Cancer Detection Using Machine Learning," 2019.
[5] DongWook Kim, Sanghoon Lee, Sunmo Kwon, Woong Nam, In-Ho Cha, Hyung Jun Kim, "Deep learning-based survival prediction of oral cancer patients," 2019.
[6] Sandhya N. Dhage, "A Review on Early Detection of Oral Cancer using ML Techniques," 2019.
[7] K. Lalithamani, A. Punitha, "Detection of Oral Cancer using Deep Neural Based Adaptive Fuzzy System in data mining techniques," 2019.
[8] Lavanya, Dr. Chandra J, "Oral Cancer Analysis Using Machine Learning Techniques," 2019.
[9] Fatihah Mohd, Noor Maizura Mohamad Noor, Zainab Abu Bakar, Zainul Ahmad Rajion, "Analysis of Oral Cancer Prediction using Features Selection with Machine Learning," 2015.
[10] Shikha Agrawal, Jitendra Agrawal, "Neural Network Techniques for Cancer Prediction: A Survey," 2016.
[11] Mrs. R. Vidhu, Mrs. S. Kiruthika, "A New Feature Selection Method for Oral Cancer Using Data Mining Techniques," 2016.
[12] Zahraa Naser Shah Weli, "Data Mining in Cancer Diagnosis and Prediction: Review about Latest Ten Years," 2020.
[13] Konstantina Kourou, George Rigas, Konstantinos P. Exarchos, Costas Papaloukas, Dimitrios I. Fotiadis, "Prediction of Oral Cancer Recurrence using Dynamic Bayesian Networks."
[14] N. Anitha, K. Jamberi, "Diagnosis and Prognosis of Oral Cancer using classification algorithm with Data Mining Techniques."
[15] Sharma N., Om H., "Using Data Mining For Oral Cancer Risk Stratification in terms of Age, Gender And Socioeconomic Status," 2013.
[16] Konstantina Kourou, Themis P. Exarchos, Konstantinos P. Exarchos, Michalis V. Karamouzis, Dimitrios I. Fotiadis, "Machine learning applications in cancer prognosis and prediction," 2015.
