Detection and Prediction of Frequent Diseases in India through Association Technique using Apriori Algorithm and Random Forest Regression

Data mining is the process of analyzing large volumes of data from different perspectives and summarizing them into useful information. This information can be converted into knowledge about historical patterns and future trends. The health care industry generates large amounts of complex data about patients, hospital resources, diseases, diagnosis methods and electronic patient records. Data mining techniques help in making appropriate medical decisions for treating diseases. Healthcare data can be “mined” to discover hidden information or patterns for effective decision making. The discovered knowledge can be used by healthcare administrators to improve the quality of service and provide better facilities to patients. Mining frequent diseases helps clinicians make better diagnostic decisions and curb the occurrence of these diseases in society. This paper proposes a systematic approach that uses the association rule based Apriori data mining technique to identify frequent diseases. Further, Random Forest Regression is used to predict the number of cases of the frequent diseases in a particular geographical area over a given time period, which can benefit patients in remote areas.


I. INTRODUCTION
Computer Science is a field which consists of various techniques and technologies related to Data Science, Artificial Intelligence, Internet of Things and Graphical Visualization. With the wide use of databases and the explosive growth in their sizes, data mining can be described as the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data [1]. Data mining refers to extracting or "mining" knowledge from large amounts of data. It is the search for relationships and global patterns that exist in large databases but are hidden among large amounts of data. The essential step of the Knowledge Discovery process, the conversion of data into knowledge to aid decision making, is referred to as data mining. The Knowledge Discovery process consists of an iterative sequence of data cleaning, data integration, data selection, data mining, pattern recognition and knowledge presentation.
Data Mining consists of four major functionalities: Pre-processing, Association, Classification and Clustering. Association is a data mining function that discovers the probability of the co-occurrence of items and interesting patterns in a collection. It generates frequent item sets based on a few thresholds, and the relationships between co-occurring items are expressed as association rules. The association technique works with three metrics: support, confidence and lift. Attributes such as the state name, the disease and the number of cases in a particular time period are used to find the frequent diseases in a particular geographical area. This paper proposes a methodology that uses the Association technique to generate interesting patterns from health data. We use the Apriori algorithm to mine the frequent diseases in each state of India based on a minimum support threshold.
The paper is organized as follows: Section I introduces the work; Section II describes the importance of health data; Section III presents the categories of diseases in India; Section IV presents the Apriori algorithm and Random Forest Regression; Section V deals with the methodology together with the experimental results; and Section VI presents the conclusion.
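The three association metrics named in the introduction can be illustrated with a short sketch. The code below is a minimal Python illustration under stated assumptions: the function names and the toy records are hypothetical and are not taken from the National Health Profile data or from the implementation used in this work.

```python
from typing import FrozenSet, List

def support(transactions: List[FrozenSet[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent) -> float:
    """Estimated probability of `consequent` given `antecedent`."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent) -> float:
    """Confidence of the rule divided by the support of `consequent`.
    A value above 1 suggests the items co-occur more often than by chance."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical toy records: each record is a set of diseases reported together.
records = [
    frozenset({"Dengue", "Malaria"}),
    frozenset({"Dengue", "Chikungunya"}),
    frozenset({"Dengue", "Malaria", "Typhoid"}),
    frozenset({"Typhoid"}),
]

malaria = frozenset({"Malaria"})
dengue = frozenset({"Dengue"})
print(support(records, dengue))              # 0.75
print(confidence(records, malaria, dengue))  # 1.0
print(lift(records, malaria, dengue))        # about 1.33
```

In this toy example the rule Malaria => Dengue has a lift above 1, which would mark it as a potentially interesting pattern.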

II. IMPORTANCE OF HEALTH DATA
Health data is considered one of the most dynamically generated kinds of data [2]. A gradual increase in the number of health cases registered across the world every day can be observed. This data is among the most interesting and challenging collections to mine, and it is used to obtain interesting and hidden patterns. Mining health data and predicting future conditions not only helps clinicians and medicos improve their quality of service in treating patients but also helps in anticipating future ailments.
Knowledge discovery from health data helps to identify the frequent diseases in a particular geographical area over a given time period and provides a better understanding of their root causes. This gives researchers a clear picture and helps them identify the different factors triggering a particular disease, so that it can be tackled at its source. Proper vaccines and medicines can then be used, and awareness can be spread among the public to take the necessary precautions, so that epidemics and pandemics can be controlled efficiently. This can reduce the mortality rate and help society lead a healthy life.
India has witnessed remarkable progress in the health status of its population. However, over the past few decades, there have been major transitions in the health care field that have had a serious impact on the health of patients. Changes have occurred in economic development, nutritional status, and fertility and mortality rates. Consequently, the disease profile has changed considerably. Communicable diseases such as Malaria, Kala-azar, Dengue, Chikungunya and Acute Encephalitis have been recorded as the most frequently occurring diseases in the country. Though there have been substantial achievements in controlling communicable diseases, they still contribute significantly to the disease burden of the country. The decline in mortality from communicable diseases has been accompanied by a gradual shift towards chronic non-communicable diseases (NCDs) such as cardiovascular disease (CVD), diabetes, chronic obstructive pulmonary disease (COPD), cancers and mental health disorders.

III. DISEASES PREVALENT IN INDIA
India is the second most populous country in the world, and with almost one-fifth of the world's population living in India, the health status and the drivers of health loss are expected to vary between different parts of the country and between the states. Accordingly, effective efforts to improve population health in each state require systematic knowledge of the local health status and trends. The diseases that tend to occur in the states can be broadly classified into two types: Communicable and Non-Communicable diseases.

A. Communicable Diseases
The National Health Profile reports for 2016, 2017, 2018 and 2019 list the following diseases:
• MALARIA
Malaria is a mosquito-borne infectious and communicable disease whose symptoms typically include fever, tiredness, vomiting and headache. It is caused by single-celled microorganisms of the Plasmodium group and is most commonly spread by infected female Anopheles mosquitoes [3].
• KALA-AZAR
Visceral leishmaniasis (VL), also known as kala-azar, is the most severe form of leishmaniasis and, without proper diagnosis and treatment, is associated with high fatality. This disease is caused by protozoan parasites of the genus Leishmania.
• CHIKUNGUNYA
Chikungunya is an infection caused by the chikungunya virus (CHIKV). Symptoms include fever and joint pains. Other symptoms may include headache, muscle pain, joint swelling and a rash. The virus is transmitted by two mosquito species, Aedes albopictus and Aedes aegypti [4].
• DENGUE
Dengue fever is caused by the dengue virus. The symptoms may include a high fever, headache, vomiting, muscle pains, joint pains and a skin rash. Dengue is spread by several species of female Aedes mosquito, mainly A. aegypti.
• TYPHOID
Typhoid fever is a serious disease spread by contaminated food and water. Symptoms of typhoid include high fever, weakness, stomach pains, headache and loss of appetite. It is a bacterial infection caused by Salmonella.
• PNEUMONIA
Pneumonia is an inflammation of the lungs that causes respiratory symptoms such as cough, fever, chills and difficulty in breathing.
• ENCEPHALITIS
Encephalitis is an inflammation of the brain which is mostly caused by viruses. Viruses that cause encephalitis include herpes viruses, West Nile virus, Japanese encephalitis virus and tick-borne viruses.
• JAPANESE ENCEPHALITIS
This disease is caused by the Japanese encephalitis virus and is spread by Culex mosquitoes, which bite mainly during the night or just after sunset.
The following pie-charts depict the morbidity rates from the years 2017 and 2018 [5].

B. Non-Communicable Diseases
• CARDIOVASCULAR DISEASES
Cardiovascular disease (CVD) is a class of diseases that involve the heart or blood vessels. CVDs include stroke, heart failure, hypertensive heart disease, rheumatic heart disease, cardiomyopathy, abnormal heart rhythms, congenital heart disease, valvular heart disease, carditis, aortic aneurysms, peripheral artery disease, thromboembolic disease and venous thrombosis [7].
• CANCER
Cancer, also called malignancy, is an abnormal growth of cells. There are more than 100 types of cancer, including breast cancer, skin cancer, lung cancer, colon cancer, prostate cancer and lymphoma. Symptoms vary depending on the type [8].
• CHRONIC RESPIRATORY DISEASE
Chronic respiratory diseases are chronic diseases of the airways and other parts of the lung. Some of the most common are asthma, chronic obstructive pulmonary disease (COPD), lung cancer, cystic fibrosis, sleep apnoea and occupational lung diseases [9].
• DIABETES
Diabetes is a chronic disease that occurs either when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces. Insulin is a hormone that regulates blood sugar. Uncontrolled diabetes leads to raised blood sugar, which over time damages especially the nerves and blood vessels.

IV. APRIORI ALGORITHM AND RANDOM FOREST REGRESSION

A. Apriori Algorithm
The Apriori algorithm belongs to the association functionality of data mining and is used for mining frequent item sets from a large dataset. Its major advantages over other association algorithms are that it can work on large datasets and is quite easy to understand and implement. The frequent item sets found by Apriori can be used to derive association rules which highlight general trends in the database. The algorithm is applied in domains such as Market Basket Analysis and Hospital Information Systems. It relies on the Apriori property, which states:
• Every subset of a frequent item set is also frequent.
• Every superset of an infrequent item set is also infrequent.
This property makes it possible to prune large parts of the search space while extracting the frequent item sets. The Apriori algorithm performs the following two operations:
• Join Operation: Candidate k-itemsets are generated by joining the frequent (k-1)-itemsets of the previous pass with themselves.
• Prune Operation: Any candidate itemset that has an infrequent subset cannot be frequent and is removed, because an infrequent itemset cannot be a subset of a frequent itemset.
The Apriori algorithm is the most classical and important algorithm for mining frequent item sets. By the Apriori principle, any subset of a frequent itemset must also be frequent. Example: if {X, Y} is a frequent itemset, both {X} and {Y} must be frequent item sets [10]. The main idea of the Apriori algorithm is to make several passes over the database. It employs an iterative, level-wise approach that amounts to a breadth-first search. In each subsequent pass, items are added to the itemsets found so far to form larger itemsets, called candidate itemsets.
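The join and prune operations and the level-wise passes described above can be sketched as follows. This is a minimal Python illustration, not the implementation used in this work: the function name `apriori`, the parameter `min_support_count` and the toy transactions are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set

def apriori(transactions: Iterable[Set[str]], min_support_count: int) -> Dict[FrozenSet[str], int]:
    """Return every frequent itemset together with its support count."""
    data = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # One pass over the data: count each candidate, keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for t in data:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support_count}

    # Pass 1: frequent single items.
    frequent = count_frequent({frozenset([item]) for t in data for item in t})
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(frequent), 2) if len(a | b) == k}
        # Prune: drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        frequent = count_frequent(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Hypothetical toy transactions, for illustration only.
baskets = [{"X", "Y", "Z"}, {"X", "Y"}, {"X", "Z"}, {"Y"}]
result = apriori(baskets, min_support_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 2, the candidate {X, Y, Z} is pruned without being counted, because its subset {Y, Z} occurs only once and is therefore infrequent.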
At the end of each pass, we determine which of the candidate item sets are actually large (frequent), and the passes continue until no more frequent itemsets can be found. Apriori thus follows a bottom-up approach in scanning and pruning the dataset. The steps in applying the Apriori algorithm to a dataset are given below:
Step 1: Consider a dataset consisting of 'n' transactions.
Step 2: Calculate the minimum support count based on the transaction count.
Step 3: In each iteration, candidate item sets are generated, and those whose support count is greater than or equal to the minimum support count are added to the frequent item sets. The iterations continue until the candidate set becomes empty, that is, takes the value 'φ'.
Step 4: Once all the frequent item sets have been found, the algorithm terminates.

V. METHODOLOGY AND RESULTS
The approach presented in our work can be effective in finding the frequent diseases in a particular geographical area and predicting the trend of the cases in the future. This leads to improved decision-making and treatment, as India has ample evidence of impacts attributed to the mismatch between the disease burden and its causal factors. Interventions adopted for treatment and priorities in resource allocation can thus be set appropriately. This paper deals with the demography of India, whose disease history shows great fluctuation, with the aim of predicting the frequently occurring diseases at various locations.

A. The Dataset
The instances of the dataset have been derived from a series of editions of the National Health Profile published by the Central Bureau of Health Intelligence, Government of India, in collaboration with the World Health Organization (WHO), various Central Ministries and all the state/union territory health departments. The data in these editions provide comprehensive information related to the health sector. The issues of the National Health Profile from 2015-2019 provide vital information on all major health sector related indicators, demography, socio-economics, health status, health finance, health infrastructure and human resources for the specified calendar year. The dataset consists of seven attributes:
• Total number of cases
Each tuple shows the trend of the number of cases reported for a particular disease.

B. Proposed Work
An updated and reliable health database is the foundation of decision-making across all health system building blocks. It is essential for health system policy development and implementation, governance and regulation, health research, human resources development, health education/training, service delivery and financing. To achieve some of these objectives, the Central Bureau of Health Intelligence collects data from various sectors, ensuring their overall quality, relevance and timeliness. The data is then converted into relevant information to support planning, management and decision making.
There are 27 diseases in total in the dataset, whose frequency across India's 29 states and 7 union territories has been determined.
The methodology of our proposed work follows the steps below:
Step 1: First, the cumulative number of cases for each state and union territory is calculated from the total number of cases attribute of the dataset.
Step 2: This cumulative number of cases is considered as the total number of transactions registered for a particular state or union territory (UT).
Step 3: Now the minimum support threshold is set. Since the dataset is quite large in this case, the minimum support threshold is taken within the range of 1% to 5% of the total number of transactions. The formula is:

Min. Support (State/UT) = p × total_transactions (State/UT), where p is one of 0.01, 0.02, 0.03, 0.04 or 0.05

Step 4: The states and union territories need not share the same health indicators, such as demography, climatic conditions and health finance infrastructure; accordingly, the occurrence of diseases may vary from state to state. For this reason, a separate minimum support threshold is computed for each state and union territory, so as to identify the epidemics that are recurrent in that particular place.
Step 5: After the minimum support threshold is set, every disease whose total number of cases is greater than or equal to the minimum support threshold is added to the candidate set as a frequent disease. This check is repeated for all the other 26 diseases to find the frequent diseases in that particular State/UT. With the help of a user-interactive window, the user can select the state whose frequent diseases he/she would like to view, along with the graph for the same. A second window displays, for a disease, graphs showing the areas of India in which that disease has been reported. A sample output is:
In Andhra Pradesh frequently occurring diseases are ===> Acute-Diarrhoeal-Diseases, Typhoid, Acute-Respiratory-Infection, Diabetes, Hyper tension
Step 6: To make use of the frequent diseases found, the number of cases that would occur in the future is predicted from the previous trends with the help of a Random Forest Regressor. This regressor was chosen for our work because, being an ensemble of decision trees, it is less prone to overfitting the training samples, which improves the R² score, a measure of how closely the predicted values fit the data. This step can be regarded as an extension of the frequent disease mining.
Step 7: The diseases that are infrequent and the diseases that are frequent in only a few areas of India are also listed. This helps to identify diseases whose rate of occurrence is low compared to others, so that they can be curbed or controlled at an early stage.

V.III. FUTURE SCOPE
The proposed method is useful for identifying the frequent diseases in a large medical dataset collected from an authentic source in India during a stipulated time period from 2015-2018. The end result of this research will help clinicians and other decision makers in making medical decisions for frequently occurring diseases.
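The per-state thresholding at the core of the proposed method (Steps 1 to 5 of the proposed work) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `frequent_diseases`, the `(state, disease, cases)` row layout and the sample figures are hypothetical and do not reproduce the actual dataset or code of this work.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def frequent_diseases(rows: Iterable[Tuple[str, str, int]], ratio: float = 0.01) -> Dict[str, List[str]]:
    """Return the frequent diseases of each state/UT.

    rows  : (state, disease, cases) tuples (assumed layout)
    ratio : fraction of the state's total cases used as minimum support,
            taken between 0.01 and 0.05 in the text
    """
    cases_per_state: Dict[str, Dict[str, int]] = defaultdict(dict)
    for state, disease, cases in rows:
        per_disease = cases_per_state[state]
        per_disease[disease] = per_disease.get(disease, 0) + cases

    result: Dict[str, List[str]] = {}
    for state, per_disease in cases_per_state.items():
        total_transactions = sum(per_disease.values())   # Steps 1 and 2
        min_support = ratio * total_transactions         # Steps 3 and 4
        result[state] = sorted(                          # Step 5
            d for d, n in per_disease.items() if n >= min_support
        )
    return result

# Hypothetical figures, for illustration only.
sample_rows = [
    ("State A", "Typhoid", 900),
    ("State A", "Kala-azar", 60),
    ("State A", "Malaria", 40),
    ("State B", "Malaria", 300),
    ("State B", "Typhoid", 5),
]
print(frequent_diseases(sample_rows, ratio=0.05))
# {'State A': ['Kala-azar', 'Typhoid'], 'State B': ['Malaria']}
```

Because the threshold is derived from each state's own total, the same case count can be frequent in one state and infrequent in another, which is the behaviour Step 4 asks for.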
The predictions and analysis can be extended to classify the diseases into three or four classes based on their rate of occurrence, so that appropriate decisions can be taken to control the problems faced by patients and society. Further, this work can be extended to unexpected pandemics, with predictions that help a country prepare to face the invisible enemy.
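Step 6 of the proposed work judges the predicted case counts by the R² score. As a reminder of what that score measures, the sketch below computes it from its usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2` and the case counts are hypothetical and are not results of this work.

```python
from typing import Sequence

def r2(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot

# Hypothetical yearly case counts and predictions, for illustration only.
actual_cases = [100, 200, 300]
predicted_cases = [110, 190, 300]
print(round(r2(actual_cases, predicted_cases), 4))  # 0.99
```

A score of 1 means the predictions match the observed counts exactly, while a score near 0 means the model does no better than always predicting the mean.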

VI. CONCLUSION
In this paper, a methodology has been presented for analysing the health data of diseases in different areas or states of India. The dataset was examined and prepared for use as per the requirements. By applying the association principle through the Apriori algorithm, following the steps described, the frequent diseases in different areas have been obtained and presented from different angles.
The ultimate idea behind this process is that, once the frequent diseases in the different states or UTs of India have been discovered, the necessary steps can be taken for their prevention in the coming years, and the relevant medicines can be kept ready before the next year or season. Such prediction can be a great help for local medicos and, most importantly, for new patients in remote areas, who can be offered exactly the required medicines, thus preventing deaths. The problem recurs every year, and this approach offers a systematic way of tackling it. Finally, the aim of the work is to protect people as well as possible and to help the local governing bodies and, hence, society.