Twitter Sentiment Analysis with Diabetic Drugs Using Machine Learning Techniques with Glowworm Swarm Optimization Algorithm

Twitter is a social media platform on which people share their opinions and sentiments about day-to-day life with friends and the general public. Many people use Twitter to communicate the side effects and benefits of diabetic medicines. Other people in turn seek out these posts to gain feedback regarding their own Adverse Drug Reactions (ADR). Opinion mining of Twitter data is an area that has experienced enormous growth in the last decade, and various Machine Learning (ML) techniques and tools have been created for this purpose. In this paper, ML techniques are used in opinion analysis to process information about ADR on taking the diabetic drug metformin (generic and brand name). The aim of this paper is to identify the optimal ML algorithm. Glowworm Swarm Optimization (GSO) is used for optimal feature selection and is combined with the classification algorithms Naïve Bayes (NB), K-Nearest Neighbor (KNN) and Support Vector Machine (SVM). The experimental results show that the GSO+SVM combination achieved the maximum accuracy of 94%.
Keywords: Twitter data, metformin, GSO, Naïve Bayes, KNN, SVM, ADR


I. INTRODUCTION
Diabetes Mellitus (DM) is a metabolic disease in which a person's blood sugar levels are elevated. There are two types of DM, type I and type II. Type I DM afflicts children and adolescents and requires insulin therapy. Type II DM occurs in the older population and requires Oral Anti Diabetic drugs (OAD), or insulin and OAD in combination. Each OAD has its benefits as well as adverse drug reactions. Hence it is essential for anti diabetic drugs to undergo pharmacovigilance to detect ADR, thereby helping physicians prevent avoidable harm to diabetic patients [20]. Metformin (generic and branded) is a commonly used drug for type II DM. This study focuses on the opinions of patients taking metformin regarding ADR, obtained from Twitter messages.
Twitter is a social media platform that people use to communicate about their health concerns and share their experiences. The number of messages shared on Twitter about drug benefits and ADR is massive, making it an ideal resource for pharmacovigilance and early intervention [1,2]. Intelligent systems need to be created that are accessible to patients, so that they become aware of the ADR of DM drugs from patient reviews. Fig 1 depicts the architecture of the workflow.
In this article, Section II deals with related works in classification and feature selection methods. Section III deals with data collection, preprocessing, feature extraction, feature selection methods and ML classification algorithms. Section IV deals with the dataset description, performance evaluation and comparative analysis of the ML algorithms. Section V deals with the conclusion.

Fig1. System Architecture

II. RELATED WORKS
A lot of research has been completed on analysing tweet sentiments with ML techniques; some of these works are described in this section. In one study, movie reviews taken from Twitter messages are analysed using different machine learning algorithms and classified as positive, neutral or negative. Emojis and symbols that express sentiments are excluded in this study. Likewise, words with repeated letters used to express sentiments are also ignored [Tripathy et al] [2]. Sentiment analysis of Arabic micro blogs has been done using the deep learning systems LSTM and GRU, each tested in the forward and backward directions [Moslmi et al] [3]. Various feature selection methods and classification algorithms have been used for Malay sentiment classification. Among these, the SVM classifier with an information gain based feature selection method gave the best performance, with an accuracy of 85.33% [Azani et al] [4]. The study of Hager et al [21] deals with the prediction of heart disease from real-time medical data about people's present health status. Various machine learning algorithms are used to identify the optimal predictor. The random forest classifier proved to have the best performance in predicting heart disease, with an accuracy of 94.9%, thus helping patients prevent cardiac catastrophe.

A. Data Collection
Twitter data about metformin and its related branded medicines were used in this research to build the classification model. Twitter offers a streaming Application Programming Interface (API) that permits users to gather real-time data. An API is a tool that makes communication between computer programs and web services easy. Several tools, such as Python, JavaScript and R, provide services to connect to the Twitter network and access its data efficiently. Here, the R tool was used to search for recently posted tweets and extract them in real time. Before collecting tweets, the drug names must be defined as the 'keywords' for tracking; the gathered data is stored in a .csv file. The figure below illustrates the process of data collection.

Fig2. Dataset Retrieval process
Obtaining the Twitter API keys (API key, API secret, access token and access token secret) is an important step in accessing the Twitter Streaming API. The library 'Tweepy' is used to access the Twitter Streaming API and retrieve the Twitter data. Primarily, the drugs should have been on the market to treat substantial diseases, so that adequate tweets are available for assessing their impact. The tweet dataset was retrieved using the drug keywords, and the tweets were gathered from across the world [13].
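
The keyword tracking and .csv storage step can be sketched as follows. This is a minimal illustration, not the authors' collection script: it assumes the tweets have already been retrieved by a client library and are available as dictionaries with `id` and `text` keys, and the keyword list, the field names and the function `save_matching_tweets` are hypothetical.

```python
import csv
import io

# Hypothetical keyword list: the paper tracks metformin and its brand names,
# but the exact brand names used are not listed in the text.
DRUG_KEYWORDS = ["metformin", "glucophage"]


def save_matching_tweets(tweets, keywords, out_file):
    """Write tweets whose text mentions any keyword to a CSV file object.

    `tweets` is assumed to be an iterable of dicts with 'id' and 'text' keys,
    already retrieved from the Twitter API by some client library.
    Returns the number of rows written (excluding the header).
    """
    writer = csv.writer(out_file)
    writer.writerow(["id", "text"])
    written = 0
    for tweet in tweets:
        text = tweet["text"]
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            writer.writerow([tweet["id"], text])
            written += 1
    return written


sample = [
    {"id": 1, "text": "Metformin gives me nausea every morning"},
    {"id": 2, "text": "Lovely weather today"},
]
buffer = io.StringIO()  # stands in for an open .csv file
count = save_matching_tweets(sample, DRUG_KEYWORDS, buffer)
```

Matching is case-insensitive here so that "Metformin" and "metformin" are both tracked; whether the original collection was case-sensitive is not stated.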

B. Pre-Processing
Applying text preprocessing steps before analyzing the tweets is very important for achieving good results. The preprocessing stage involves several steps, namely URL removal, punctuation removal, user name removal, letter casing, tokenizing, stop word removal, stemming and lemmatization, to produce a standard dataset [14]. Once these steps are completed, the next main stage is feature extraction, which is the extraction of valuable words from the tweets.
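
Most of the preprocessing steps listed above can be sketched with the Python standard library. The stop-word list below is a small illustrative assumption (the paper does not give its list), and stemming and lemmatization are left out because they normally rely on an external NLP library.

```python
import re
import string

# Small illustrative stop-word list (an assumption; the paper's list is not given).
STOP_WORDS = {"a", "an", "the", "is", "are", "i", "me", "my", "and", "to", "of", "it"}


def preprocess(tweet):
    """Apply URL, user-name and punctuation removal, lower-casing,
    tokenizing and stop-word removal to a single tweet."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", tweet)  # URL removal
    text = re.sub(r"@\w+", " ", text)                     # user-name removal
    text = text.lower()                                   # letter casing
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = text.split()                                 # tokenizing on whitespace
    return [t for t in tokens if t not in STOP_WORDS]     # stop-word removal


tokens = preprocess("@doc Metformin is making me dizzy!! see https://example.com")
```

URLs and user names are removed before punctuation so that the `@` and `://` markers are still present when the patterns are applied.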

C. Feature Extraction
In the feature extraction model, only a few selected words are identified as features: those whose presence in the dataset expresses an opinion (side effect) about metformin. The common side effects associated with metformin (as declared by WHO) are used in this work for the feature extraction process. They are illustrated in the figure below.

Fig3. Common Side effects of Metformin
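
Extraction by keyword presence can be sketched as a count of side-effect terms in a tokenized tweet. The term list is an assumption made for illustration, since the contents of Fig. 3 are not reproduced in the text, and `extract_features` is a hypothetical helper.

```python
# Illustrative side-effect terms; the paper's actual list is in Fig. 3,
# which is not reproduced in the text, so these entries are assumptions.
SIDE_EFFECT_TERMS = ["nausea", "vomiting", "diarrhea", "headache", "dizziness"]


def extract_features(tokens, terms=SIDE_EFFECT_TERMS):
    """Return a dict mapping each side-effect term to its count in `tokens`."""
    return {term: tokens.count(term) for term in terms}


features = extract_features(["metformin", "nausea", "and", "more", "nausea", "headache"])
# A tweet is treated as carrying a side-effect opinion if any term occurs.
mentions_side_effect = any(count > 0 for count in features.values())
```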
For the Twitter data analysis, the Twitter platform was taken as the main source, and the attributes Favourite, Favourite count, Truncated, Re-tweet, isRetweet and count of retweets were taken as the inputs. These attributes are the main characteristics of the Twitter platform that are normally used in such studies, and they reflect the logic of message sharing. To identify the effects on individuals, these attributes reflect the usage of the medicine through the messages. The pseudo code used to represent the class label is as follows:

Pseudo code:
Consider Twitter messages stating that medicine 'X' (generic and branded) has side effects 'Y'
If message is favourite && favourite count = count + 1 && retweets = retweets + 1
        (i.e. retweet count + 1 && isRetweet == true)
    Then the number of retweets represents the affected people, who are experiencing the same side effects
Else if the number of retweets is low
    Then fewer people are affected, and they are experiencing the same side effects
Else
    Not affected

A favourite indicates that a message from a person reporting more or fewer side effects of a particular medicine was liked by other people; it shows the resemblance among like-minded people. A retweet is the repetition of another user's tweet; it confirms that people experienced the same side effects, or no side effects, for the medicine.
The label is set to '1' if Favourite, Retweet count and Favourite count are all equal to or greater than one; otherwise it is set to '0'. In this way, Python code was written to extract the values from the medicine-related messages posted on Twitter. These attributes represent the impact of side effects on individuals through their sharing of messages and marking them as favourites. Hence the feature engineering adopted deals with the count values and the retweet mechanism, which are used as distinct identifiers for the identification and classification of side effects.
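
The labelling rule stated above can be written as a short function. This is one reading of the rule rather than the authors' code: the argument names are hypothetical and mirror the Twitter attributes listed earlier, with `favorited` treated as a boolean flag.

```python
def side_effect_label(favorited, favorite_count, retweet_count):
    """Return 1 when the tweet is favourited and both counts are >= 1, else 0.

    This is one reading of the labelling rule stated above; the argument
    names are hypothetical and mirror the Twitter attributes listed earlier.
    """
    engaged = bool(favorited) and favorite_count >= 1 and retweet_count >= 1
    return 1 if engaged else 0


labels = [
    side_effect_label(True, 3, 2),   # favourited and retweeted
    side_effect_label(True, 1, 0),   # favourited but never retweeted
    side_effect_label(False, 0, 0),  # no engagement at all
]
```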

D. Feature Selection
Feature selection plays a significant role in sentiment analysis by identifying relevant features and increasing classification (machine learning) accuracy. Glowworm Swarm Optimization (GSO) is used here as the feature selection technique; it translates the biological system of glowworm luminescence into a mathematical representation. The theory of GSO is as follows: 'n' glowworms are available in the search space. Initially each glowworm has a specific luciferin intensity value, which changes as the glowworm moves through the search space. A glowworm with higher intensity attracts glowworms with lower intensity. Glowworms are grouped together when they fall within the circular range, and glowworms that fall outside the perception range are omitted.
The rule of the glowworm search method appears in Fig. 4, where the three glowworm entities are 'a', 'b' and 'c', with search radii 'ra', 'rb' and 'rc'. The search radius of 'a' is bigger than that of 'b', and 'b' is situated inside the search range of 'a'. If the luminosity of 'a' is stronger than that of 'b', the latter will move towards the former. Since 'c' is not inside the search range of 'a', neither of them will move towards the other, regardless of whose brightness level is stronger [15].
Fig4. The searching rule plan of 'GSO'
According to the bionic principle, the two major characteristics of GSO are luminance and attractive degree. The formulas below describe the glowworm luminescence properties. The relative fluorescence luminance is given by Eq. (1), where 'I0' is the luminosity of the glowworm at r = 0. A higher luminance corresponds to a better value of the target function. 'λ' is a constant denoting the light intensity absorption coefficient, which describes the degree to which the fluorescence weakens. 'γij' is the distance between glowworms 'i' and 'j'.
The attractive degree is given by Eq. (2), where 'β0' is the attractive degree at r = 0, that is, the maximum attractive degree. The location update is calculated using Eq. (3).
In Eq. (3), 's' denotes the step size factor, which is set as a constant in the interval [0, 1]. 'xi' and 'xj' are the spatial positions of glowworms 'i' and 'j'. 'rand' denotes a random element whose value lies in the interval [0, 1].
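
The roles of luminance, attractive degree and location update can be illustrated with a short sketch. The functional forms used here are assumptions: an exponential decay with distance for luminance, a decay with squared distance for the attractive degree, and a random term centred on zero in the update, which are common choices in firefly-style and glowworm-style algorithms. They are not taken from the paper's equations, and the function names are hypothetical.

```python
import math
import random


def luminance(i0, lam, dist):
    """Relative luminance at distance `dist` (assumed exponential decay)."""
    return i0 * math.exp(-lam * dist)


def attraction(beta0, lam, dist):
    """Attractive degree at distance `dist` (assumed decay with squared distance)."""
    return beta0 * math.exp(-lam * dist ** 2)


def move_towards(xi, xj, beta0, lam, step, rng=random):
    """Move glowworm i towards a brighter glowworm j and return the new position.

    The random term (rand - 0.5) is an assumed centring of 'rand' in [0, 1].
    """
    dist = math.dist(xi, xj)
    beta = attraction(beta0, lam, dist)
    return [
        a + beta * (b - a) + step * (rng.random() - 0.5)
        for a, b in zip(xi, xj)
    ]


# With step=0 the move is deterministic: each coordinate moves by beta * (xj - xi).
new_position = move_towards([0.0, 0.0], [1.0, 1.0], beta0=1.0, lam=0.5, step=0.0)
```

In a feature selection setting each position would encode a candidate feature subset and the luminance would be driven by classifier performance on that subset; how the paper encodes this is not described, so the sketch stops at the movement rule.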

E. Machine Learning Algorithms for Classification
ML plays a vital role in opinion classification. This section discusses the ML classification algorithms NB, KNN and SVM, and describes the characteristics and working methodology of each.

Naïve Bayes (NB) Algorithm
NB is a robust machine learning classifier. The method is derived from Bayes' theorem, and the NB classifier is built on the independence assumption: it presumes that the effect of a particular attribute in a class is independent of the other attributes [16]. The prediction is evaluated using Bayes' theorem:
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} \tag{4}$$
P(h) is the prior probability of hypothesis h being true. P(D) is the probability of the data D regardless of the hypothesis. P(h|D) is the posterior probability of hypothesis h given the data D, and P(D|h) is the likelihood of the data D given that hypothesis h is true. The classes of a dataset can be easily predicted by applying the NB classifier, and multi-class prediction is also possible. When the presumption of independence is well founded, NB performs better than the other algorithms. Additionally, it needs less training data. If a categorical variable takes a category that was not observed in the training set, the model assigns it a probability of '0', which prevents it from making a prediction. NB also assumes independence among its features; in practice, it is hard to collect data whose features are entirely independent.
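
A small word-count Naïve Bayes classifier illustrates how Bayes' theorem and the independence assumption are used in practice. This is a sketch under stated assumptions, not the paper's implementation: the toy documents are invented, and Laplace smoothing (the `alpha` parameter) is an addition that is one common way to avoid the zero-probability problem mentioned above.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_nb(tokens, class_counts, word_counts, vocabulary, alpha=1.0):
    """Return the class with the highest log posterior (up to the constant P(D))."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior P(h)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Laplace smoothing keeps the likelihood non-zero for unseen words
            likelihood = (word_counts[label][token] + alpha) / (
                total_words + alpha * len(vocabulary)
            )
            score += math.log(likelihood)  # log P(D|h), features assumed independent
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data: label 1 = side effect reported, 0 = no side effect.
docs = [["nausea", "metformin"], ["vomiting", "nausea"], ["fine", "metformin"], ["feeling", "fine"]]
y = [1, 1, 0, 0]
model = train_nb(docs, y)
prediction = predict_nb(["nausea", "dizzy"], *model)
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied; P(D) is dropped because it is the same for every class.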

K-Nearest Neighbor (KNN)
The KNN algorithm plays a crucial part in machine learning. It belongs to the supervised learning area and has numerous applications in intrusion detection, pattern recognition and so on. KNN is applied in real-life situations where non-parametric methods are needed, since it makes no assumptions about the data distribution. The KNN method assigns the data points to classes that are identified by a specific attribute. The central idea of the method is that similar training samples have similar outputs. For each input sample, the nearest training samples are identified and used to assign its class. Consider two samples x and y from the sample population, each with n features; the similarity between them is measured by the distance given below.
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{5}$$
The above equation is the Euclidean distance, which evaluates the similarity between two data points. Each point is assigned the category that most of its nearest neighbors belong to [17]. KNN is an instance-based learner, which means it does not learn anything in the training stage and does not derive any discriminative function from the training data. In particular, there is no training time for this technique: it learns the characteristics of the features from the training dataset at prediction time. Because there is no training phase, its training is faster than that of other algorithms such as SVM and linear regression. The classifier does not require a training step before making predictions, so new data can be incorporated easily without retraining. It requires only two parameters to execute: the value of 'K' and the distance function (e.g. Euclidean, Hamming, Manhattan or Minkowski distance). The classifier provides good results for a small number of input features, but as the number of features increases to high levels it struggles to predict the output for a new data point. One of the main problems is selecting the ideal number of neighbors to consider when classifying a new data entry. The computational expense is quite high because the distance from every new point to all training samples must be calculated.
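
The nearest-neighbor vote can be sketched in a few lines with the Euclidean distance. The toy points and the function `knn_predict` are hypothetical and only illustrate the mechanism.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)  # Euclidean distance to each training point
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented two-feature toy data with two well-separated classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(points, labels, (0.5, 0.5), k=3)
near_far = knn_predict(points, labels, (5.5, 5.5), k=3)
```

Every prediction computes the distance to all training points, which is the computational cost noted above; there is no separate fitting step.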

Support Vector Machines (SVM)
SVM is a supervised learning model designed for binary classification in both linear and nonlinear forms. Datasets are usually not linearly separable, so the main goal of the SVM method is to find the best available surface separating the positive and negative training samples, based on the risk (training and test set error) minimization principle. The method describes a decision boundary with hyperplanes in a high-dimensional feature space. The hyperplane divides the vectorized data into two classes, and decisions are taken based on the support vectors. The working method of SVM can be described as follows. Consider a linearly separable training set of 'N' samples with feature vectors 'x' of dimension 'd'. The dual optimization problem is given by Eq. (6), where α ∈ R^N and y ∈ {1, −1}, and the outcome of the SVM is given by Eq. (7).
In SVM classification, a linearly separable dataset is split by a single hyperplane that divides the two classes of the given feature subset. For nonlinear datasets, kernel functions are used to map the data to a higher-dimensional space in which it is linearly separable [18].
The technique has a regularization parameter and consequently generalizes well, which helps prevent overfitting. It uses the kernel trick, so it can effectively manage nonlinear data. Small changes in the dataset do not significantly disturb the hyperplane, so the model is stable.
Selecting a suitable kernel function (to manage nonlinear data) is not a simple process. A high-dimensional kernel gives rise to many support vectors, which reduces the training speed considerably. For large datasets, the classifier takes a long time to train. It requires feature scaling of the input variables before classification. The algorithmic complexity and the memory requirements are very high: a multi-class SVM needs a lot of memory to store all support vectors, and their number rises sharply with the size of the training dataset.
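
The idea of a separating hyperplane with a regularization parameter can be illustrated with a linear SVM trained by subgradient descent on the hinge loss. This is a substitute technique chosen for brevity: it solves the primal problem rather than the dual optimization described above, uses no kernel, and the toy data, hyperparameters and function names are assumptions.

```python
def train_linear_svm(samples, labels, reg=0.01, learning_rate=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss."""
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Sample lies inside the margin: the hinge term contributes to the update
                w = [wi - learning_rate * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Sample is safely classified: only the regularization term shrinks w
                w = [wi - learning_rate * reg * wi for wi in w]
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Invented, linearly separable toy data with labels in {1, -1}.
X = [(2.0, 2.0), (3.0, 3.0), (2.5, 3.0), (-2.0, -2.0), (-3.0, -2.5), (-2.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [svm_predict(w, b, x) for x in X]
```

The `reg` parameter plays the role of the regularization parameter discussed above: larger values shrink the weights more strongly and widen the margin at the cost of more margin violations.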

IV. EXPERIMENTS, RESULTS AND DISCUSSION
In the proposed system, the Twitter platform was taken as the main source, and the attributes retweet, favorite, isfavorite and the Twitter messages were taken as the inputs. These attributes are the main characteristics of the Twitter platform that are normally used in such studies, and they reflect the logic of message sharing. To identify the side effects of the specified drug (metformin, generic and branded) on individuals, these attributes reflect the usage of the medicine through the messages.
The meaning of some Twitter messages containing the names of generic or branded anti diabetic drugs differed from the perspective of this research. Some messages showed no useful relationship between the drug name and its benefits or side effects, for example "I need metformin" and "@yungliu i need to buy a white metformin but the site won't let me ship to my place?". Hence unrelated Twitter messages were filtered out of the dataset.
The feature extraction process counts the occurrences of all the specified words (side effects of metformin) used for extraction. The words and their occurrences are used for labelling the dataset to train the classifier. The proposed technique was examined on the pre-processed datasets retrieved from Twitter. The dataset has 14 attributes in total, of which 6 are most useful for determining the side-effect level of the diabetic drug (metformin, generic and branded).
In this research, Glowworm Swarm Optimization is used in the feature selection process to select the significant features from the extracted feature dataset. This optimizer is used to improve the classification process. The selected features are then used by the machine learning algorithms for classification.

Evaluation parameters
In this research, the performance of the ML algorithms is assessed using the components of the confusion matrix on a set of testing data. The confusion matrix contains four elements: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). The evaluation metrics precision, recall, F1_score and G_mean [19] are calculated to estimate the performance level of a classifier.
Accuracy is the proportion of correct predictions. It can be determined using the equation below:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
Precision is the ratio of predicted positive examples that really are positive:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{9}$$
Recall, also called hit rate or sensitivity, measures how well a classifier can recognize positive examples:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{10}$$
F1_Score is the harmonic mean of recall and precision:
$$\text{F1\_Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$
True Positive Rate is a measure of the sensitivity of true positive prediction:
$$\text{TPR} = \frac{TP}{TP + FN} \tag{12}$$
Specificity is a measure of the accuracy of true negative prediction:
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{13}$$
False Positive Rate is calculated using:
$$\text{FPR} = \frac{FP}{FP + TN} \tag{14}$$
To describe the performance of the classification algorithms, the confusion matrix has been used in this research. It permits visualization of the performance of an algorithm. It also

Evaluation of accuracy:
The performance of the classification algorithms was evaluated with the Accuracy, Precision, Recall and F1_Score parameters. The following table shows the accuracy obtained by the various ML methods for the classification process. From the above figure, the SVM algorithm showed the highest accuracy rate. When NB with GSO was applied to the testing data of 53 Twitter messages, 31 were correctly classified under Side Effects. The remaining 2 side-effect cases could not be predicted as side effects, so the false negative count is 2. 17 cases were correctly classified under No Side Effects. The remaining 3 'No Side Effects' cases were misclassified, as shown in Fig.9. Overall, 91% of the predictions are correct and the remaining 9% are misclassified.
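
The metrics can be checked against the NB with GSO counts reported above (TP = 31, FN = 2, TN = 17, FP = 3). The sketch below uses the standard confusion-matrix definitions; the G_mean is computed as the square root of recall times specificity, which is the usual definition and is assumed to match the one in [19].

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also the true positive rate
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),
        "g_mean": math.sqrt(recall * specificity),  # assumed standard definition
    }


# Counts reported above for NB with GSO on the 53 test tweets.
nb_gso = classification_metrics(tp=31, fp=3, tn=17, fn=2)
```

For these counts the accuracy is 48/53, which rounds to the 91% stated above.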

V. CONCLUSION
The purpose of this research paper is to survey the potential of three ML algorithms, NB, SVM and KNN, to classify the side-effect level of the antidiabetic drug metformin (generic and branded) through Twitter messages. In this research, the GSO method is combined with each ML algorithm to select the optimal features for the classification process. The efficiency of each classification method is assessed in terms of accuracy, F1_score, precision and recall. The exploratory results show that the SVM classifier is highly effective and encouraging. The purpose of this research is pharmacovigilance. SVM with GSO shows 94% accuracy in prediction, with 33 TP cases and 17 TN cases. The 17 TN patients can safely continue metformin. For the 33 TP patients, the severity of the side effect is to be assessed at the physician's clinic; if the side effect is mild, the same medication can be continued with regular follow-up. Patients with FP results can be reassured, and patients with FN results need to attend the physician's clinic for assessment. This work still has some drawbacks: SVM takes a prolonged training time on our Twitter dataset, its memory requirements are high, and it requires feature scaling of the input variables before classification. These obstacles need to be addressed in future work to improve Twitter opinion classification.