Analysis of Public Health Concerns using Two-step Sentiment Classification

Pondora Naresh Behera; Suneetha Eluri

doi:10.17577/IJERTV4IS090641

Volume 04, Issue 09 (September 2015)


Open Access
Paper ID : IJERTV4IS090641
Published (First Online): 24-09-2015
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Naresh Behera

Department of Computer Science, JNT University Kakinada, Kakinada, India - 533003

Mrs. Suneetha Eluri

Assistant Professor, Department of Computer Science, JNT University Kakinada, Kakinada, India - 533003

Abstract: Our aim is to develop a sentiment analysis tool that lets public health officials monitor the spread of epidemics in a given region and time period. Analyzing public concerns and emotions about health-related matters is important for understanding how a disease is spreading. This work focuses on sentiment classification of Twitter messages to measure the Degree of Concern (DOC) of people about a spreading disease. To achieve this, disease-related tweets are extracted based on time and geographical location. A novel two-step sentiment classification is then applied to identify personal negative tweets. First, a clue-based algorithm separates personal tweets from non-personal tweets using subjectivity clues. Next, a lexicon-based algorithm and a Naïve Bayes classifier are applied to separate negative from non-negative personal tweets. The personal negative tweets are used to measure the Degree of Concern. A Public Health Surveillance System (PHSS) is also developed, using visualization techniques such as maps, graphs and charts to display the DOC of epidemic-related Twitter data. The visual concern graphs and charts can help health specialists monitor the progression and peaks of public health concern about a disease in a particular place and time, so that public health officials can take the necessary preventive actions. Negation handling and Laplacian smoothing are used with the lexicon-based and Naïve Bayes classifiers to improve performance.

Keywords: Degree of concern; Disease Spreading; Public Health; Polarity; Sentiment Analysis; Social Network; Twitter
1. INTRODUCTION

Sentiment Analysis

Whenever we need to make a decision, we want to know the views, opinions and advice of others. This holds for both individuals and organizations. With the exponential growth of social network content on the internet, people's views and opinions can be extracted easily. Users not only consume the resources available on the web but also give feedback, which generates additional useful information. Sentiment analysis originated as a way to evaluate and analyze this huge amount of information.

Sentiment Analysis, or Opinion Mining [5], is a task that extracts information from social networks and uses Natural Language Processing techniques to identify users' opinions, views and emotions as positive, negative or neutral. Sentiment analysis on social media is widely used in areas [6] such as marketing, business, election prediction, education, medicine and communication. A more recent challenge is sentiment analysis of health-related data, for tasks such as public health surveillance, disease ontology, health maps, epidemic spread and disease detection.
Monitoring Public Health Concern

Monitoring public health and the spread of disease, and controlling it, are important responsibilities of public health officials. They analyze public opinions, emotions and concerns about health-related matters when there is an indication of a sudden disease outbreak. Different monitoring strategies have been developed to analyze public health. These include household surveys, laboratory-based surveillance, sentinel surveillance systems, and the more recent IDSR (Integrated Disease Surveillance and Response) [7]. Alongside these strategies, people's emotional changes due to a sudden disease outbreak have attracted increasing attention from health officials. X. Zhu et al. [16] analyzed the mental state of people in China during the SARS outbreak (2003). According to their analysis, 94.6% of the people surveyed during the outbreak reported emotional changes. Among them, 54.8% reported panic, 34.0% nervousness and 7.6% fear, and 23.3% admitted to irrational behaviors such as seeking shelter or going on a shopping spree. Thus, it is critical for public health officials and government decision makers to monitor health issues. However, it is hard to monitor public health and emotional changes using traditional surveillance systems. Existing methods such as questionnaires and clinical tests are very slow and can cover only a limited number of people. A new system must be developed to supplement the existing ones. Such a tool must track real-time statistics of public emotions related to different health issues, to provide early warning and to help public health officials and government decision makers take the necessary preventive actions.

Online sources such as Google News, blogs, search engines, Twitter and Facebook offer abundant resources for monitoring threats to public health. Twitter, a micro-blogging service, has several advantages over the others for disease surveillance. It has more than 500 million users, who post more than 400 million tweets per day. It is up to date, and most tweets are public. Each message has a fixed maximum length of 140 characters. The Twitter API [15] makes it possible to extract tweets along with related information such as geographic location, time and hyperlinks.

In this work, Twitter is used as the primary resource for extracting public opinions on health matters. It helps government decision makers and public health officials gauge the degree of concern (DOC) expressed in the tweets of Twitter users who are affected by a disease. Early detection of public health concerns can help health officials take timely decisions to counter rumors and thus prevent potential social crises. To calculate the DOC of user tweets, a classification technique is developed to analyze the sentiment of disease-related tweets. The technique involves two steps. First, it uses subjectivity clues to separate personal tweets from news (non-personal) tweets. Personal tweets are posted by individual users, whereas non-personal or news tweets are released by online media and possibly re-tweeted by Twitter users. In the second step, a lexicon-based classifier and a Naïve Bayes classifier separate personal negative tweets from personal non-negative tweets. Finally, the Public Health Surveillance System (PHSS), a visualization system with graphs and charts, presents the DOC to public health officials.

2. RELATED WORK

A. Sentiment Classification using Twitter

In sentiment analysis, B. Pang et al. trained an algorithm to classify the sentiments of online movie reviews. Pandey and Iyer [12] showed that domain-specific features are more significant than the common text features used in traditional information retrieval tasks. Barbosa and Feng [13] focused on automatically generating the training data. They used three sources, Tweet Feel, Twitter Sentiment and Twendz, to label the sentiments of tweets. Yu et al. [14] reported the Naïve Bayes classifier as the best in terms of precision and recall when applied to sentiment classification of news articles.

B. Monitoring of Disease Spreading

Ginsberg et al. [1] used search engines to analyze the sentiments of users based on their queries. Aramaki et al. [4] used different machine learning methods to classify epidemic-related tweets into two classes (positive or negative). Collier et al. developed a model that automatically classifies Twitter messages into six fixed syndromic categories, such as Respiratory and Gastrointestinal. Signorini et al. analysed H1N1-related tweets using an SVM-based estimator and estimated the influenza-like illness (ILI) rate one to two weeks before the official announcement. Using online news, Brownstein et al. [20] developed the HealthMap system, which collects reports from Google News, classifies them into disease-related and unrelated reports, and filters the disease-related news into warnings, breaking news and old news. Similarly, Culotta [3] correlated user tweets with CDC statistics using a number of regression models and, using a large number of Twitter messages, provided a relatively simple method to track the ILI rate. Lampos et al. [18] used a method that computes flu scores from a set of markers and obtained a high association with the HPA flu score, the UK equivalent of the CDC score. Salathé and Khandelwal [19] analysed the reaction of Twitter users towards the H1N1 vaccine using sentiment analysis. They categorised user tweets into four categories: positive, negative, neutral, and irrelevant. They used the relative difference of positive and negative messages to calculate the H1N1 vaccine sentiment score.

3. IMPLEMENTATION ISSUES

A. System Architecture

Fig: System Architecture

B. Tweet Extraction

Using Twitter OAuth authentication [15], disease-related tweets are extracted based on time and geographical location (latitude and longitude), using keywords related to the diseases. The disease keyword, the time and the geo-code are passed as parameters to extract the tweets. The geo-code consists of latitude, longitude and radius (in miles or kilometres), in that order. For example, to extract tweets from India, whose centre lies near latitude 21 and longitude 78, the geo-code would be 21,78,3000mi or 21,78,3700km. The latitude and longitude may each be positive or negative.
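As a hedged illustration of the geo-code parameter, the sketch below assembles a `latitude,longitude,radius` string and collects it with the keyword and date into one dictionary. It assumes latitude comes first, with India's approximate centre at latitude 21 and longitude 78. The function names and the `q`, `until` and `geocode` keys are assumptions made for this example, and no request is sent to any service.

```python
def build_geocode(latitude, longitude, radius, unit="mi"):
    """Return a 'latitude,longitude,radius' string such as '21,78,3000mi'."""
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    if unit not in ("mi", "km"):
        raise ValueError("unit must be 'mi' or 'km'")
    return f"{latitude},{longitude},{radius}{unit}"


def build_search_params(keyword, until_date, latitude, longitude, radius, unit="mi"):
    """Collect keyword, time and geo-code in one dictionary (hypothetical layout)."""
    return {
        "q": keyword,
        "until": until_date,
        "geocode": build_geocode(latitude, longitude, radius, unit),
    }


# Example values only: a malaria query centred on India.
params = build_search_params("malaria", "2015-08-31", 21, 78, 3000, "mi")
print(params["geocode"])
```

Validating the ranges up front catches a swapped latitude and longitude whenever the longitude value exceeds 90 in magnitude.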
C. Tweet Preprocessing

The following preprocessing steps are applied to the extracted tweets:

- Remove special symbols and digits: digits, punctuation marks and symbols such as !, @, #, $, %, ^, +, -, *, / are removed.
- Remove URLs: the URLs in tweets are removed.
- Remove duplicate tweets: tweets that start with RT and repeated tweets are removed.
- Remove stop words: stop words such as a, an, the, if, on, of are removed.
- Normalize elongated words: some characters are repeated multiple times in a word, for example goooood typed for good or suuuuppppeeerrr typed for super. Such words are normalized.
- Replace emoticons: emoticons such as :( , :) and :D are replaced by the corresponding emotion words, based on the logical meaning of the emoticon.
    
TABLE-I: LIST OF EMOTICONS

negative	neutral	positive
🙁	😐	🙂
: (		: )
🙁		🙂
(		😀
;(		:p
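The preprocessing steps listed above can be sketched as follows. This is a minimal sketch under stated assumptions: the stop-word set, the emoticon-to-word mapping and the function name `preprocess` are illustrative stand-ins, not the paper's actual resources. Elongated words are handled by squeezing runs of three or more repeated characters down to two, which recovers "good" from "goooood" but would not fully restore a word such as "suuuuppppeeerrr".

```python
import re

# Illustrative (assumed) resources; the real stop-word list and emoticon table are larger.
STOP_WORDS = {"a", "an", "the", "if", "on", "of"}
EMOTICONS = {":(": "sad", ";(": "sad", ":)": "happy", ":D": "happy", ":p": "playful"}


def preprocess(tweets):
    """Apply the listed cleaning steps and return one token list per kept tweet."""
    seen, cleaned = set(), []
    for text in tweets:
        if text.startswith("RT"):                      # drop retweets
            continue
        text = re.sub(r"https?://\S+", " ", text)      # remove URLs
        for emoticon, word in EMOTICONS.items():       # replace emoticons before stripping symbols
            text = text.replace(emoticon, f" {word} ")
        text = re.sub(r"[^A-Za-z\s]", " ", text).lower()   # remove digits, punctuation, symbols
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)         # squeeze runs of 3+ repeated characters to 2
        tokens = [t for t in text.split() if t not in STOP_WORDS]
        key = " ".join(tokens)
        if key and key not in seen:                    # drop repeated tweets
            seen.add(key)
            cleaned.append(tokens)
    return cleaned
```

Emoticons are replaced before symbols are stripped, because the symbol filter would otherwise delete them.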
D. Tweet Classification

A novel two-step sentiment analysis technique [2] is used:

Step 1: A clue-based algorithm separates personal from non-personal tweets using subjectivity clues [10].

Step 2: A lexicon-based algorithm and a Naïve Bayes algorithm are applied to the personal tweets to classify them as positive, negative or neutral.

Negation handling and Laplacian smoothing are also used to improve classification accuracy.
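A minimal sketch of the two steps is given here, assuming tiny made-up word lists in place of the subjectivity lexicon [10] and the polarity dictionary. The names `is_personal`, `lexicon_polarity` and `classify`, and the threshold value, are illustrative and not taken from the paper. Only the lexicon-based variant of Step 2 is shown.

```python
# Tiny stand-in word lists (assumptions); the real lexicons contain thousands of entries.
SUBJECTIVE_CLUES = {"i", "my", "feel", "afraid", "hope", "worried", "hate"}
POSITIVE_WORDS = {"good", "better", "hope", "recovered"}
NEGATIVE_WORDS = {"afraid", "worried", "hate", "sick", "fever"}


def is_personal(tokens, threshold=1):
    """Step 1: a tweet is personal if its subjective-clue count exceeds the threshold."""
    clue_count = sum(1 for t in tokens if t in SUBJECTIVE_CLUES)
    return clue_count > threshold


def lexicon_polarity(tokens):
    """Step 2 (lexicon variant): positive hits minus negative hits, mapped to a label."""
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def classify(tokens):
    """Run both steps; non-personal tweets are not given a polarity."""
    if not is_personal(tokens):
        return "non-personal"
    return lexicon_polarity(tokens)
```

Keeping the personal filter separate from the polarity step means news tweets never contribute to the negative count used for the Degree of Concern.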
E. Calculating Degree of Concern (DOC)

The personal negative tweets are used to measure the Degree of Concern, DOC[d, t], for a particular disease d and a particular time t:

$$\mathrm{DOC}[d,t] = \frac{NN^{2}}{PN} \qquad (1)$$

where

- d is a particular disease
- t is a particular time
- NN is the number of negative personal tweets
- PN is the number of personal tweets

F. Visualization

The experiment is done with three diseases: Malaria, Cancer and Swine Flu. Visual concern graphs and charts are used to visualize the Degree of Concern for these three diseases.

4. ALGORITHMS

A. Clue based classifier for Personal Tweets

The clue-based classifier divides each tweet into a set of words and matches them against a corpus of personal clues. For personal versus non-personal classification, a subjectivity corpus is used: if there are enough subjective clues in the tweet, it is regarded as a personal tweet; otherwise it is a news tweet. The corpus from the literature [9] contains 8,221 words, of which 5,569 are strongly subjective clues and 2,652 are weakly subjective clues.

We counted the number of strongly subjective [9] terms and the number of weakly subjective terms in each tweet and experimented with different thresholds. A tweet is classified as personal if its count of subjective words exceeds the chosen threshold; otherwise it is classified as a non-personal tweet.

B. Twitter Sentiment Classifiers

The lexicon-based algorithm and the Naïve Bayes algorithm are used to classify the polarity [14] of tweets as positive, negative or neutral. Negation handling and Laplacian smoothing are used to improve the accuracy of the classifiers.

Lexicon-Based Classifier

- Step 1: Divide a message M into words, M = {w1, w2, w3, ..., wn}.
- Step 2: For each word wi, look it up in the dictionary of positive (+ve) and negative (-ve) words and return its +ve polarity and -ve polarity.
- Step 3: Calculate the overall polarity of the word = sum(+ve polarity) - sum(-ve polarity).
- Step 4: Repeat steps 2 and 3 until the end of the words.
- Step 5: Add the polarities of all words of the message to obtain the total polarity of the message.
- Step 6: Based on the total polarity, label the message as positive, negative or neutral.
- Step 7: Repeat from step 1 until no messages remain.

Naïve Bayes Classifier

The probability of a word x_i belonging to a particular class c_j is given by:

$$P(x_i \mid c_j) = \frac{\text{count of } x_i \text{ in messages of class } c_j}{\text{total number of words in messages of class } c_j} \qquad (2)$$

According to Bayes' rule, the probability of a particular tweet d belonging to a class c_j is given by:

$$P(c_j \mid d) = \frac{P(d \mid c_j)\, P(c_j)}{P(d)} \qquad (3)$$

where

- P(c_j | d) = probability of instance d being in class c_j
- P(d | c_j) = probability of generating instance d given class c_j
- P(c_j) = probability of occurrence of class c_j
- P(d) = probability of instance d occurring

Treating the words x_i of the tweet as independent given the class, and dropping the denominator P(d), which is the same for every class:

$$P(c_j \mid d) \propto \Big( \prod_{i} P(x_i \mid c_j) \Big)\, P(c_j) \qquad (4)$$

Laplacian Smoothing

If the classifier encounters a word that has not been seen in the training set, the estimated probability becomes zero for both classes and there is nothing left to compare. This problem can be solved by Laplacian smoothing:

$$P(x_i \mid c_j) = \frac{\mathrm{Count}(x_i) + k}{(k + 1) \cdot (\text{number of words in class } c_j)} \qquad (5)$$

Usually, k is chosen to be 1.
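The Naïve Bayes scoring with smoothing can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `NaiveBayes`, its methods and the toy training data are assumptions. It applies the smoothed estimate of Equation (5) as written, (Count + k) / ((k + 1) × words in class), rather than the more common add-k form that uses the vocabulary size in the denominator. Scores are summed in log space to avoid underflow.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self, k=1):
        self.k = k
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.total_words = Counter()             # class -> number of words seen
        self.class_docs = Counter()              # class -> number of training tweets

    def fit(self, tweets, labels):
        for tokens, label in zip(tweets, labels):
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.total_words[label] += len(tokens)
        return self

    def word_prob(self, word, label):
        # Equation (5): (Count(x_i) + k) / ((k + 1) * number of words in class c_j)
        count = self.word_counts[label][word]
        return (count + self.k) / ((self.k + 1) * self.total_words[label])

    def predict(self, tokens):
        n_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label in self.class_docs:
            # Log of Equation (4): log P(c_j) + sum over words of log P(x_i | c_j)
            score = math.log(self.class_docs[label] / n_docs)
            score += sum(math.log(self.word_prob(t, label)) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
train_tweets = [
    ["i", "hate", "this", "fever"],
    ["so", "sick", "and", "worried"],
    ["feeling", "better", "today"],
    ["good", "news", "recovered"],
]
train_labels = ["negative", "negative", "non-negative", "non-negative"]
model = NaiveBayes(k=1).fit(train_tweets, train_labels)
print(model.predict(["sick", "fever"]))
```

With k = 1, a word never seen in a class still receives a non-zero probability, so one unseen word no longer forces the whole product in Equation (4) to zero.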
Negation Handling

Algorithm:

Negated := False
For each word in document:
    If Negated = True:
        Transform word to "not_" + word
    If word is "not" or "n't":
        Negated := True
    If a punctuation mark is encountered:
        Negated := False
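A runnable sketch of this negation-handling idea follows, under a few assumptions: tokens arrive as a list with punctuation marks as separate tokens, a token ending in "n't" (such as "don't") counts as a negation cue, and the flag is set to true after a cue. The function names are illustrative.

```python
import string


def is_punctuation(token):
    """True when the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def handle_negation(tokens):
    """Prefix words that follow a negation cue with 'not_' until punctuation is seen."""
    negated = False
    result = []
    for token in tokens:
        if is_punctuation(token):
            negated = False            # punctuation closes the negation scope
            result.append(token)
            continue
        result.append("not_" + token if negated else token)
        if token == "not" or token.endswith("n't"):
            negated = True             # words after the cue are marked as negated
    return result


print(handle_negation(["i", "do", "not", "feel", "good", ".", "today", "is", "fine"]))
```

The transformed tokens, such as `not_good`, are then treated as separate features from `good` by the classifiers.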
5. RESULTS

1. Classification of Disease related Tweets

Tweets are extracted using keywords for the major diseases malaria, cancer and swine flu, and are then preprocessed. Sentiment analysis is then used to find the sentiment of each tweet for every disease.
      
      Fig-1: polarity and emotion classification of cancer related tweets
      
The total number of tweets extracted is 1500 for malaria, 1500 for cancer and 1498 for swine flu. Their polarities are: malaria (positive = 800, negative = 470, neutral = 230), cancer (positive = 300, negative = 850, neutral = 350) and swine flu (positive = 630, negative = 770, neutral = 100).

The Degree of Concern and the counts of disease-related tweets (malaria, cancer and swine flu) with their polarities are shown below:

Fig-3: Number of total, negative, positive and neutral tweets for the diseases cancer, swine flu and malaria
2. Month-wise Degree of Concern

Disease-related tweets from March 2015 to August 2015 are extracted and analyzed by calculating the DOC for every month.

Fig-4: Month-wise Degree of Concern for the diseases
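A month-wise DOC computation could look like the sketch below. It assumes that Equation (1) is DOC = NN² / PN and that each personal tweet has already been labelled and tagged with its month. The function names, the input layout and the sample rows are invented for illustration and are not the paper's data.

```python
from collections import defaultdict


def degree_of_concern(negative_personal, personal):
    """Equation (1): DOC = NN^2 / PN; returns 0.0 when there are no personal tweets."""
    if personal == 0:
        return 0.0
    return negative_personal ** 2 / personal


def monthly_doc(labelled_tweets):
    """labelled_tweets: iterable of (month, label) pairs for the personal tweets of one disease."""
    counts = defaultdict(lambda: {"NN": 0, "PN": 0})
    for month, label in labelled_tweets:
        counts[month]["PN"] += 1           # every row is a personal tweet
        if label == "negative":
            counts[month]["NN"] += 1       # negative personal tweet
    return {month: degree_of_concern(c["NN"], c["PN"]) for month, c in counts.items()}


# Made-up sample rows, for illustration only.
sample = [
    ("2015-03", "negative"), ("2015-03", "negative"),
    ("2015-03", "neutral"), ("2015-03", "positive"),
    ("2015-04", "negative"), ("2015-04", "positive"),
]
print(monthly_doc(sample))
```

Because NN is squared, a month in which negative personal tweets dominate scores disproportionately higher than a month with the same share of negative tweets but fewer tweets overall.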
3. Comparison of Various Techniques Used

Precision and accuracy are computed to compare the performance of the various algorithms:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (6)$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (7)$$

where TP, FP, FN and TN are the numbers of true positives, false positives, false negatives and true negatives.
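As a hedged illustration, precision and accuracy can be computed from predicted and manually assigned labels as below. The sketch uses the standard definitions, precision = TP / (TP + FP) and accuracy = (TP + TN) / (TP + FP + FN + TN). The function names, the choice of "negative" as the target class and the sample labels are assumptions for this example.

```python
def confusion_counts(predicted, actual, target="negative"):
    """Count TP, FP, FN, TN for one target class."""
    tp = fp = fn = tn = 0
    for p, a in zip(predicted, actual):
        if p == target and a == target:
            tp += 1
        elif p == target:
            fp += 1
        elif a == target:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Equation (6); returns 0.0 when nothing was predicted as the target class."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp, fp, fn, tn):
    """Equation (7); returns 0.0 for an empty evaluation set."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Made-up labels, for illustration only.
predicted = ["negative", "negative", "neutral", "negative", "neutral"]
manual = ["negative", "neutral", "neutral", "negative", "negative"]
tp, fp, fn, tn = confusion_counts(predicted, manual)
print(precision(tp, fp), accuracy(tp, fp, fn, tn))
```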
    
Fig-2: Degree of concern of different diseases

The tweets classified by the sentiment classifiers are compared with manually classified tweets to calculate precision and accuracy. The simple Naïve Bayes classifier achieves a precision of 85.66% and an accuracy of 86.29%; the lexicon-based classifier achieves a precision of 74.54% and an accuracy of 76.22%. Negation handling is then combined with both classifiers, which improves performance, as shown in the figure below.

Fig-5: Performance measures of the sentiment classifiers
6. CONCLUSION AND FUTURE WORK

This work presents a tweet classification approach that identifies negative sentiment in personal health tweets in order to measure the degree of concern (DOC) and monitor public sentiment about a disease. Charts, tables and graphs are developed to visualize the DOC. A higher DOC for a disease in a particular location and period is taken to indicate that the disease is spreading more in that region, so that public health officials can take preventive action.

The number of disease events monitored can be extended by implementing a disease ontology. The symptoms of various diseases can also be used to detect and predict disease. The death toll due to the spread of diseases can be analyzed. In addition to Twitter, input can be extracted from Facebook, personal blogs, news forums and similar sources.

REFERENCES

[1] Ginsberg, M. H. Mohebbi, M. S. Smolinski, and L. Brilliant, Detecting influenza epidemics using search engine query data, Nature 457, 2009, pp. 1012-1014.
[2] X. Ji, S. A. Chun, and J. Geller, Monitoring Public Health Concerns Using Twitter Sentiment Classifications, In Proceedings of International Conference on Health Informatics, Philadelphia, PA, 2013.
[3] A. Culotta, Towards detecting influenza epidemics by analyzing Twitter messages, 1st Workshop on Social Media Analytics (SOMA 10), Washington, DC, USA, 2010.
[4] E. Aramaki, S. Maskawa, and M. Morita, Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011.
[5] Jayashri Khairnar and Mayura Kinikar, Machine Learning Algorithms for Opinion Mining and Sentiment Classification, International Journal of Scientific and Research Publications, Volume 3, Issue 6, June 2013, ISSN 2250-3153.
[6] Yongin, South Korea; Khattak, A. M.; Sungyoung Lee; Maqbool, J., Precise Tweet Classification and Sentiment Analysis, Published in:
[7] Disease Control Priorities Project, http://www.dcp2.org/file/153/dcpp-surveillance.pdf, accessed on 02/15/2013.
[8] Stopwords, http://web.njit.edu/~xj25/eosds_beta/files/newsstopword.xlsx
[9] E. Riloff and J. Wiebe, Learning Extraction Patterns for Subjective Expressions, In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03), 2003.
[10] Subjectivity lexicon, http://www.cs.pitt.edu/mpqa/, accessed on 7/15/2012.
[11] B. Pang and L. Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval, Vol. 2, No. 1-2, pp. 1-135, 2008.
[12] V. Pandey and C.V.K. Iyer, Sentiment Analysis of Microblogs, Technical Report, Stanford University, 2009.
[13] L. Barbosa and J. Feng, Robust Sentiment Detection on Twitter from Biased and Noisy Data, In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, 2010.
[14] H. Yu and V. Hatzivassiloglou, Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences, In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 2003.
[15] Twitter developers documentation, https://apps.twitter.com/docs, accessed on 2/15/2015.
[16] X. Zhu, S. Wu, D. Miao, and Y. Li, Changes in Emotion of The Chinese Public In Regard to The SARS Period, Social Behavior & Personality, Vol. 36, Issue 4, pp. 447, 2008.
[17] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up? Sentiment classification using machine learning techniques, In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 79-86, July 2002.
[18] Vasileios Lampos and Nello Cristianini, Tracking the flu pandemic by monitoring the Social Web, 2nd International Workshop on Cognitive Information Processing (CIP), 2010.
[19] M. Salathé and S. Khandelwal, Assessing Vaccination Sentiments with Online Social Media: Implications for Infectious Disease Dynamics and Control, PLoS Comput Biol 7(10): e1002199, doi:10.1371/journal.pcbi.1002199, 2011.
[20] J. S. Brownstein, C. C. Freifeld, B. Y. Reis, and K. D. Mandl, Surveillance Sans Frontières: Internet-Based Emerging Infectious Disease Intelligence and the HealthMap Project, PLoS Med 5(7): e151, doi:10.1371/journal.pmed.0050151, 2008.

