Detecting Rumors from Blogging Sites using Recurrent Neural Networks

DOI : 10.17577/IJERTV10IS030322

Gostu Venkata Lakshmi Deepthi, Tumati Naga Sai Mahitha, Karamcheti Tirumala Monika, Anusha Sanampudi

Assistant Professor, Department of Computer Science and Engineering; Final Year Students, Department of Computer Science and Engineering

R. M. K. Engineering College, R.S.M Nagar, Kavaraipettai-601 206

Abstract: Nowadays, false claims and rumors spread widely through social media, which may affect people's perspectives and may also cause harm. These rumors grow especially fast on the Twitter platform, so it is essential to classify whether posted news describes a true incident or is just a rumor. In our proposed model we use machine learning and deep learning techniques to classify whether posted news is a rumor or not. We use Naive Bayes, Support Vector Machine (SVM) and Logistic Regression models from machine learning and a Neural Network from deep learning. The classification is performed on the stance expressed by the tweet. The stances are divided into four categories: supporting, deny, query and commenting. We use these categories for classification with the mentioned machine learning and deep learning techniques and report the accuracy of each model. Finally, we compare the accuracies of the models to determine which can be used further for rumor detection.

Keywords: Rumor classification, supporting, deny, query, commenting, Machine Learning, Deep Learning, Naive Bayes, Support Vector Machine (SVM), Logistic Regression, Neural Network

  1. INTRODUCTION

    Twitter is one of the social media services most widely used for communication. People use it in different ways: some use it for entertainment, and others use it for news or to provide information. Thus Twitter is regarded mainly as a communication channel and a main source of information.

    People have been using the Twitter platform mainly to spread information about disasters happening around the world, and this has continued for several years. Such disasters may include an earthquake, an attack, etc. Telecommunications may be poor or suffer an outage during a disaster, so giving information or communicating may not be possible.

    A social network is considered a strong network that can pass information within a very short span of time, and it remains helpful because it works over the internet even in the absence of telecommunications. Twitter is considered a robustly used social network platform that passes news and information very quickly.

    However, we cannot expect every piece of news or information to be positive or correct. It is not beneficial every time and can lead to negative information and rumors. As Twitter is used very widely, the spreading of rumors has also risen considerably. These rumors can confuse people, which matters most in times of emergency. All of this can decrease trust in social media and in the news or information posted there, so that it cannot help in times of emergency need.

    This problem motivated us to develop a method that can detect whether information is a rumor or not. As machine learning and deep learning are widely used and can give accurate predictions, we opted for these techniques. Our proposed work uses machine learning and deep learning techniques to classify information or news as a rumor or not.

    Machine learning is considered a subfield of Artificial Intelligence. It is the process of designing and developing algorithms with which a computer can produce outputs for the given inputs or data on its own. Machine learning algorithms are mainly classified into three types: 1. Supervised, 2. Unsupervised and 3. Reinforcement Learning. In our project we use Supervised Learning and Unsupervised Learning algorithms.

    Supervised learning is a machine learning technique used for classification problems on labeled data. In our proposed work we mainly use Naive Bayes, Support Vector Machine (SVM) and Logistic Regression from machine learning and a Neural Network from deep learning. These models are used to predict whether information is a rumor or not by using the stances, which are divided into four categories, and the accuracy of each model is measured.

    The rest of this paper is organized as follows. Section 1 gave the introduction. Related work is discussed in section 2, section 3 describes our proposed methodology, section 4 provides the results of our work, and section 5 concludes the paper.

  2. RELATED WORKS

    The work in [2] gives details about how Twitter is used in times of disaster. Twitter is a popular platform that people use to pass information about natural disasters that take place. In the absence of telecommunications, the internet is the best source for forwarding information to other people. Passing news or information over the internet requires a platform; social media can carry the news, and Twitter is the social platform most often used to pass such information. The authors describe a network security extension for Twitter named Twimight. This is an open-source Twitter client that Android phones support; its disaster mode is a built-in feature that the user can enable. With this feature, a person can store information about the disaster on their phone even if they are not able to tweet, and can exchange that information with another user who is within Bluetooth range. The information is thus shared with another person, who can then tweet it.

    The authors in [5] examine rumor information that was passed through the Twitter platform. In 2011, a post about an earthquake in Japan described a disaster but was later shown to be wrong, merely a rumor or misinformation. It was identified as a rumor after one week by checking the tweets posted by people in Japan. In their work they present a stochastic agent-based model that observes all the tweets and, in particular, checks the rumored tweets. From this they estimate the number of people affected by the false information provided on Twitter and the number of people who still believe the rumor without knowing that it is false, which is hard to observe manually. The proposed model is combined with real data in different scenarios to estimate the rate of people who believe the false information and are affected by it.

    In [8] a named entity recognition model was created for recognizing particular text taken from a dataset. In this recognition process, rules are acquired from unlabeled data with the help of the proposed model. The obtained rules are treated as features that are then used by the machine learning model. Along with this, word information is also acquired from the data, including the classes of a particular word and the classes of the words that co-occur with it. A dataset containing a billion words was used for this model. According to the authors, the proposed model gives high accuracy in recognizing the words.

    The study in [10] is about retweet messages used to stop misinformation during emergency situations on Twitter. Social media such as Facebook, Twitter and YouTube are used very widely to report an emergency situation or a disaster. Along with providing huge advantages, these online media also spread wrong information, which may cause problems. On Twitter, retweets can amplify a rumor. In this work a survey is conducted on retweets, in which three main factors are considered. The first is the intention behind retweeting, which may be either to support the tweet or to oppose the rumor. The second is the use of a Twitter function to mark the information as a favorite, and the last is searching for more content about the message that was retweeted. From this we can understand user behavior in information diffusion, which can help in reducing the spread of misinformation.

    In [11] an algorithm is proposed to detect events by monitoring tweets. Nowadays, Twitter has received a lot of attention as a popular social media platform that is helpful for detecting emergency events and providing help in time. The algorithm monitors the tweets that are posted and finds the particular target event that needs help, in emergency locations or places where disasters occurred. To detect a target event, a classifier is used that operates on several features of a tweet: the keywords of the tweeted data, the number of words present, and the context of the tweet. After extracting the features, the model is applied to the target event to find out the event location.

    Author & Year | Proposed | Finding/Outcomes
    Hossmann, Carta, Schatzmann, Legendre, Gunningberg, 2011 | Twitter in disaster mode: security architecture, in Proceedings of the Special Workshop on Internet and Disasters | Gives details about how Twitter is used in times of disaster
    Shirai, Sakaki, Toriumi, Shinoda, Kazama, Noda, Numao, and S. Kurihara, 2012 | Estimation of false rumor diffusion model and estimation of prevention model of false rumor diffusion on twitter | Examines rumor information that was passed through the Twitter platform; a model was implemented to predict the rumor
    T. Iwakura, 2011 | A named entity recognition method using rules acquired from unlabeled data | A named entity model was created for recognizing particular text taken from a dataset
    Umejima, M. Miyabe, Aramaki, and Nadamoto, 2011 | Tendency of rumor and correction re-tweet on the twitter during disasters | Studies retweet messages used to stop misinformation during emergency situations on Twitter
    Sakaki, Okazaki, and Y. Matsuo, 2010 | Earthquake shakes twitter users: real-time event detection by social sensors | An algorithm is proposed to detect events by monitoring tweets

    Table 1: Related Works Summary

  3. METHODOLOGY

    The procedure to develop our system is clearly described in this section.

    • For this process we collect a Twitter dataset consisting of rumor threads.
    • The classification is performed on the stance expressed by the tweet. The stances are divided into four categories: supporting, deny, query and commenting.
    • Once the stance category is selected, the selected data is preprocessed: the data is retrieved and cleaned, and a pickle file is created to store the retrieved and cleaned data.
    • Once preprocessing is completed, we tokenize the text using the NLTK word tokenizer.
    • After tokenizing the data, we extract features using a word2vec model, which represents the words as vectors.
    • Once the conversion is completed, we count the negative words that are present, such as no, never, etc.
    • After this we check whether punctuation is used.
    • In the same way, we also check for question marks and exclamation marks.
    • Along with this, we check for any URLs or hashtags that are used and for mentions of any user in the tweet.
    • We also check for any attached media, such as an image or video.
    • In addition to all of these, we analyze the sentiment that may be present, storing it as 1 for positive and 0 for negative.
    • We also look for swear words or bad words included in the preprocessed data.
    • Once the features are extracted, we use machine learning and deep learning algorithms as classifiers to distinguish rumored information from actual information.
    • For this we use Naive Bayes, Support Vector Machine (SVM) and Logistic Regression models from machine learning and a Neural Network from deep learning.
    • Once classification is completed, we compare all the models used and check their prediction accuracies.
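
    The surface-level features listed above (negative words, punctuation, question and exclamation marks, URLs, hashtags, user mentions and swear words) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name extract_surface_features, the regular expressions and the two word lists are assumptions made for this example, since the paper does not give its own lists.

```python
import re
import string

# Hypothetical word lists for illustration; the paper does not publish its own.
NEGATION_WORDS = {"no", "not", "never", "none", "nobody", "nothing"}
SWEAR_WORDS = {"damn", "hell"}

# Simple patterns (assumptions) for URLs, hashtags and user mentions.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")
MENTION_PATTERN = re.compile(r"@\w+")


def extract_surface_features(tweet):
    """Return a dict of simple count and presence features for one tweet."""
    # Remove URLs first so that '?' or '/' inside a link is not counted.
    text = URL_PATTERN.sub(" ", tweet)
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "negation_count": sum(w in NEGATION_WORDS for w in words),
        "swear_count": sum(w in SWEAR_WORDS for w in words),
        # Any ASCII punctuation character; note this also counts '#' and '@'.
        "has_punctuation": any(ch in string.punctuation for ch in text),
        "has_question_mark": "?" in text,
        "has_exclamation_mark": "!" in text,
        "has_url": bool(URL_PATTERN.search(tweet)),
        "has_hashtag": bool(HASHTAG_PATTERN.search(text)),
        "has_mention": bool(MENTION_PATTERN.search(text)),
    }


example = "Is this true? No way! http://example.com #quake @user"
features = extract_surface_features(example)
```

    Each tweet then becomes a small feature record that can be combined with the word2vec vectors before being passed to a classifier.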

    Naive Bayes

    Naive Bayes is a supervised machine learning algorithm. It is based on Bayes' theorem and is mainly used for classification problems. It is commonly used for classifying text datasets and can handle large datasets. It is considered one of the simplest and most effective classification algorithms, and it produces predictions quickly. Predictions are made on the basis of the probability of an object belonging to a class. Examples of its use include text classification, sentiment analysis and spam filtering of e-mail.

    It is called naive because each feature is considered independently, without depending on the other features of the object. For example, a mango is identified by features such as color, shape and texture, and each feature contributes independently to classifying the fruit as a mango. It is called Bayes because it depends on Bayes' theorem, which is also known as Bayes' rule or Bayes' law.

    Before applying this algorithm, the data is converted into a table of the preprocessed features extracted from the given data. We then find the probabilities of the particular features and apply Bayes' theorem to calculate the class probability. The training set is given first and is used to learn the feature probabilities from past data; during testing, these are used to produce predictions, and the accuracy of the predicted results is reported.
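
    The procedure described above can be illustrated with a small from-scratch sketch. It assumes a multinomial Naive Bayes over word tokens with add-one (Laplace) smoothing, which is one common variant; the paper does not state which variant or library was used, and the function names and the toy training data below are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (tokens, label) pairs. Returns the counts needed for prediction."""
    class_counts = Counter()             # how many training samples each class has
    token_counts = defaultdict(Counter)  # per-class token frequencies
    vocabulary = set()
    for tokens, label in samples:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def predict(tokens, model):
    """Pick the class maximizing log P(class) + sum of log P(token | class)."""
    class_counts, token_counts, vocabulary = model
    total_samples = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_samples)  # prior
        denominator = sum(token_counts[label].values()) + len(vocabulary)
        for token in tokens:
            # Add-one smoothing avoids a zero probability for unseen tokens.
            score += math.log((token_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data with two of the four stance labels.
toy_samples = [
    (["this", "is", "true"], "support"),
    (["confirmed", "true"], "support"),
    (["this", "is", "false"], "deny"),
    (["fake", "false", "news"], "deny"),
]
model = train_naive_bayes(toy_samples)
```

    Working in log space keeps the product of many small probabilities from underflowing, and the independence assumption is what allows the per-token terms to be summed.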

    Support Vector Machine

    Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning.

    The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.

    SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. In the two-category case, the decision boundary or hyperplane separates the two different categories.
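
    To make the role of the hyperplane concrete, the sketch below classifies points by the side of a fixed linear boundary on which they fall and finds the points closest to it. The hyperplane here is hand-picked for illustration and is not learned; an actual SVM would choose the hyperplane that maximizes the margin to the nearest points. All names and values are assumptions made for this example.

```python
import math


def decision_value(weights, bias, point):
    """Signed score w.x + b; its sign tells which side of the hyperplane the point is on."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1


def distance_to_hyperplane(weights, bias, point):
    """Perpendicular distance |w.x + b| / ||w||."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(decision_value(weights, bias, point)) / norm


# Hypothetical, hand-picked hyperplane x1 + x2 - 3 = 0 (not learned from data).
weights, bias = (1.0, 1.0), -3.0
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

labels = [classify(weights, bias, p) for p in points]
distances = [distance_to_hyperplane(weights, bias, p) for p in points]
closest = min(distances)
# The points nearest to the boundary play the role of support vectors here.
nearest_points = [p for p, d in zip(points, distances) if math.isclose(d, closest)]
```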

    Logistic Regression

    Logistic regression is one of the most popular machine learning algorithms and comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0 and 1.

    Logistic Regression is very similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
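
    The way a linear score is turned into a probability between 0 and 1 can be sketched as follows. This is a minimal binary illustration under assumed values: the weights, bias and the 0.5 threshold are hypothetical and not taken from the paper, which classifies four stance categories and would need a multiclass extension.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1) in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Probability of the positive class: sigmoid of the linear score w.x + b."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a discrete 0/1 decision."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical parameters; in practice they are learned from labeled data.
weights, bias = [1.5, -2.0], 0.1
p_positive = predict_probability([1.0, 0.0], weights, bias)
p_negative = predict_probability([0.0, 1.0], weights, bias)
```

    The same linear score used by Linear Regression appears here, but passing it through the sigmoid is what makes the output usable as a class probability.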

    Neural Networks

    A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus a neural network is either a biological neural network, made up of real biological neurons, or an artificial neural network, for solving artificial intelligence (AI) problems. The connections of the biological neuron are modeled as weights. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be between -1 and 1.
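
    A single artificial neuron as described above, a weighted sum followed by an activation function, can be sketched as below. The function names and the example weights are assumptions for illustration; the sigmoid keeps the output between 0 and 1, and tanh keeps it between -1 and 1.

```python
import math


def linear_combination(inputs, weights, bias=0.0):
    """Weight each input and sum the results (plus an optional bias)."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def sigmoid(z):
    """Activation with output range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """One artificial neuron: linear combination, then activation."""
    return activation(linear_combination(inputs, weights, bias))


# A positive weight acts as an excitatory connection, a negative one as inhibitory.
balanced = neuron([1.0, 2.0], [0.5, -0.25])                   # weighted sum is 0
inhibited = neuron([3.0], [-2.0], activation=math.tanh)       # weighted sum is -6
```

    A full network stacks many such neurons in layers and learns the weights from data, which this sketch does not attempt.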

  4. RESULTS AND DISCUSSIONS

    In this section we discuss the experimental results obtained using the machine learning and deep learning models and show a comparison of the accuracies obtained by the models.

    Fig 1. Model Accuracy comparison graph

    Model | Accuracy (%)
    Logistic Regression | 77.79
    SVM | 74.93
    Naive Bayes | 72.55
    Neural Network | 75.6

    Table 2. Model accuracy comparison

    From the above graph and table, we can see the accuracies obtained using Logistic Regression, Support Vector Machine and Naive Bayes from machine learning and Neural Networks from deep learning.

    Compared with Neural Networks (75.6), Naive Bayes (72.55) and SVM (74.93), Logistic Regression provides the highest accuracy (77.79). That is, the predictions made by Logistic Regression are the most accurate on this dataset. Hence, we can select Logistic Regression for more accurate results and predictions. The figure below shows the accuracy obtained by the Logistic Regression model.

    Fig 2. Logistic Regression model accuracy graph
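
    The comparison step can be expressed as a short sketch: accuracy is the percentage of predictions that match the true labels, and the model with the highest value is selected. The accuracy function and the toy label lists are assumptions for illustration; the dictionary simply restates the values reported in Table 2.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of predictions that match the true labels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Made-up labels, only to show how the metric is computed.
toy_true = ["support", "deny", "query", "comment"]
toy_pred = ["support", "deny", "comment", "comment"]
toy_accuracy = accuracy(toy_true, toy_pred)

# Accuracies as reported in Table 2.
reported_accuracy = {
    "Logistic Regression": 77.79,
    "SVM": 74.93,
    "Naive Bayes": 72.55,
    "Neural Network": 75.6,
}
best_model = max(reported_accuracy, key=reported_accuracy.get)
```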

  5. CONCLUSION

This study evaluates the performance of rumor stance classification using machine learning and deep learning algorithms, namely Logistic Regression, SVM and Naive Bayes from machine learning and Neural Networks from deep learning. After training the four models, we compare their accuracies, and the model with the highest accuracy is selected for prediction. In our experiments, Logistic Regression provides higher accuracy than the other machine learning and deep learning models.

REFERENCES

  1. D. Zhao and M. B. Rosson, "How and why people twitter: the role that micro-blogging plays in informal communication at work," in Proceedings of the ACM 2009 international conference on Supporting group work, ser. GROUP '09. New York, NY, USA: ACM, 2009, pp. 243–252. [Online]. Available: http://doi.acm.org/10.1145/1531674.1531710
  2. T. Hossmann, P. Carta, D. Schatzmann, F. Legendre, P. Gunningberg, and C. Rohner, "Twitter in disaster mode: security architecture," in Proceedings of the Special Workshop on Internet and Disasters, ser. SWID '11. New York, NY, USA: ACM, 2011, pp. 7:1–7:8. [Online]. Available: http://doi.acm.org/10.1145/2079360.2079367
  3. A. Hermida, "From TV to twitter: How ambient news became ambient journalism," Media/Culture Journal, vol. 13, no. 2, 2010.
  4. White paper 2011, Information and Communications in Japan. Ministry of Internal Affairs and Communications, Japan, 2011.
  5. T. Shirai, T. Sakaki, F. Toriumi, K. Shinoda, K. Kazama, I. Noda, M. Numao, and S. Kurihara, "Estimation of false rumor diffusion model and estimation of prevention model of false rumor diffusion on twitter" (in Japanese), in The 26th Annual Conference of the Japanese Society for Artificial Intelligence, 2012.
  6. K. S. Jones, Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972.
  7. E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the conll2003 shared task: language-independent named entity recognition," in Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 – Volume 4, ser. CONLL '03. Stroudsburg, PA, USA: Association for Computational Linguistics, 2003, pp. 142–147. [Online]. Available: http://dx.doi.org/10.3115/1119176.1119195
  8. T. Iwakura, "A named entity recognition method using rules acquired from unlabeled data," in RANLP, 2011, pp. 170–177.
  9. R. A. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1999.
  10. A. Umejima, M. Miyabe, E. Aramaki, and A. Nadamoto, "Tendency of rumor and correction re-tweet on the twitter during disasters," IPSJ SIG Notes, vol. 2011, no. 4, pp. 1–6, 2011-07-26. [Online]. Available: http://ci.nii.ac.jp/naid/110008583012/en/
  11. T. Sakaki, M. Okazaki, and Y. Matsuo, "Earthquake shakes twitter users: real-time event detection by social sensors," in Proceedings of the 19th international conference on World Wide Web, ser. WWW '10. New York, NY, USA: ACM, 2010, pp. 851–860. [Online]. Available: http://doi.acm.org/10.1145/1772690.1772777
