Design of Self-Adjusting algorithm for Twitter Sentiment Analysis

Mr.   Amin Nazir Nagiwale; Mr.    Manish R.    Umale; Mr.    Aditya Kumar Sinha

doi:10.17577/IJERTV5IS020023

Volume 05, Issue 02 (February 2016)

Design of Self-Adjusting algorithm for Twitter Sentiment Analysis

DOI : 10.17577/IJERTV5IS020023

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 58
Total Downloads : 363
Authors : Mr. Amin Nazir Nagiwale, Mr. Manish R. Umale, Mr. Aditya Kumar Sinha
Paper ID : IJERTV5IS020023
Volume & Issue : Volume 05, Issue 02 (February 2016)
DOI : http://dx.doi.org/10.17577/IJERTV5IS020023
Published (First Online): 01-02-2016
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Design of Self-Adjusting algorithm for Twitter Sentiment Analysis

Mr. Amin Nazir Nagiwale

Computer Engineering Lokamanya Tilak College of Engineering

Navi Mumbai, India

Mr. Manish R. Umale

Computer Engineering Lokamanya Tilak College of Engineering

Navi Mumbai, India

Mr. Aditya Kumar Sinha

CDAC Pune, India

AbstractSocial Media is emerged as an open platform that people use it to share/exchange information. Data captured from social media is generated in different formats, different sources, varies in real time i.e. data generated has big data properties like velocity, volume, variety. For class of machine learning algorithms, large scale data processing frameworks like Hadoop, MapReduce are insufficient as they requires intermediate results across multiple executions. Also they have to dealtwith skewness, diversity of data to adapt changes in real time. For example Social media site, Twitter generates approximately 6000 tweets per second and these are according to trends that changes over time. As success of any machine learning algorithm like Sentiment Analysis is depend upon how well it is trained on training set, domain-specific information and corpus of training set.

To overcome these problems we have proposed Selfie,Self- adjusting algorithm in [12].In this paper we have discussed preparing domain specific corpus according to change in trends in real time by using self-adjusting computation. We have implemented Splay tree with trends as tree node and trends near to root of the tree are the most talked/tweeted trends. Each trend node contributes towards domainspecific corpus according to height of tree.

KeywordsTwitter,Self Adjusting Algorithm;Splay Tree; Map Reduce ;Apache Spark

INTRODUCTION

MapReduce framework is suitable for data-intensive applications for large scale processing, but there are classes of applications like machine learning algorithms, graph algorithms, sentiment analysis algorithms, etc. have to dealt with skewness, diversity of data to adapt changes in real time. For example, it is difficult to adapt to real time changes in training data/corpus for big data applications like Sentiment Analysis, Email spam detection, and log file analysis.

Also another observed property the Internet demonstrated in the size of its features, we soon discovered a widespread pattern in their measurements: there are many small elements contained within the Web, but few large ones. A few sites consist of millions of pages, but millions of sites only contain a handful of pages. Few sites contain millions of links, but many sites have one or two. Millions of users flock to a few select sites, giving little attention to millions of others Most of the big data computing frameworks assume the data is static i.e. for iterative and interactive applications it must re- execute the entire execution cycle. Due to skewness in the

access pattern of user issued queries application data changes incrementally e.g. Consider events like government decisions, launch of new smart phone, release of movie based upon which people start expressing their opinions, sentiments through tweets. Main challenge for Sentiment analysis system is to make system domain-specific. As users interest is changing from domains like politic, entertainment, marketing, our system has to update its domain specific vocabulary, language corpus according current trends and train their classifiers on newly collected row data i.e. they follow Zipfian distribution[11].

Therefore we have studied and proposed Self Adjusting algorithm, Selfie[12] for class of iterative and interactive applications that will capture the properties of data according to Zipf law [14] by generating Splay tree. Also Self- Adjusting data structure and operations that will help these applications to cache, persist, results, optimize their computation in efficient manner.

For experimental purposes, we have implemented Selfie, selfadjusting algorithm [1] with Splay tree for Twitter Sentiment analysis, which makes system responsible for skewness in access pattern and diversity in trends because skewness in the access pattern of user issued queries application data changes incrementally e.g. Consider events like government decisions, launch of new smart phone, release of movie based upon which people start expressing their opinions, sentiments through tweets. Main challenge for Sentiment analysis system is to make system domain- specific. As users interest is changing from domains like politic, entertainment, marketing, our system has to update its domain specific vocabulary, language corpus according current trends and train their classifiers on newly collected row data. Therefore proposed algorithm can be helpful for other iterative and interactive applications that faces machine learning challenges like feature generation and selection, over-fitting, explain and improve models to effectively deal with large dynamic data sets.

Proposed system is built on top of Apache Spark [5] andMongoDB[9] harnessing the power of both. Apache Spark will consumes the raw data from various sources and apply user defined MapReduce [7] algorithms on that and processed data is loaded back into MongoDB. MongoDB handles real-time, onlinequeries.
ZIPF LAW

Success of any machine learning depends upon how well algorithm is trained i.e. on training set. Preparing accurate training set is really difficult task as we consider size, speed at which data is growing .Capturing the domain-specific data from huge sea of data requires different approach that will respond at real time.

While studying characteristics and patterns of generated data we found that they follow zifian distribution i.e. only 20% of raw data actually contributes.

Consider Twitter sentiment analysis example, twitter generates 5,47,200 tweets per minute, our observations that there is skweness in twitter trends that will change according to events likes Election result declaration, Oscar awards declaration, death of celebrity i.e. what people are talking/tweeting changes according to real time events. So it becomes very difficult to prepare and train machine learning algorithm on domain-specific information. Another observed property is that only few percentage of generated information actually contributes towards specific information. We proposed solution to this problem by implementing self-adjusting data structure (Splay tree) that respond to the real time changes and capture the required domain specific

information according to Zipf law.

According to Zipf law, the frequency of use of the nth- most-frequently-used word in any natural language is approximately inversely proportional to n. Zipfs Law captures the relationship between frequency and rank:

There is a constant k such that:

According to Zipfs law probability of most frequently occurring object I over n objects is given by:

and thus the probability of item i occurring is

The proposed a framework for architecture for Sentiment Analysis is show in Fig.1.The framework consists of three major subdivisions Acquisition of data, followed by change initiator and change propagation algorithm.
1. Obtaining real-time/test data
  
  We have used the Streaming Twitter API and the public Streams version of it. All the public data flowing through Twitter currently could be collected using this Public Streams.
2. Change Initiator
  
  This part of system identifies the suitable events such as Oscar Award Declaration, Election Result Declarations
  
  ,Twitter current trend to initate change propagation algorithm.
3. Change Propagation
This part of system validates the events raised by the change initiator if access pattern is heavily skewed then change propagation algorithm performs actions like Memorization, self-adjusting data structure management and call the application specific algorithm.

Fig.1. Self-Adjusting framework for Twitter Sentiment Analysis

Proposed system is divided into two major parts: A.Change Propagation algorithm B.Application specific algorithm

We have used Splay tree as self-adjusting data structurefor implementation.

Algorithm 1.1: Change Initiator

Input: events, format of the data (csv,log,Real time data)

Output: Pre-processed data for splay tree (Serialized objects or objects of Node class)

1: Identify the events and call event handling function 2: Collect data related to the event in formats like

csv,log,real time data

3: Perform Filtering & smoothing

4: Apply Transformation Operations on pre-processed data

Algorithm 1.2:Change Propagation

Input: Pre-processed data nodes, Callback events

Output: memorization operation, call for registered events 1: Validate the pre-processed data

2: Perform required operations like insertion, deletion, split, join

3: Call the registered application specific call back event

4: Perform memorization operation on self-adjusting algorithm
SPLAY TREE:SELF ADJUSTING DATA STRUCTURE

A splay tree is a self-adjusting binary search tree with the additional property called splaying. Splaying rearranges the tree such that recently accessed element is placed at the root of the tree so that it is quick to access again. Splay tree performs basic operations like insert, delete and search in O(log n) amortized time.

In our proposed system we use Splay tree [1] as a self- adjusting data structure. Its construction process starts by executing Algorithm 1.1Change Initiator .Algorithm[12] 1.1 keeps track on events like Twitter Trends and provides this as input to Algorithm 1.2 [12] Change Propagation algorithm.

Formula for Mutant Creation

Suppose that {SiiI} is a set of Splay trees generated at consecutive runs R1, R2 .Rnthen we define:

= { | } (1)

Where x is splay tree with common nodes from set of splay trees Si.If number of nodes, n(x) = m > threshold value then expand mutant Si where iIotherwise create new mutant Si+1 where iI.

Algorithm 1.1. [12].provides input to Algorithm 1.2.[12] which will perform user defined function that are based on properties of Splay tree e.g. for Sentiment Analysis it will fetch the tweets based on the height of topic in Splay tree. Topics that are close to the root of the tree will contribute more to training set than topics that are closer to the leaf node.

F (t) =Tmin x ht Where, F =user defined function t = topic present in Splay tree Tmin = threshold value

ht =height of topic in splay tree

In our Sentiment Analysis example function F contributes to the training set to make train classifier on topics that are currently trending i.e. training set will become more domain specific.

F (#SelfiwithDaughter)>F (#DaagiMantriModike)> F (#Greece)> . >F (Zimbabwe)>F (Babulal Guar)

To keeps the track of tweets that are fetched Algorithm

1.2 saves Splay tree, shown in Fig.2.
RESULTS

We have implemented selfie algorithm in Twitter Sentiment Analysis to make system domain-specific with respective with respect to change in trends.

We have created training set that is divided into two parts first part contains language-specific (e.g. English) tweets that are randomly collected according to emotions/sentiment present in tweet and second part is dynamically created which contains domain-specific tweets. We have collected 8004 tweets out of which 4002 are positive and 4002 are negative which contributes language specific corpus and remains static.

Experiment is performed on machine with 4GB RAM, processor2.50GHzIntel(R) Core(TM) i5 with 395GB memory and having Ubuntu 14.04 operating system.

Twitter Sentiment analysis , real-time querying and reporting with MongoDB is implemented in Python Programming Language and Algorithm 1.1 [12] and Algorithm 1.2[12] are implemented on cluster computing framework Apache Spark[5] in JAVA programming language. With the help of NLTK metrics module we trained the NaÃ¯ve Bayes algorithm to calculate precision and recall:

Accuracy : 0.991

Positiveprecision : 0.997971602434

Positiverecall : 0.984

Negative precision : 0.984220907298 Negative recall : 0.998
CONCLUSION

We observed collecting domain-specific tweets is difficult because of most of the user uses words that are not part of any standard language corpus. They customize it .In future work we work on these problems also includes feature like polarity weight distribution according to height of the splay tree. We presently implemented Self-Adjusting algorithm for Twitter Sentiment analysis using MongoDB and Apache Spark, experimental results shows that this algorithm is helpful to train the system according to domain specific information available in real-time efficiently.
REFERENCES

A. Acar. Self-Adjusting Computation. PhD thesis, Department of Computer Science, Carnegie Mellon University.
Programmable Self-Adjusting Computation Ruy Ley-Wild
Streaming Big Data with Self-Adjusting Computation Umut A. Acar,

Yan Chen
Parallel Implementation of Big Data Pre-Processing Algorithms for Sentiment Analysis of Social NetworkingData V.Jude Nirmal and D.I. George Amalarethinam.
Spark: Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica University of California, Berkeley.
Fast Algorithms for Frequent Itemset Mining from Uncertain Data 2014 IEEE International Conference on Data Mining.
Hadoop Map/Reduce tutorial. http://hadoop.apache.org
NoSQL https://en.wikipedia.org/wiki/NoSQL.
MongoDB https://www.mongodb.org/
Twister: Iterative MapReducehttp://iterativemapreduce.org
Theory of Zipf's Law and Beyond Alexander Saichev Yannick Malevergne Didier Sornette
Mr.Amin Nazir Nagiwale, Mr. Manish R. Umale, Mr. Aditya Kumar Sinha,Design of Self-Adjusting algorithm for data-intensive MapReduce Applications IEEE conference on International Conference on Energy Systems and Applications (ICESA 2015), IEEE Pune Section

Design of Self-Adjusting algorithm for Twitter Sentiment Analysis

Leave a Reply