Personality Analysis using Social Media

Sagar Patel; Mansi Nimje; Akshay Shetty; Sagar Kulkarni

doi:10.17577/IJERTCONV9IS03066

NTASU - 2020 (Volume 09 - Issue 03)

Personality Analysis using Social Media

DOI : 10.17577/IJERTCONV9IS03066

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 1,383
Authors : Sagar Patel, Mansi Nimje, Akshay Shetty, Sagar Kulkarni
Paper ID : IJERTCONV9IS03066
Volume & Issue : NTASU – 2020 (Volume 09 – Issue 03)
Published (First Online): 22-02-2021
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Personality Analysis using Social Media

Sagar Patel, Mansi Nimje, Akshay Shetty, Prof. Sagar Kulkarni

Department of Information technology,

Pillai College of Engineering, Navi Mumbai, India 410206

Abstract Social media has become a platform for users to present themselves to the world openly by revealing their personal views and insights on their lives. Hence, extracting information from social media and yielding insightful results about the person has become easier. We are beginning to understand that this information can be efficiently utilized to analyze the personality of the concerned person. In this paper, we aim to gain knowledge of the personality of a user by using the social media platform of the concerned user. These social media platforms could be Facebook or Twitter. Personality analysis can help to reveal many types of interactions: it can be used to predict a suitable job for a person and also know about his efficiency in the same; professional, romantic, his natures traits can also be studied. Personality analysis may even be able to detect the roots of any kind of suspicious, immoral or wrongful trait in a person.

KeywordsPersonality, social media, MBTI, Big Five Personality, analysis

BACKGROUND AND RELATED WORK

2.1 The Big Five Personality Model

Figure 1: Big Five Personality Model
1. INTRODUCTION
  
  Personality is defined as the characteristic set of behaviors, cognitions and emotional patterns that evolve from biological and environmental factors. While there is no generally agreed-upon definition of personality, most theories focus on motivation and psychological interactions with one's environment. Trait-based personality theories, such as those defined by Raymond Cattell define personality as The traits that predict a person's behavior. On the other hand, more behavioral-based approaches define personality through learning and habits. Nevertheless, most theories view personality as relatively stable.
  
  Since the inception of social media, a prodigious amount of status updates, tweets and comments have been posted online. The language people use to express themselves can provide clues about the kind of people they are, online and off the digital media. Some personality psychologists study publicly available social media data in addition to solicited surveys. However, they still start with predefined traits like extroversion, neuroticism or narcissism and correlate them with the writing. In other research, linguists have used algorithms to identify topics of conversation, but they do not have much to say about the personalities of the conversationalists. Hence research in this sector is important.
  
  The Big Five model of personality [12] dimensions is one of the most well-researched and well-regarded measures of personality structure in recent years. The models five domains of personality are Openness to experiences, Conscientiousness, Extroversion, Agreeableness, and Neuroticism.
  Before we actually began to start working on building our own system for Personality Analysis, we referred multiple research and papers from around the world. We studied and tried to identify all the drawbacks and flaws of the existing system architectures proposed. We realized that the predictive models must be scalable and dynamic to meet the requirements of ever-growing data and vast possibilities. Also, Very little work is done on semantic analysis for personality detection from tweets. It was identified that different approaches can be followed to improve recognition of linguistic constraints such as slang usage, communal bias, abbreviations and sentiment of tweets. Most of the work done is limited to the English language and hence the involvement of different languages is required.
  
  The authors of [1] have used the supervised learning approach to compute personality using SVM, [2] authors have explored the importance of twitter profiles in predicting the personality using Logistics Regression and neural network. Also the authors [3] have used SVM, NaiveBayes for personality prediction from written text. For the personality prediction of people in the USA use of images was made [5]. Also the use of NaiveBayes, LogisticsRegression, KNN and SVM is made for classification of different personality traits from social media [6][7], the authors [8] have worked on dutch, Spanish and Italian language for personality prediction, some authors [9][10] have used the deep learning approach using RNN and LSTM for getting semantic analysis for personality classification.
  - Hyperlink removal: As we are only dealing with the simple textual data, hyperlinks are of no use and are removed using a regular expression.
  - Emoticons handing: The emoticons are converted to text which adds to the quality of our training dataset.
  - Unwanted character removal: Unwanted characters like punctuation marks, numbers, multiple spaces and symbols which do not provide meaningful information are removed.
  - Stopword removal: In the dataset there are many stopwords like a, for, etc which does not provide meaningful information while training so we remove it by using the nltk stopword library.
  - Lemmatization: Grouping of different inflected forms of a word, called lemma is done (gone, going, went to go) since inflected forms of the same words carry one same meaning.
  - Stemming: Stemming is a crude heuristic process that chops off the ends of words or removes derivational affixes. To achieve stemming, a snowball stemmer was used.
  - Finally we have used the one-hot encoding technique to encode the set of four MBTI personality traits into 0s and 1s for classification tasks.
    1. DATASET
      
      The dataset used is from Kaggle[14]. This dataset is publicly available. It consists of tweets tagged with one of the 16 MBTI types. These tags are combinations of four
      
      4.2 Sampling
      
      Figure 3: Proposed system
      
      characters. The dataset consists of a total 8660 rows and 2 columns. The distribution of MBTI traits (number of rows out of 8600) in each class is as follows:
  - Introversion(I) : 6664; Extraversion(E) : 1996
  - Sensing(S) : 7466; Intuition(N) : 1194
  - Thinking(T) : 4685; Feeling(F) : 3975
  - Judging(J) : 5231; Perceiving(P) : 3429
    
    There are fifty posts included for every user. This data comes from uses of personalitycafe.com, an online forum where users first take a questionnaire that sorts them into their MBTI types.
    1. METHODOLOGY
      
      This section describes the methodology(figure 3) followed to obtain the personality scores on the MBTI scale:
  - N-grams: N-grams are contiguous sequences on n-items in a sentence. It helps to give more features of different grams and gain semantic analysis from the text.
  - Bag-of-words (BOW): Using the BOW technique different features related to each personality type are extracted based on words for classifying the personality traits at the word level.
  - Term Frequency and Inverse Document Frequency (TF-IDF): Term Frequency and Inverse Document Frequency or TF-IDF, produces the product of term-frequency and inverse-document-frequency. As our dataset is unbalanced for a few personality types, it results in having fewer words related to a few personality types. So to extract the features from low-level TF-IDF is useful.
  - Word2Vector (W2V): Word2vec takes as its input, a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. For getting the semantic analysis the W2V features extraction is used.
  - Glove Word Embedding: In the RNN model for our embedding layer, we use an embedding matrix to the 50-dimensional Glove representation of that word.
  After performing all the steps involving the data collection and its processing along with the critical training, evaluation and validation of the models mentioned above, the best model was selected. This final model having the highest performance was used in web deployment. A Web Application was developed to provide a simple user interface to take the personality test. Web application was developed with the Flask framework which helps well to serve the web and can be easily integrated with python which is used in the backend working. For passing the tweets of any user for personality prediction the TwitterAPI is used which gives the list of tweets of the specified user and then it is sent to the backend for preprocessing and also converting it to a similar format that of features extraction technique used during the training phase and then to saved model. Finally the results are displayed in a graphical format to the user at the user interface.
  1. RESULTS
    
    The goal of the proposed approach was to analyze and predict the personality traits of a person. A comparative study(Table 1) of various classifiers and feature vectors was done to obtain the most accurate prediction of the personality trait.
    
    KNN performs very badly and also it is computational very slow. Logistic Regression shows a better result compared to KNN. Naive Bayes classifier has greater performance than KNN and Logistic Regression for E_vs_I, S_vs_N, F_vs_T, and P_vs_J, where accuracy for P_vs_J is low as compared to others. RNN and SVM classifiers also reported with low performances for classifying the user personality types, as SVM and RNN require the larger dataset for better accuracy (Table 1).
    
    Table 1: Comparative study of algorithms
    
    From the above Table 1, the two models Naive Bayes and Logistic Regression classifier perform better than others. For selecting the one best model we consider the confusion matrix(Table 2)[15] and other classification metrics.
    
    Table 2: Confusion matrix of algorithms
    
    Figure 4 shows the graphical representation of the accuracy of Naive Bayes and Logistics Regression with different Features Vector.
    
    Figure 4: Comparison of model
    
    From Figure 4 it is clear that the NaiveBayes with TF-IDF UpSampling perform better than the other classifiers.
    
    Figure 5: GUI of Web
    
    Figure 6: GUI of Web
  2. CONCLUSION
    
    The system was able to gather the tweets of the user when given his twitter handle. The algorithms used were able to process the tweets and generate the required results. Finally the system was able to provide an efficiently accurate result by giving the personality type of the user.
    
    An interactive GUI was created to use the system.
  3. FUTURE WORK

The established system can be made more efficient by conducting more testing and feeding the system with a much accurate dataset. Once the system is highly accurate it can be used in corporations or even by the government to analyze the personalities of concerned individuals. The system can also be used in the crime sector. The currently implemented system of personality analysis can be extended and features like gender detection, age detection etc can be added to it.

ACKNOWLEDGMENT

It is our privilege to express our sincerest regards to our supervisor Prof. Sagar Kulkarni for the valuable inputs, able guidance, encouragement, whole-hearted cooperation and constructive criticism throughout the duration of this work. We deeply express our sincere thanks to our Head of the Department Dr. Satishkumar Varma and our Principal Dr. Sandeep M. Joshi for encouraging and allowing us to present this work.

REFERENCES

Giulio Carducci, Giuseppe Rizzo, Diego Monti, Enrico Palumbo, Maurizio Morisco, TwitPersonality: Computing Personality Traits From Tweets Using Word Embeddings and Supervised Learning, 2018
Mehul Smriti Raje(B) and Aakarsh Singh, Personality Detection by Analysis of Twitter Profiles, 2018
Srilakshmi Bharadwa j, Srinidhi Sridhar, Rahul Choudhary, Ramamoorthy Srinath, Persona Traits Identification based on Myers-Briggs Tpe Indicator(MBTI) – A Text Classification, 2018
Chaowei Li, Jiale Wan, Bo Wang, Personality Prediction of Social Network Users 2018
Shafaan Khaliq Bhatti, Asia Muneer, M Ikram Lali, Muqaddas Gull, Personality Analysis of the USA Public Using Twitter Profile Pictures
Bayu Yudha Pratama, Riyanarto Sarno, Personality Classification Based on Twitter Text Using Naive Bayes, KNN and SVM
Kanupriya Sharma, Amanpreet Kaur, Personality prediction of Twitter users with Logistic Regression

Classifier learned using Stochastic Gradient Descent 2015, IOSR Journal of Computer Engineering (IOSR-JCE)
Mounica Arroju, Aftab Hassan, Golnoosh Farnadi, Predicting Myers-Briggs Type Indicator with Text Classification
Hernandez, Rayne and Knight, Ian Scott, Predicting Myers- Briggs Type Indicator with Text Classification
Anthony Ma and Gus Liu, Neural Networks in Predicting Myers Brigg Personality Type from Writing Style
FIRE, Forum for Information Retrieval Evaluation.
https://en.wikipedia.org/wiki/Big_Five_personali ty_traits, Five Big Personality traits.
https://www.myersbriggs.org/my-mbti-personalit y-type/mbti- basics/home.htm?bhcp=1, Myers-Briggs Type Indicator.
https://www.kaggle.com/datasnaek/mbti-type, Kaggle MBTI Dataset.
https://scikit- learn.org/stable/auto_examples/model_selection/plot_confusion_ matrix.html,Confusi on matrix.

Personality Analysis using Social Media

Leave a Reply