Scalable Learning for Identify and Ranking Prevalent News Topic using Social Media Factor

S. Savitha; K. Logeswaran; N.Sowmiya; S.Suryaprakash; M.Tamilselvan; K.Tamilarasu

doi:10.17577/IJERTCONV7IS01034

RTICCT - 2019 (Volume 7 Issue 01)


Open Access
Authors : S. Savitha, K. Logeswaran, N.Sowmiya, S.Suryaprakash, M.Tamilselvan, K.Tamilarasu
Paper ID : IJERTCONV7IS01034
Volume & Issue : RTICCT – 2019 (Volume 7 – Issue 01 )
Published (First Online): 05-04-2019
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


S. Savitha1, K. Logeswaran2, N. Sowmiya3, S. Suryaprakash4, M. Tamilselvan5, K. Tamilarasu6

1 Assistant Professor, Department of CSE, K.S.R. College of Engineering, Tiruchengode, India

2 Assistant Professor, Department of IT, Kongu Engineering College, Perundurai, India

3,4,5,6 UG Students, Department of CSE, College of Engineering, Tiruchengode

Abstract: News media presents professionally verified occurrences or events, whereas social media presents the interests of the audience in these areas and may therefore give insight into their popularity. Social media services like Twitter can also provide additional or supporting information on a particular news topic. Truly valuable information may be thought of as the area in which these two media sources topically intersect. Unfortunately, even after unimportant content is eliminated, the remaining news-related data still causes information overload and must be prioritized before use. To assist in this prioritization, news must be ranked in order of estimated importance. First, preprocessing is carried out. Key terms are extracted and filtered from news and social data corresponding to a selected period of time. A graph (called the key term graph) is built from the previously extracted key term set; its vertices represent the key terms and its edges represent the co-occurrence similarity between them. After processing and pruning, the graph contains slightly joined clusters of topics popular in both news media and social media. The graph is then clustered to obtain well-defined and disjoint subgraphs. The subgraphs of the main graph are selected and ranked based on user attention. The work thus identifies news topics that are prevalent in both social media and the news media, and then ranks them. The Louvain algorithm is used for community detection on the social media data. Finally, the results are validated using the metrics modularity and edge density. The CLIQUE method provides the best result when compared with the Louvain algorithm.


INTRODUCTION

Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing huge sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions.

Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining derives its name from the similarities between searching for valuable information in a large database and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material or intelligently probing it to find where the value resides. Although data mining is still in its infancy, companies in a wide range of industries, including retail, finance, health care, manufacturing, transportation, and aerospace, are already using data mining tools and techniques to take advantage of historical data.

By using pattern recognition technologies together with statistical and mathematical techniques to sift through warehoused information, data mining helps analysts recognize significant facts, relationships, trends, patterns, exceptions, and anomalies that might otherwise go unnoticed. For businesses, data mining is used to discover patterns and relationships in the data in order to help make better business decisions. Data mining can help spot sales trends, develop smarter marketing campaigns, and accurately predict customer loyalty.

Specific uses of data mining include:
- Market segmentation: identify the common characteristics of customers who buy the same products from your company.
- Customer churn: predict which customers are likely to leave your company and go to a competitor.
- Fraud detection: identify which transactions are most likely to be fraudulent.
- Direct marketing: identify which prospects should be included in a mailing list to obtain the highest response rate.
- Interactive marketing: predict what each individual accessing a website is most likely interested in seeing.
- Market basket analysis: understand which products or services are commonly purchased together, e.g., beer and diapers.
Data mining is the process of analyzing hidden patterns of data from different perspectives for categorization into useful information, which is collected and assembled in common areas, such as data warehouses, for efficient analysis by data mining algorithms, facilitating business decision making and other information requirements to ultimately cut costs and increase revenue [https://www.techopedia.com/definition/1181/data-mining].
RELATED WORKS

In the paper "Toward Collective Behavior Prediction via Social Dimension Extraction" [1], Lei Tang and Huan Liu (Arizona State University, 2010) state that collective behavior refers to how individuals behave when they are exposed to a social network environment. They examine how the online behavior of users in a network can be predicted, given behavior information for some actors in the network.

Their approach demonstrated several advantages and is especially suitable for large-scale networks, paving the way for the study of collective behavior in many real-world applications. Social media such as Facebook, MySpace, Twitter, BlogCatalog, Digg, YouTube, and Flickr help people from all walks of life to express their thoughts, voice their opinions, and connect with one another anytime and anywhere. For example, popular content-sharing sites like Delicious, Flickr, and YouTube allow users to upload, tag, and comment on different types of content (e.g., bookmarks, photos, videos).

A person's behavior is often influenced by the behavior of his or her friends. This naturally leads to behavior correlation between connected users. Such collective behavior correlation can also be explained by homophily [5].

In the paper "Finding community structure in networks using the eigenvectors of matrices" [2] (2006), M. E. J. Newman considers the problem of detecting communities or modules in networks, that is, groups of vertices with a higher-than-average density of edges connecting them. Previous work indicates that a robust approach to this problem is the maximization of the benefit function known as modularity over possible divisions of a network. The author shows that this maximization process can be written in terms of the eigenspectrum of a matrix called the modularity matrix, which plays a role in community detection similar to that played by the graph Laplacian in graph partitioning calculations. A common feature of many networks is community structure, the tendency for vertices to divide into groups, with dense connections within groups and only sparse connections between them.

In social networks, for example, it has long been accepted that individuals who lie on the boundaries of communities, bridging gaps between otherwise unconnected people, enjoy an unusual level of influence as the gatekeepers of information flow between groups [6, 7, 8].

In the paper "Yes, There is a Correlation - From Social Networks to Personal Behavior on the Web" [3], Parag Singla and Matthew Richardson state that characterizing the relationship between a person's social group and personal behavior has been a long-standing goal of social network [9] analysts. They apply data mining techniques to study this relationship for a population of over ten million people, by turning to online sources of data.

The analysis reveals that people who chat with each other (using instant messaging) are more likely to share interests (their web searches are the same or topically similar). The more time they spend talking, the stronger this relationship is. People who chat with each other are also more likely to share other personal characteristics, such as their age and location, and they are likely to be of opposite gender. Similar findings hold for people who do not necessarily talk to each other but do have a friend in common. The analysis is based on a well-defined mathematical formulation of the problem and is the largest such study the authors were aware of.

In the paper "Birds of a Feather: Homophily in Social Networks" [4], Miller McPherson, Lynn Smith-Lovin, and James M. Cook state that similarity breeds connection. This principle, the homophily principle, structures network ties of every type, including marriage, friendship, work, advice, support, information transfer, exchange, co-membership, and other types of relationship. The result is that people's personal networks are homogeneous with regard to many sociodemographic, behavioral, and intrapersonal characteristics. Homophily limits people's social worlds in a way that has powerful implications for the information they receive, the attitudes they form, and the interactions they experience.

Homophily in race and ethnicity creates the strongest divides in personal environments, with age, religion, education, occupation, and gender following in roughly that order. Geographic proximity, families, organizations, and isomorphic positions in social systems all create contexts in which homophilous relations form. Ties between non-similar individuals also dissolve at a higher rate, which sets the stage for the formation of niches (localized positions) within social space.

The authors argue for more research on: (a) the basic ecological processes that link organizations, associations, cultural communities, social movements, and many other social forms; (b) the impact of multiplex ties on the patterns of homophily; and (c) the dynamics of network change over time through which networks and other social entities co-evolve.

SYSTEM DESIGN

Introduction

Social media and traditional media feed off each other and are mutually beneficial. Combined, they create a much stronger and more effective marketing campaign. The intersection of the information in the two media is used to estimate the popularity of news in a particular period of time. The analysis thus provides a strong opinion about particular news items to support future decision making. This chapter states the problem of community detection in large graphs and outlines the existing work.
Existing Work

The existing approach to detecting communities in a large network is described below.
1. Girvan-Newman Clustering
  
  The Girvan-Newman algorithm detects communities by progressively removing edges from the original network. The connected components of the remaining network are the communities. Vertex betweenness is an indicator of highly central nodes in a network. For any node, vertex betweenness is defined as the number of shortest paths between pairs of nodes that run through it. If there is more than one shortest path between a pair of nodes, each path is assigned equal weight such that the total weight of all of the paths is equal to unity. The Girvan-Newman algorithm extends this definition to edges. If communities are only loosely connected by a few inter-group edges, all shortest paths between different communities must run along one of these edges, so the edges connecting communities will have high edge betweenness (at least one of them). By removing these edges, the groups are separated from one another and the underlying community structure of the network is exposed.
  
  The steps of the algorithm for community detection are summarized as follows.
  
  Step 1: Find the edge of highest betweenness, or multiple edges of highest betweenness if there is a tie. The betweenness of a node v is
  
  $$g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} \qquad (1)$$
  
  where $\sigma_{st}$ is the total number of shortest paths from the starting node s to the ending node t, and $\sigma_{st}(v)$ is the number of those shortest paths that pass through v. The value can be normalized to lie between 0 and 1.
  
  Step 2: Remove the edge with the highest betweenness value.
  
  Step 3: Recalculate all betweenness values, and again remove the edge or edges of highest betweenness.
  
  Step 4: Proceed in this way as long as edges remain in the graph, in each step recalculating all betweenness values and removing the edge or edges of highest betweenness.
  
  The betweenness centrality must be recalculated at each step, because the network adapts itself to the new conditions set after an edge is removed. For example, if two communities are connected by more than one edge, there is no guarantee that all of these edges will have high betweenness. Recalculating betweenness after the removal of each edge ensures that at least one of the remaining edges between two communities will always have a high value [https://en.wikipedia.org/wiki/Girvan%E2%80%93Newman_algorithm].
  
  Drawbacks of Girvan-Newman clustering
  - Girvan-Newman clustering is too slow for large networks.
  - It yields relatively poor results for dense networks.
  - It takes more computation time to partition a large graph.
  - The clustering approach is not designed to obtain overlapping topic clusters.
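As an illustration of the betweenness measure on which Girvan-Newman clustering relies, the following minimal sketch computes vertex betweenness for a toy graph. It is written in Python purely for illustration (the work itself uses R); the function names and the six-node graph are hypothetical, and each unordered pair of end nodes is counted once.

```python
from collections import deque
from itertools import combinations


def shortest_path_counts(adj, source):
    """BFS from `source`: distance and number of shortest paths to each node."""
    dist, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                count[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                count[w] += count[u]
    return dist, count


def vertex_betweenness(adj):
    """g(v): sum over unordered pairs {s, t} of sigma_st(v) / sigma_st."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    g = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        dist_s, count_s = info[s]
        dist_t, count_t = info[t]
        if t not in dist_s:
            continue                     # s and t are not connected
        for v in adj:
            if v == s or v == t or v not in dist_s:
                continue
            # v is on a shortest s-t path iff d(s, v) + d(v, t) = d(s, t)
            if dist_s[v] + dist_t[v] == dist_s[t]:
                g[v] += count_s[v] * count_t[v] / count_s[t]
    return g


# Hypothetical toy graph: two triangles joined by the bridge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
g = vertex_betweenness(graph)
```

In this toy graph only the two end nodes of the bridge receive a non-zero score, which matches the intuition that paths between communities are funnelled through a few central nodes and edges.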
PROBLEM DEFINITION

Twitter is an American online news and social networking service on which users post and interact with messages known as tweets. Registered users can post tweets, talk about news, and share interesting topics, whereas unregistered users can only read them [https://en.wikipedia.org/wiki/Twitter].

The news media (traditional media such as web news crawls and news website forums) contain professionally verified events. Valuable information is obtained by intersecting the two media sources. Communities are detected to discover how a particular topic is discussed by users, which can be used to provide a strong opinion on particular news present in the media. Detecting communities in a large network is not an easy task.

In the existing system, communities are detected using the Girvan-Newman clustering method, which is based on betweenness and is suited to smaller graphs [https://en.wikipedia.org/wiki/Girvan%E2%80%93Newman_algorithm].

The proposed work addresses the detection of communities in large graphs. The proposed system uses two methods, CLIQUE (CLustering In QUEst) detection and the Louvain method, to detect communities effectively. CLIQUE detection uses a multi-resolution grid data structure. Each cluster contains a maximal set of actors in which every actor is connected to every other, and the method generates a minimal description for the clusters. The Louvain method is a greedy optimization method. It allows the edge ranking in a large network to be computed efficiently, in linear time. Finally, it discovers the community structure by optimizing the modularity of the network.

Implementation Tool

The implementation tool employed in the present work is R.

R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis. R has a command line interface, and several graphical user interfaces are available, such as RStudio, an integrated development environment. R is an implementation of the S programming language combined with lexical scoping semantics, inspired by Scheme. Some of the R packages employed in the current work are as follows.

Package	Description
RColorBrewer	Color palettes for thematic maps
tm	Framework for text mining
twitteR	Access to the Twitter API
wordcloud	Plot a cloud of words
ROAuth	R interface for OAuth authentication
NLP	Natural language processing
SnowballC	Stemming of words
RCurl	HTTP requests to URLs
textmineR	Corpus creation
textclean	Normalizing and cleaning text
igraph	Fast handling of large graphs
syuzhet	Quick extraction of the plot (sentiment trajectory) of a text
plyr	Splitting a large data structure, applying a function, and combining the results
XML	Parsing and generating XML

Summary

This chapter describes the problem definition and gives an overview of the existing algorithm, Girvan-Newman clustering. Considering the drawbacks of the existing work, the current work uses the CLIQUE detection and Louvain community detection models to obtain a better community structure than the existing work, as described in the next chapter, Chapter 4, System Methodology.

SYSTEM METHODOLOGY
1. Introduction
  
  The keywords of the combined sources (Twitter and news media) are used to find the intersecting words and the co-occurring words, from which the news term graph is created. In this graph the vertices are the words and the edges are the relationships that exist among the vertices. The community detection algorithm then detects the dense regions, which correspond to information that appears frequently in both news and Twitter. The CLIQUE and Louvain methods are used to detect better communities than the existing method. This chapter describes the algorithms proposed in the current work.
2. SYSTEM ARCHITECTURE
  
  The modules in the current work are as follows
  - Dataset collection
  - Preprocessing
  - Key term graph construction
  - Key term similarity estimation
  - Graph clustering : Girvan-Newman clustering
  - Content selection : User Attention (UA)
  - CLIQUE detection algorithm
  - Louvain algorithm
  - Performance evaluation
    1. DATASET COLLECTION
      
      NEWS DATA:
      
      The BBC News website (https://www.bbc.com/news) contains international news coverage as well as British, entertainment, science, and political news. Many reports are accompanied by audio or video from the BBC's television and radio news services. The site covers a broad range of categories such as world, sport, weather, travel, business, entertainment, health, science, and technology. For the current work, the sports news published from 1.11.2018 to 30.11.2018 is downloaded from the website (https://www.bbc.com/news/sports).
      
      TWITTER DATA:
      
      A Twitter account is used to obtain access to the Application Program Interface (API). The API provides the consumer secret key and the access token secret key for authenticated retrieval of tweets. Tweets related to sports news are collected. Figure 4.1 shows the overall architecture of the current work. The modules in the work are explained below.
    2. PREPROCESSING
      
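      The collected news articles and tweets are preprocessed in this step.
      
      Figure 4.1. The overall architecture of the current work (blocks: database of news articles and Twitter posts; preprocessing into news keywords and Twitter keywords; news term graph construction; Girvan-Newman clustering; content selection by user attention estimation; proposed community structure detection models, CLIQUE detection and Louvain method)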
    3. Key Term Graph Construction
      
      A graph G is generated in which the clustered nodes represent the news topics prevalent in both news and social media. The vertices of G are the terms retrieved from N and T, and the edges express the relationships among the nodes. The following methods are used to find the relationships between the words.
      - Term Document Frequency
        
        The document frequency of each term in news and Twitter is calculated. Here df(n) is the document frequency of term n and df(t) is the document frequency of term t.
      - Relevant Key Term Identification
        
        N represents the keywords present in the news articles and T represents all relevant terms present in the tweets. To extract the topics that are prevalent in both news and social media, the following formula is used.
        
        $$I = N \cap T \qquad (2)$$
        
        This intersection of N and T eliminates the terms from T that are not relevant to the news and the terms from N that are not mentioned in social media. The terms in I (the intersection words) are ranked based on their prevalence in both sources. The prevalence of a term is the combination of its occurrence in both N and T, as given by Equation (3), where |T| is the total number of tweets chosen between dates d1 and d2 and |N| is the total number of news articles chosen in the same period of time.
      - Key Term Similarity Estimation
        
        The intuition behind co-occurrence is that terms which co-occur frequently are related to the same topic and may be used to summarize and represent it when grouped. The co-occurrence of each term pair (i, j) in I is found and denoted co(i, j). The term-pair co-occurrence is then used to estimate the similarity between terms. A number of similarity measures were tested, namely Jaccard, Dice, and cosine similarity.
        
        The Dice similarity between terms i and j is calculated as follows,
        
        $$\mathrm{dice}(i,j) = \begin{cases} \dfrac{2\,co(i,j)}{v_i + v_j} & \text{if } \dfrac{2\,co(i,j)}{v_i + v_j} > \vartheta \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
        
        where $v_i$ is the number of tweets that contain term i, $v_j$ is the number of tweets that contain term j, co(i, j) is the number of tweets in which terms i and j co-occur, and $\vartheta$ is a threshold used to discard similarities that fall below it.
        
        The Jaccard similarity between terms i and j is calculated as follows,
        
        $$\mathrm{jaccard}(i,j) = \frac{co(i,j)}{v_i + v_j - co(i,j)} \qquad (5)$$
        
        The cosine similarity between terms i and j is calculated as follows,
        
        $$\mathrm{cosine}(i,j) = \frac{co(i,j)}{\sqrt{v_i \, v_j}} \qquad (6)$$
        
        All of the similarity measures described above generate a value between 0 and 1.
    4. Graph Clustering: Girvan-Newman
      
      This algorithm is used to find the word clusters. The goal is to identify and separate the well-defined subgraphs in the graph.
      - Betweenness
        
        The core idea of Girvan-Newman clustering is the concept of edge betweenness. The betweenness value of an edge is the number of shortest paths between pairs of nodes that run along it. The betweenness measure of an edge e is calculated as follows,
        
        $$\mathrm{Betweenness}(e) = \sum_{i,j \in V} \frac{\sigma_e(i,j)}{\sigma(i,j)} \qquad (7)$$
        
        where V is the set of vertices, $\sigma(i,j)$ is the number of shortest paths between vertices i and j, and $\sigma_e(i,j)$ is the number of those paths that pass through edge e.
      - Transitivity
        
        Transitivity is a property of a relation among three elements such that if the relation holds between the first and second elements, and between the second and third elements, then it also holds between the first and third elements. The transitivity of a graph G is defined as
        
        $$\mathrm{Transitivity}(G) = \frac{3 \times \text{number of triangles in } G}{\text{number of connected triples of vertices in } G} \qquad (8)$$
      
      Algorithm Girvan-Newman: Improve the Cluster Quality of a Graph
      
      Input: Graph G
      Output: Cluster-quality-improved G
      
      B = {}    (empty set)
      repeat
          for all (edge e in G) do
              Calculate betweenness(e) and append to B
          end for
          if first iteration of loop then
              bavg = avg(B)
          end if
          bmax = max(B)
          trans0 = transitivity(G)    (previous transitivity)
          Remove edge with bmax from G
          trans1 = transitivity(G)    (posterior transitivity)
          Clear set B
      until (trans1 < trans0 or bmax < bavg)
      Add edge with bmax to G
      
      Step 1: The betweenness values of all edges in graph G are calculated.
      
      Step 2: The initial average betweenness of graph G is calculated.
      
      Step 3: The edges with high betweenness values are iteratively removed in order to separate clusters.
      
      Step 4: The edge removal process stops when removing additional edges yields no gain in the clustering quality of the graph. Once the process has stopped, the last removed edge is added back to G.
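The pruning loop described above can be sketched as follows. This is a minimal Python illustration under stated assumptions (the work itself uses R): the graph is unweighted and undirected, the helper names `bfs_counts`, `edge_betweenness`, `transitivity`, and `prune_by_betweenness` are hypothetical, and edge betweenness is computed by a simple all-pairs breadth-first search that is only suitable for small graphs.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Distance and number of shortest paths from `source` to each node."""
    dist, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], count[w] = dist[u] + 1, 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                count[w] += count[u]
    return dist, count


def edge_betweenness(adj):
    """Sum over node pairs of the fraction of shortest paths using each edge."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {tuple(sorted((u, v))): 0.0 for u in adj for v in adj[u]}
    for s, t in combinations(adj, 2):
        (dist_s, count_s), (dist_t, count_t) = info[s], info[t]
        if t not in dist_s:
            continue  # s and t lie in different components
        for u, v in score:
            for x, y in ((u, v), (v, u)):  # the edge may be crossed either way
                if x in dist_s and dist_s[x] + 1 + dist_t[y] == dist_s[t]:
                    score[(u, v)] += count_s[x] * count_t[y] / count_s[t]
    return score


def transitivity(adj):
    """3 * triangles / connected triples (0.0 if there are no triples)."""
    closed = triples = 0
    for nbrs in adj.values():
        triples += len(nbrs) * (len(nbrs) - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triples if triples else 0.0


def prune_by_betweenness(adj):
    """Remove high-betweenness edges until the stopping rule fires."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    b_avg = None
    while True:
        scores = edge_betweenness(adj)
        if not scores:
            break
        if b_avg is None:  # average is taken in the first iteration only
            b_avg = sum(scores.values()) / len(scores)
        (u, v), b_max = max(scores.items(), key=lambda item: item[1])
        trans0 = transitivity(adj)
        adj[u].discard(v)
        adj[v].discard(u)
        trans1 = transitivity(adj)
        if trans1 < trans0 or b_max < b_avg:
            adj[u].add(v)  # add the last removed edge back
            adj[v].add(u)
            break
    return adj


# Hypothetical toy graph: two triangles joined by the bridge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
pruned = prune_by_betweenness(graph)
```

On this toy graph the bridge is removed in the first iteration, and the loop stops in the second iteration because removing any triangle edge lowers the transitivity, so that edge is restored and the two triangles remain as separate clusters.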
    5. Content Selection : User Attention
      
      The User Attention (UA) represents the number of unique Twitter users related to the selected tweets. The tweets related to a topic cluster (TC) are selected, and then the number of unique users who created those tweets is counted. The UA is computed by Equations (9) and (10), which use the number of unique users related to TC and the entire graph G.
      
      The resulting value lies between 0 and 1.
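The counting of unique users per topic cluster can be illustrated with a short sketch. It is written in Python for illustration only (the work itself uses R). Two points are assumptions rather than details taken from the text: a tweet is treated as related to a cluster when it contains at least one of the cluster's terms, and the count is normalized by the sum of the unique-user counts over all clusters in the graph so that the value lies between 0 and 1. The function name and sample data are hypothetical.

```python
def user_attention(clusters, tweets):
    """Unique-user share of each topic cluster.

    clusters: cluster name -> set of key terms
    tweets:   list of (user, list of tokens)
    """
    users = {}
    for name, terms in clusters.items():
        # Assumption: a tweet relates to a cluster if it shares any term with it.
        users[name] = {user for user, tokens in tweets if terms & set(tokens)}
    # Assumption: normalize by the total over all clusters in the graph.
    total = sum(len(members) for members in users.values())
    return {
        name: (len(members) / total if total else 0.0)
        for name, members in users.items()
    }


clusters = {
    "cricket": {"cricket", "wicket"},
    "football": {"football", "goal"},
}
tweets = [
    ("u1", ["cricket", "score"]),
    ("u1", ["wicket"]),           # same user again: counted once
    ("u2", ["goal"]),
    ("u3", ["football", "goal"]),
    ("u4", ["weather"]),          # unrelated to either cluster
]
ua = user_attention(clusters, tweets)
```

With this normalization the clusters can be ranked directly by their score, and the scores of all clusters sum to 1.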
    6. Clique Detection Algorithm
      
      The CLIQUE algorithm was one of the first subspace clustering algorithms. It identifies dense clusters in subspaces of maximum dimensionality. The algorithm combines density-based and grid-based clustering. It uses an Apriori-style search technique to detect dense subspaces. The algorithm then finds adjacent dense grid units in each of the selected subspaces using a depth-first search. Clusters are then formed by uniting these units with the help of a greedy growth scheme. The algorithm begins with an arbitrary dense unit and then greedily grows a maximal region in each dimension until the union of all the regions covers the entire cluster. Redundant regions are removed by a repeated procedure.
      
      The region-growing, density-based approach to generating clusters allows CLIQUE to find clusters of arbitrary shape in any number of dimensions. Clusters are found in the same, disjoint, or overlapping subspaces. This is often advantageous in subspace clustering, since clusters often exist in different subspaces and thus represent different relationships.
      
      CLIQUE consists of the following steps:
      1. Identification of subspaces that contain clusters
      2. Identification of clusters
      3. Generation of minimal descriptions for the clusters
      
      Algorithm steps for CLIQUE
      
      Identification of subspaces that are dense
          Finding of dense units
              Find the set D1 of all one-dimensional dense units
              k = 1
              While Dk is not empty
                  k = k + 1
                  Find the set Dk, the set of all k-dimensional dense units all of whose (k-1)-dimensional projections belong to Dk-1
              End while
          Finding subspaces of high coverage
      
      Identification of clusters
          For each high-coverage subspace S do
              Take the set E of all dense units in S
              m = 0
              While E is not empty
                  m = m + 1
                  Select a randomly chosen unit u from E
                  Assign to Cm the unit u and all units of E that are connected to u
                  E = E - Cm
              End while
          End for
      
      Generation of minimal cluster descriptions
          For each cluster C do
              Stage 1
                  c = 0
                  While C is not empty
                      c = c + 1
                      Choose a dense unit in C
                      For i = 1 to l
                          Grow the unit in both directions along dimension i
                      End for
                      Let Rc be the set containing all units covered by the above procedure
                      C = C - Rc
                  End while
              Stage 2
                  Remove all covers whose units are covered by another cover
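The first two parts of the procedure above (finding dense units bottom-up and connecting adjacent dense units into clusters) can be sketched for the two-dimensional case. This is a simplified Python illustration, not the implementation used in the work (which uses R): the function name `clique_2d`, the grid resolution `xi`, the density threshold `tau`, and the sample points are all hypothetical, coordinates are assumed to lie in [0, 1], and the minimal-description stage is omitted.

```python
from collections import Counter
from itertools import product


def clique_2d(points, xi=4, tau=0.2):
    """Grid-and-density clustering of 2-D points with coordinates in [0, 1]."""
    n = len(points)

    def cell(value):
        # Index of the grid interval containing `value` (xi intervals per axis).
        return min(int(value * xi), xi - 1)

    # Step 1a: one-dimensional dense units for each dimension.
    dense_1d = []
    for dim in range(2):
        counts = Counter(cell(p[dim]) for p in points)
        dense_1d.append({c for c, k in counts.items() if k / n > tau})

    # Step 1b: Apriori-style candidates, i.e. 2-D units whose 1-D
    # projections are both dense; keep those that are dense themselves.
    candidates = set(product(dense_1d[0], dense_1d[1]))
    counts = Counter((cell(x), cell(y)) for x, y in points)
    dense_2d = {unit for unit in candidates if counts[unit] / n > tau}

    # Step 2: depth-first search joins dense units that share a face.
    clusters, unseen = [], set(dense_2d)
    while unseen:
        start = unseen.pop()
        cluster, stack = {start}, [start]
        while stack:
            i, j = stack.pop()
            for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if nb in unseen:
                    unseen.remove(nb)
                    cluster.add(nb)
                    stack.append(nb)
        clusters.append(cluster)
    return clusters


points = [
    (0.05, 0.05), (0.10, 0.20), (0.20, 0.10),   # grid unit (0, 0)
    (0.30, 0.10), (0.40, 0.20), (0.45, 0.05),   # grid unit (1, 0)
    (0.80, 0.80), (0.90, 0.85), (0.85, 0.90),   # grid unit (3, 3)
    (0.60, 0.40),                               # isolated point
]
clusters = clique_2d(points)
```

The two adjacent dense units along the bottom of the grid are merged into one cluster, the dense unit in the opposite corner forms a second cluster, and the isolated point is left unclustered.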
    7. Louvain method
    The Louvain method is a simple, effective, and easy-to-implement method for identifying communities in large networks. The method has been used successfully for networks of different types and for sizes ranging up to 100 million nodes and billions of links.
    
    The method consists of two phases.
    1. It looks for "small" communities by optimizing modularity in a local way.
    2. It aggregates nodes of the same community and builds a new network whose nodes are the communities. These steps are repeated iteratively until a maximum of modularity is attained.
    The partition found after the first step typically includes many communities of small size. At subsequent steps, larger and larger communities are found due to the aggregation mechanism. This process naturally leads to a hierarchical decomposition of the network. It is clearly an approximate method, and nothing ensures that the global maximum of modularity is attained, but several tests have confirmed that the algorithm has excellent accuracy and often provides a decomposition into communities whose modularity is close to optimal.
    
    A graph G = (V, E) is created, where V and E are the sets of nodes and edges. Community detection is performed by dividing graph G into clusters C = {V1, V2, ..., Vx}, where each Vi, a set of nodes, is called a community. Figure 4.2 shows a large network of nodes and edges clustered using the Louvain algorithm; the communities are distinguished by node color.
    
    Figure 4.2. Louvain community detection method
    
    Function Louvain (Graph G)
        G' = G
        C = the index of the community of each node of G'
        Initialize each node with its own community
        q = -infinity
        while q < Q(C, G') do
            q = Q(C, G')
            C = MoveNodes(G')          // Phase 1
            G' = Aggregate(G', C)      // Phase 2
            C = put each node in its own community
        End while
        Return G'
    End function
    
    Function MoveNodes (Graph G)
        C = the index of the community of each node of G
        While one or more nodes are moved do
            for random v in V do
                best_q = -infinity
                best_c = community of v
                for all neighboring nodes n of v do
                    gain_q = modularity gain between v and n
                    if best_q < gain_q then
                        best_q = gain_q
                        best_c = community of n
                    end if
                end for
                C = place v in best_c
            end for
        end while
        return C
    end function
    
    Function Aggregate (Graph G, Partition C)
        G' = aggregate nodes which are in the same community based on C
        Return G'
    End function
    
    This is an iterative algorithm that repeats until there is no further improvement in modularity. It begins by initializing every node with its own community. In Phase 1, for every node in the graph, it computes the modularity gain $\Delta Q$ that would result from moving the node into each neighboring community.
    
    $\Delta Q$ indicates the gain in modularity and is defined by
    
    $$\Delta Q = \left[\frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right] \qquad (11)$$
    
    where $\Sigma_{in}$ is the sum of the weights of the links inside the community to which node i is assigned, $\Sigma_{tot}$ is the sum of the weights of the links incident to nodes in that community, $k_{i,in}$ is the sum of the weights of the links from i to nodes in that community, $k_i$ is the sum of the weights of the links incident to node i, and m is the sum of the weights of all links in the network.
    
    In Phase 2, each community is collapsed into a single vertex to create a new graph. Edges internal to a community are collapsed into a single self-loop whose weight is the sum of the weights of all internal edges of that community. Multiple edges between two communities are collapsed into a single edge whose weight is the sum of the weights of the edges between them.
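Phase 1 can be illustrated with a deliberately simplified sketch. It is written in Python for illustration only (the work itself uses R) and makes several assumptions: the graph is unweighted and undirected, nodes are visited in a fixed order rather than randomly, and the modularity is recomputed from scratch for every trial move instead of using the incremental gain formula, which is slower but easier to verify. Phase 2 (aggregation) is not shown, and the function names are hypothetical.

```python
def modularity(adj, community):
    """Modularity Q of a partition of an unweighted, undirected graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # number of edges
    q = 0.0
    for label in set(community.values()):
        members = [v for v in adj if community[v] == label]
        # Each internal edge is seen from both end nodes, hence the division.
        inside = sum(1 for v in members for w in adj[v] if community[w] == label) / 2
        total = sum(len(adj[v]) for v in members)      # sum of degrees
        q += inside / m - (total / (2 * m)) ** 2
    return q


def move_nodes(adj):
    """Phase 1: move nodes to neighboring communities while Q improves."""
    community = {v: v for v in adj}                    # one community per node
    moved = True
    while moved:
        moved = False
        for v in adj:
            best_c, best_q = community[v], modularity(adj, community)
            for n in sorted(adj[v]):                   # fixed order for repeatability
                trial = dict(community)
                trial[v] = community[n]
                q = modularity(adj, trial)
                if q > best_q + 1e-12:                 # accept strict improvements only
                    best_c, best_q = community[n], q
            if best_c != community[v]:
                community[v] = best_c
                moved = True
    return community


# Hypothetical toy graph: two triangles joined by the bridge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
partition = move_nodes(graph)
q_final = modularity(graph, partition)
```

Because every accepted move strictly increases the modularity, the loop terminates. On the toy graph it ends with one community per triangle, and a practical implementation would replace the full recomputation with the incremental gain and then aggregate the communities for the next pass.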
3. Summary
This chapter has described community detection using the CLIQUE and Louvain methods. The CLIQUE detection model finds communities by minimizing dimensionality, whereas the Louvain method finds communities by maximizing modularity. The current work finds that CLIQUE clustering yields better communities than the Louvain method. The results obtained with the CLIQUE and Louvain methods are given in Chapter 5, Results and Discussion.
RESULTS AND DISCUSSIONS
1. Experimental analysis
  
  Both the existing and the proposed work detect community structure from news media and Twitter. The process consists of the following steps:
  
  Step 1: The input news data is downloaded from the BBC news portal (http://www.bbc.com/), and tweets are collected using the Twitter API.
  
  Step 2: Keywords are generated separately for the news and the Twitter data, and the intersection of the two keyword sets is found. The frequency of the words is calculated using TF-IDF.
  
  Step 3: The relationship between keywords is measured using three similarity measures: Dice, Jaccard, and cosine. The keywords are the vertices of the graph, and the similarities are the edges that connect them.
  
  Step 4: The vertices and edges are grouped into clusters using the Girvan-Newman clustering method, and the User Attention (UA) of each resulting cluster is calculated.
  
  Step 5: The graph obtained in Step 4 is given as input to the CLIQUE and Louvain community detection methods.
  
  Step 6: The dense community graph is produced as output.
  
  Step 7: Finally, the edge density and modularity are calculated to evaluate the quality of the community structure.
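The three similarity measures of Step 3 can be sketched as follows. This is an illustration under assumptions, not the authors' code: each keyword is represented by the set of post identifiers in which it occurs, the cosine measure is taken in its binary-vector form, and the `threshold` parameter, the helper `build_edges`, and the sample data are hypothetical.

```python
import math


def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a, b):
    """Cosine similarity of binary occurrence vectors: |A ∩ B| / sqrt(|A| |B|)."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def build_edges(occurrences, similarity, threshold):
    """Return {(term_u, term_v): weight} for keyword pairs whose similarity exceeds `threshold`."""
    terms = sorted(occurrences)
    edges = {}
    for i, u in enumerate(terms):
        for v in terms[i + 1:]:
            score = similarity(occurrences[u], occurrences[v])
            if score > threshold:
                edges[(u, v)] = score
    return edges


# Hypothetical data: keyword -> identifiers of the posts that contain it.
occurrences = {"election": {1, 2, 3}, "vote": {2, 3, 4}, "storm": {5}}
edges = build_edges(occurrences, jaccard, threshold=0.3)
```

In this example only "election" and "vote" co-occur, so the graph has a single edge between them, weighted by their Jaccard similarity; "storm" stays an unconnected vertex.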
2. Comparison result
  
  Different community structures are obtained depending on the number of Twitter and news keywords. Performance is evaluated by increasing the number of news items to check whether the methods still yield good communities for larger graphs. Among the three methods, the CLIQUE method yields the strongest community structure, with high values of the evaluation metrics, even as the number of news items and tweets increases. The results for different numbers of inputs are shown in graphs.
  
  | Algorithm / No. of Twitter (T) and News (N) Posts | 100 N + 400 T | 200 N + 500 T | 250 N + 1000 T |
  |---|---|---|---|
  | Newman Cluster | 3 | 4 | 5 |
  | Louvain Cluster | 3 | 2 | 3 |
  | Clique Cluster | 1 | 24 | 34 |
  
  Table 5.1 lists the number of clusters that each method creates for each combination of news items and tweets.
  
  Figure 5.1. Resultant Graph for Cluster size
CONCLUSION

Classification is applied to automatically analyze the emotional polarity of a text, from which a value for each piece of text is obtained. The absolute value denotes the influential power of the text, and the sign denotes its emotional polarity.

Graph clustering is applied to develop an integrated approach to the cluster analysis of online sports forums. A clustering algorithm groups the forums into clusters, with the center of each cluster representing a hotspot forum within the current time span.

In addition to clustering the forums on data from the current time window, a forecast is made for the next time window. Empirical studies give strong evidence of correlations between post text sentiments and hotspot distributions.

Education institutions, being information seekers, benefit from hotspot prediction approaches in various ways. They follow the same rules as academic objectives and are measurable, quantifiable, and time-specific. In practice, however, the behavior of parents and students is always hard to capture and explore.

Hotspot prediction approaches can help education institutions understand their customers' timely concerns about goods and services. The results of these approaches can be combined with competitor analysis to derive comprehensive decision-support information.