LIMTM: A Framework for Assimilating Link Based Importance in to Semantically Coherent Clusters of Correlated Words

DOI : 10.17577/IJERTCONV3IS12008

Kanagadurga Natarajan

Department of Computer Science and Engineering,

A.V.C College of Engineering, Tamil Nadu, India.

Abstract – As more information becomes available, it is getting harder to find what we are looking for, much like finding a needle in a haystack. In today's world, where most information is stored electronically, new tools that help to organize, search and understand information are the need of the hour. Topic models are used to discover hidden topic-based patterns. Using the discovered topics, a collection of documents can be annotated, and using those annotations, documents can be organized, understood, summarized and searched. Topic modeling has become a well-known text mining method and is widely used in document navigation, clustering, classification and information retrieval. Given a set of documents, the goal of topic modeling is to discover semantically coherent clusters of correlated words known as topics, which can be further used to represent and summarize the content of documents. By using topic modeling, documents can be modeled as multinomial distributions over topics instead of over words. Topics can serve as better features of documents than words because of their low dimension and good semantic interpretability. Topic modeling has become a widely used tool for document management. However, few topic models distinguish the importance of documents on different topics. The LIMTM (Link based Importance into Topic Modeling) framework is used to incorporate link-based importance into topic modeling. Specifically, ranking methods are used to compute the topical importance of documents.

Keywords – Text Corpus, Topic Modeling, Link Based Importance, Ranking, Log Likelihood

  1. INTRODUCTION

    Topic modeling is a widely known text mining method and is conventionally used in document navigation, clustering, classification and information retrieval because of its promising application performance. The objective of topic modeling is to uncover semantically coherent clusters of correlated words, known as topics, from a collection of documents. The discovered topics can be further used to represent and summarize the content of documents. Common topic models include PLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation). With topic modeling, documents can be modeled as multinomial distributions over topics instead of over words. Topics are better features of documents than words because of their low dimension and good semantic interpretability.

    Existing topic models do not explicitly discriminate the importance of documents on different topics. In practical situations, however, documents have different degrees of importance on different topics, so treating them as equally important may hurt the performance of topic modeling. Ranking methods were initially proposed for ranking web pages. Since the concepts and entities in other document domains are similar, these methods can also be used to rank other kinds of documents. To specify the importance of documents on different topics, topical ranking methods can be used; these extend basic ranking algorithms such as PageRank and HITS (Hyperlink-Induced Topic Search). The proposed work incorporates link-based importance into topic modeling.

    A topical ranking method computes the importance scores of documents over topics, which are then leveraged to aid the topic modeling process. The proposed framework is the Link Importance Based Topic Model, denoted LIMTM for short. Compared with existing topic models, LIMTM discriminates the importance of documents while performing topic modeling. The idea behind the method is that more important documents are given more weight than less important ones.

  2. RELATED WORK

    Topic Models

    Prior link-combined topic models can capture the topical correlations between linked documents but do not leverage the topical ranking of documents to aid the topic modeling process. The proposed framework closely resembles the TopicFlow model and differs from it in two respects. First, LIMTM provides a more flexible combination of link importance and topic modeling, while TopicFlow couples the flow network and topic modeling tightly; this makes LIMTM more extensible. Second, LIMTM builds a generalized relation between link importance and topic modeling rather than a hard relation as in TopicFlow.

    Ranking

    The proposed work is closely related to ranking technology. The most common link-based ranking algorithms are PageRank and HITS. Other commonly used ranking algorithms include FutureRank, P-Rank and RankClus. Compared with RankClus, which performs ranking based on hard clustering, LIMTM incorporates link-based importance into topic modeling, which is a soft clustering. Another difference is that RankClus is a clustering algorithm based only on links, while LIMTM is a topic modeling framework based on both links and texts.

  3. PRELIMINARIES

    Topic Modeling

    The objective of topic modeling is to extract conceptually coherent topics that are shared by a set of documents. The LIMTM framework is built on the PLSA topic model and can be regarded as PLSA with an informative prior.

    Given a collection of N documents D, let V be the total number of unique words in the vocabulary and K denote the number of topics. The goal of PLSA is to maximize the likelihood of the collection of documents with respect to the model parameters $\Theta$ and $B$:

    $$P(D \mid \Theta, B) = \prod_{i=1}^{N} \prod_{w=1}^{V} \Big( \sum_{k=1}^{K} \theta_{ik} \beta_{kw} \Big)^{s_{iw}} \tag{1}$$

    where $\Theta = \{\theta_{ik}\}_{N \times K}$ is the topic distribution of documents, $B = \{\beta_{kw}\}_{K \times V}$ is the word distribution of topics, and $s_{iw}$ is the number of times word $w$ occurs in document $i$.

    After inference with PLSA, each topic is represented as a distribution over words, in which the top-probability words form a semantically coherent concept, and each document can be represented as a distribution over the discovered topics.

    Topical Link Importance

    Link importance refers to the global importance of a document computed only from the link structure of the document network. LIMTM adopts topical link analysis, namely topical PageRank and topical HITS.

    As the input of topical PageRank, each document $i$ is associated with a topic distribution $\theta_i$, which can be obtained via topic modeling methods. Taking $\theta_i$ into account, topical PageRank produces an importance vector for each document, in which each element represents the importance score of the document on one topic. Letting $\gamma_{ik}$ denote the importance of document $i$ on topic $k$, topical PageRank is formally expressed as

    $$\gamma_{ik}^{(t)} = \lambda \sum_{j \in I_i} \frac{\alpha\, \gamma_{jk}^{(t-1)} + (1 - \alpha)\, \theta_{ik}\, \gamma_{j\cdot}^{(t-1)}}{|O_j|} + (1 - \lambda) \frac{\theta_{ik}}{M} \tag{2}$$

    where $\lambda$ and $\alpha$ are parameters that control the process of propagating the ranking score, both empirically set to 0.85; $\gamma_{j\cdot} = \sum_{k=1}^{K} \gamma_{jk}$ denotes the global importance of document $j$; $I_i$ is the set of in-link neighbors of document $i$; $|O_j|$ denotes the number of out-link neighbors of document $j$; $\theta_{ik}$ is the topic proportion of document $i$ on topic $k$; and $M$ is the total number of documents.

  4. LIMTM FRAMEWORK

    Relation between link importance and topic modeling

    The link importance can be interpreted as the probability $P(i \mid k)$ of document $i$ being involved in topic $k$ by normalizing the importance vector such that $\sum_{i=1}^{N} P(i \mid k) = 1$ for all $k$. Writing $\gamma_{ki} = P(i \mid k)$ for the normalized importance and using the sum and product rules of probability, the topic proportion $P(k \mid i)$ can be expressed in terms of $\gamma_{ki}$:

    $$\theta_{ik} = P(k \mid i) = \frac{P(i \mid k)\, P(k)}{\sum_{k'=1}^{K} P(i \mid k')\, P(k')} = \frac{\gamma_{ki}\, \pi_k}{\sum_{k'=1}^{K} \gamma_{k'i}\, \pi_{k'}} \tag{3}$$

    where $\pi_k = P(k)$ is the prior probability of topic $k$.

    To reduce the effects of noise, the degree of belief in the link importance is modeled instead of removing the noisy links. A parameter $\xi$ ranging from 0 to 1 is introduced to indicate the belief in the link importance:

    $$\theta_{ik} = P(k \mid i) \propto \pi_k \left[ \xi\, \gamma_{ki} + (1 - \xi)\, \varphi_{ki} \right] \tag{4}$$

    where $\varphi_{ki} = P(i \mid k)$ has the same interpretation as $\gamma_{ki}$.
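To make the topical PageRank recursion and the belief-weighted topic proportions more concrete, the following Python sketch may help. It assumes the recursion has the form $\gamma_{ik} \leftarrow \lambda \sum_{j \in I_i} [\alpha \gamma_{jk} + (1-\alpha)\theta_{ik}\gamma_{j\cdot}]/|O_j| + (1-\lambda)\theta_{ik}/M$ and that topic proportions are proportional to $\pi_k[\xi\gamma_{ki} + (1-\xi)\varphi_{ki}]$. The function names, the dense adjacency matrix, the handling of documents without out-links, and the toy data are illustrative assumptions, not part of the original method.

```python
import numpy as np


def topical_pagerank(adj, theta, lam=0.85, alpha=0.85, n_iter=200, tol=1e-12):
    """Iterate the assumed topical PageRank recursion.

    adj[j, i] = 1 when document j links to document i.
    theta is an N x K matrix of topic proportions (rows sum to 1).
    Returns gamma (N x K), with gamma[i, k] the importance of document i on topic k.
    """
    adj = np.asarray(adj, dtype=float)
    theta = np.asarray(theta, dtype=float)
    n_docs = theta.shape[0]
    out_deg = adj.sum(axis=1)
    # Assumption: a document without out-links propagates no score.
    inv_out = np.where(out_deg > 0, 1.0 / np.maximum(out_deg, 1.0), 0.0)
    gamma = theta / n_docs  # start from the teleport term
    for _ in range(n_iter):
        # sum over in-link neighbors j of gamma_jk / |O_j|
        same_topic = adj.T @ (gamma * inv_out[:, None])
        # theta_ik times the sum over in-link neighbors j of gamma_j. / |O_j|
        global_in = adj.T @ (gamma.sum(axis=1) * inv_out)
        cross_topic = theta * global_in[:, None]
        new_gamma = (lam * (alpha * same_topic + (1 - alpha) * cross_topic)
                     + (1 - lam) * theta / n_docs)
        converged = np.abs(new_gamma - gamma).sum() < tol
        gamma = new_gamma
        if converged:
            break
    return gamma


def topic_proportions(gamma, phi, pi, xi):
    """Row-normalize pi_k * [xi * gamma_ki + (1 - xi) * phi_ki] over topics."""
    gamma_norm = gamma / gamma.sum(axis=0, keepdims=True)  # columns become P(i|k)
    mix = pi[None, :] * (xi * gamma_norm + (1 - xi) * phi)
    return mix / mix.sum(axis=1, keepdims=True)


# Toy example: three documents linked in a cycle 0 -> 1 -> 2 -> 0, two topics.
adj = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
theta0 = np.array([[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]])
gamma = topical_pagerank(adj, theta0)
phi = np.full((3, 2), 1.0 / 3.0)  # hypothetical hidden P(i|k), uniform here
pi = np.array([0.5, 0.5])         # hypothetical topic priors
theta = topic_proportions(gamma, phi, pi, xi=0.5)
print(gamma.round(3))
print(theta.round(3))
```

With every document having at least one out-link, the total importance stays at 1 across iterations, which gives a quick sanity check on the propagation step. Setting `xi` closer to 1 lets the link-based scores dominate the topic proportions, while values near 0 fall back on the hidden variable.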

    LIMTM Framework

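    Different from traditional topic models, the probability $P(i \mid k)$ of a document $i$ being involved in a topic $k$ is governed in LIMTM by a weighted mixture of the topical ranking $\gamma_{ki}$ and the hidden variable $\varphi_{ki}$, so that the effect of link importance is integrated into topic modeling.

    Fig 1. LIMTM Framework

    In LIMTM, the topical link importance $\gamma$ of documents is treated as an observed variable, since it can be obtained by the topical PageRank or topical HITS algorithm, although in an overall view the topical link importance is in fact unknown. By incorporating topical link importance into topic modeling, the link information is naturally taken into account, because the topical link analysis is performed on the link structure.

    In the LIMTM framework, the likelihood of a collection of documents D with respect to the model parameters is

    $$P(D \mid \gamma, \pi, \Phi, B) = \prod_{i=1}^{N} \prod_{w=1}^{V} \Big( \sum_{k=1}^{K} \pi_k \left[ \xi\, \gamma_{ki} + (1 - \xi)\, \varphi_{ki} \right] \beta_{kw} \Big)^{s_{iw}} \tag{5}$$

    where $\xi \in [0, 1]$ is the belief in the link importance, $\pi_k$ is the prior probability of topic $k$, $\beta_{kw}$ is the probability of word $w$ under topic $k$, and $s_{iw}$ is the number of times word $w$ occurs in document $i$.

    Derivation of LIMTM

    Maximum likelihood estimation is adopted to derive the model parameters of LIMTM, and the estimates are obtained with the expectation maximization (EM) algorithm. The logarithm of the likelihood function is

    $$L = \log P(D \mid \gamma, \pi, \Phi, B) = \sum_{i=1}^{N} \sum_{w=1}^{V} s_{iw} \log \sum_{k=1}^{K} \pi_k\, \beta_{kw} \left[ \xi\, \gamma_{ki} + (1 - \xi)\, \varphi_{ki} \right] \tag{6}$$

    In the E-step, the posterior distribution $P(k \mid i, w)$ of topics conditioned on each document-word pair $(i, w)$ is computed by

    $$\psi_{iwk}^{(t)} = P^{(t)}(k \mid i, w) \propto \pi_k^{(t)} \left[ \xi\, \gamma_{ki} + (1 - \xi)\, \varphi_{ki}^{(t)} \right] \beta_{kw}^{(t)} \tag{7}$$

    Then, a lower bound of L can be derived by applying Jensen's inequality twice:

    $$\begin{aligned}
    L &= \sum_{i=1}^{N} \sum_{w=1}^{V} s_{iw} \log \sum_{k=1}^{K} \psi_{iwk}^{(t)}\, \frac{\pi_k\, \beta_{kw} \left[ \xi\, \gamma_{ki} + (1 - \xi)\, \varphi_{ki} \right]}{\psi_{iwk}^{(t)}} \\
    &\ge \sum_{i=1}^{N} \sum_{w=1}^{V} s_{iw} \sum_{k=1}^{K} \psi_{iwk}^{(t)} \log \frac{\pi_k\, \beta_{kw} \left[ \xi\, \gamma_{ki} + (1 - \xi)\, \varphi_{ki} \right]}{\psi_{iwk}^{(t)}} \\
    &\ge \sum_{i=1}^{N} \sum_{w=1}^{V} \sum_{k=1}^{K} s_{iw} \left[ \xi\, \psi_{iwk}^{(t)} \log (\pi_k \beta_{kw} \gamma_{ki}) + (1 - \xi)\, \psi_{iwk}^{(t)} \log (\pi_k \beta_{kw} \varphi_{ki}) \right] - \sum_{i=1}^{N} \sum_{w=1}^{V} \sum_{k=1}^{K} s_{iw}\, \psi_{iwk}^{(t)} \log \psi_{iwk}^{(t)}
    \end{aligned} \tag{8}$$

    In the M-step, the lower bound of L is maximized under the constraints $\sum_{k=1}^{K} \pi_k = 1$, $\sum_{w=1}^{V} \beta_{kw} = 1$ and $\sum_{i=1}^{N} \varphi_{ki} = 1$. Through Lagrange multipliers, the constrained maximization problem is converted to the following:

    $$\max_{\pi, \Phi, B} \; \sum_{i=1}^{N} \sum_{w=1}^{V} \sum_{k=1}^{K} s_{iw} \left[ \xi\, \psi_{iwk}^{(t)} \log (\pi_k \beta_{kw} \gamma_{ki}) + (1 - \xi)\, \psi_{iwk}^{(t)} \log (\pi_k \beta_{kw} \varphi_{ki}) \right] + \eta \Big( \sum_{k=1}^{K} \pi_k - 1 \Big) + \sum_{k=1}^{K} \rho_k \Big( \sum_{w=1}^{V} \beta_{kw} - 1 \Big) + \sum_{k=1}^{K} \sigma_k \Big( \sum_{i=1}^{N} \varphi_{ki} - 1 \Big) \tag{9}$$

    where $\eta$, $\rho_k$ and $\sigma_k$ denote the Lagrange multipliers. The maximization problem has a closed-form solution, which gives the update rules that monotonically increase L:

    $$\pi_k^{(t+1)} = \frac{\sum_{i=1}^{N} \sum_{w=1}^{V} s_{iw}\, \psi_{iwk}^{(t)}}{\sum_{i=1}^{N} \sum_{w=1}^{V} \sum_{k'=1}^{K} s_{iw}\, \psi_{iwk'}^{(t)}}, \qquad
    \beta_{kw}^{(t+1)} = \frac{\sum_{i=1}^{N} s_{iw}\, \psi_{iwk}^{(t)}}{\sum_{i=1}^{N} \sum_{w'=1}^{V} s_{iw'}\, \psi_{iw'k}^{(t)}}, \qquad
    \varphi_{ki}^{(t+1)} = \frac{\sum_{w=1}^{V} s_{iw}\, \psi_{iwk}^{(t)}}{\sum_{i'=1}^{N} \sum_{w=1}^{V} s_{i'w}\, \psi_{i'wk}^{(t)}} \tag{10}$$

    Once the parameter updates converge, the topic proportions $\theta_{ik}$ can be computed with the effect of noisy links reduced.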
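The EM procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes the E-step posterior is proportional to $\pi_k[\xi\gamma_{ki} + (1-\xi)\varphi_{ki}]\beta_{kw}$ and that the M-step updates are the normalized count-weighted sums of those posteriors. The function name `limtm_em`, the random initialization, the fixed iteration count and the toy data are hypothetical choices.

```python
import numpy as np


def limtm_em(counts, gamma, xi=0.5, n_iter=50, seed=0):
    """EM sketch for a PLSA-style model with a link-importance mixture.

    counts: N x V matrix of word counts s_iw.
    gamma:  N x K matrix whose columns sum to 1 (P(i|k) from topical ranking).
    Returns pi (K,), phi (N x K), beta (K x V), theta (N x K), log-likelihoods.
    """
    counts = np.asarray(counts, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    n_docs, n_words = counts.shape
    n_topics = gamma.shape[1]
    rng = np.random.default_rng(seed)
    pi = np.full(n_topics, 1.0 / n_topics)
    phi = rng.random((n_docs, n_topics))
    phi /= phi.sum(axis=0, keepdims=True)    # each topic's column sums to 1
    beta = rng.random((n_topics, n_words))
    beta /= beta.sum(axis=1, keepdims=True)  # each topic's row sums to 1
    eps = 1e-300                             # guards against log(0) and 0/0
    log_liks = []
    for _ in range(n_iter):
        # Mixed P(i|k): belief-weighted ranking plus hidden variable.
        doc_given_topic = xi * gamma + (1 - xi) * phi
        # joint[i, w, k] = pi_k * P(i|k) * beta_kw
        joint = (pi * doc_given_topic)[:, None, :] * beta.T[None, :, :]
        evidence = joint.sum(axis=2)
        log_liks.append(float((counts * np.log(evidence + eps)).sum()))
        # E-step: posterior over topics for every (document, word) pair.
        psi = joint / (evidence[:, :, None] + eps)
        # M-step: normalized sums of s_iw * psi_iwk.
        weighted = counts[:, :, None] * psi
        pi = weighted.sum(axis=(0, 1))
        pi /= pi.sum()
        beta = weighted.sum(axis=0).T
        beta /= beta.sum(axis=1, keepdims=True)
        phi = weighted.sum(axis=1)
        phi /= phi.sum(axis=0, keepdims=True)
    # Topic proportions from the converged parameters.
    mix = pi * (xi * gamma + (1 - xi) * phi)
    theta = mix / mix.sum(axis=1, keepdims=True)
    return pi, phi, beta, theta, log_liks


# Toy corpus: four documents, four words, two topics (hypothetical data).
counts = np.array([[4, 2, 0, 0], [3, 3, 1, 0], [0, 1, 4, 3], [0, 0, 2, 5]])
gamma = np.array([[0.4, 0.1], [0.4, 0.1], [0.1, 0.4], [0.1, 0.4]])
pi, phi, beta, theta, log_liks = limtm_em(counts, gamma, xi=0.5)
print(theta.round(3))
```

The sketch records the log-likelihood at every iteration so that convergence can be inspected; it materializes the full N x V x K posterior array, which is convenient for small examples but would need a sparse formulation for a realistic corpus.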

  5. CONCLUSION

The proposed LIMTM framework incorporates link-based importance into the discovery of semantically coherent clusters of correlated words, known as topics, within a unified framework. LIMTM distinguishes the importance of documents on different topics using the topical PageRank algorithm. The benefit of topical PageRank is reflected in generalization performance, document clustering and classification, topic interpretability, and document network summarization. Topic modeling also provides flexibility, because it can be adapted to specific application requirements.

ACKNOWLEDGEMENT

I am highly indebted and offer my heartfelt thanks to the respected Head of the Department of Computer Science and Engineering, Mr. T. ANAND, M.E., (Ph.D.), who has been a constant source of inspiration. I express my sincere gratitude to Mr. A. ARULMURUGAN, M.Tech., for his valuable ideas, constant mentorship and supportive guidance.

