Exploration of Several Page Rank Algorithms for Link Analysis

Bipin B. Vekariya; Ankit P. Vaishnav

doi:10.17577/IJERTV3IS20186

Volume 03, Issue 02 (February 2014)

Exploration of Several Page Rank Algorithms for Link Analysis

DOI : 10.17577/IJERTV3IS20186

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 71
Total Downloads : 412
Authors : Bipin B. Vekariya, Ankit P. Vaishnav
Paper ID : IJERTV3IS20186
Volume & Issue : Volume 03, Issue 02 (February 2014)
Published (First Online): 11-02-2014
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Exploration of Several Page Rank Algorithms for Link Analysis

Bipin B. Vekariya

Student, Computer Engineering Department, Dharmsinh Desai University,

Nadiad , Gujarat, India.

Ankit P. Vaishnav

Assistant Professor, Computer Engineering Department, Dharmsinh Desai University,

Nadiad , Gujarat, India.

Abstract Web is the largest collection of information and it is growing continuously. Pages& Documents are added and deleted on frequent basis due to dynamic nature of the web. The web is serving as the major source of meaningful information related to query made by the user. The search engine applies different algorithms for link analysis to fetch most relevant pages and documents which are presented at the top of the result list. To assist the users to navigate in the result list, ranking methods are applied on the search results. Most of the ranking algorithms proposed in the literature are PageRank (PR) [1], Weighted PageRank (WPR) [5], Hyperlink-Induced Topic Search (HITS) [4], Page Ranking Algorithm Based On Number Of Visits Of Links Of Web Page [7],An Improved Page Ranking Algorithm Based On Optimized Normalization Technique [6],Improved method for computation of PageRank[8], Weighted Page Ranking Algorithm Based On Number Of Visits Of Links Of Webpage [9].With the understanding that Page Rankings for searching meaningful and relevant information from gigantic collection of web pages and numerous hyperlinks, hence primary purpose of this paper is to come up with the page rank method that gives more relevancy and accurate information to the user. i.e. Weighted Page Ranking Algorithm Based On Number Of Visits Of Links Of Webpage.Above Algorithms are reviewed, theoretically compared & analysed in order to reach a result about the most suitable algorithm for PR by observing the parameters in it.

Keywords Pagerank, HITS, WPR, Pagerank With VOL, Authoritative Web Pages, Web Structure Mining.

INTRODUCTION

WWW is the biggest collection of information. It is the biggest and most widely known information source that is easily accessible and searchable. It consists of billions of interconnected documents called web pages, which are developed by millions of people. World web is growing continuously in last few years in terms of number of web pages. WebPages are added and deleted on frequent basis due to dynamic nature of the web. Figure 1 shows the structure of a classic Web graph.Structure of a classic Web graph can be seen as web pages as nodes, and hyperlinks as edges which connects two related nodes. The web serves as the key resource of meaningful information depending on users request.

Fig.1: Classic Web Structure

Search engine in the simplest manner can be elaborate as a kind of system, which gives response with related information in the form of web pages & documents as per users request. Google, MSN, Yahoo and other search engines keep their billions of web pages indexed & hosted on their servers across the world to serve billions of users. A user create different queries to indicate a search topic and after that in response the search engine identifies several web pages having related content that satisfies users queries, e.g. web pages that match with the key words of users queries. These pages are known as the result set.Quality of search engines can be evaluated not only on the number of pages that are indexed, but also on the usefulness of the ranking process that determines which pages are returned.To fetch web pages the search engine applies different algorithms for link analysis and after that the result is presented in the form of list of web pages. Figure 2 shows mechanism of basic search engine.

Fig.2: Basic Search Engine Mechanism [6]
WEB MINING

Aim of web mining is to discover relevant and meaningful information or knowledge from different resources. These resources mainly involve page content, hyperlink structure and usage data.

Web mining exploits several data mining techniques in order to search and extract information from WWW.

Web mining is classified into the major three categories:
1. WEB CONTENT MINING (WCM)
  
  WCMextracts meaningful, useful and relevant information or knowledge from web page contents like text, image, audio, video, metadata, hyperlinks and extracts useful information. Web content mining can be further categorized into Web page content mining and search results mining.
  
  Web page content mining is conventional searching of web pages with the help of content, while Search results mining is an additional search of pages found from a previous search.
  
  WCM techniques can be exploit as concept hierarchies, synonyms, user profiles and analyzing the links between pages to improve web content mining. Web content mining uses data mining techniques for efficiency, effectiveness and scalability.
2. WEB STRUCTURE MINING
The structure of a typical Web graph consists of Web pages as nodes, and hyperlinks as edges connecting between two related pages.

Web structure mining tries to discover useful knowledge

Many of the researchers have contributed several techniques to serve the said purpose for it. Following are some of the related work accomplished by different researchers groups. In this section several approaches are briefly described. As the primary evaluation parameter of this paper is Page Ranking, only DifferentRanking methods in link analysis is explained. Other operations of Searching are not explained. One can go through the references for complete description of these methods.

A. Page Rank Algorithm [1]
This algorithm was developed by Brin and Page at Stanford University which extends the idea of citation analysis. A clear vision to PageRank is the calculation of impact of research publications in terms of the number of citations they have.

In the Web domain a page has a citation when there is another Web pages linked to it. From the point of view of the cited page this link is called a back-link. PageRank does somewhat more than just totalling back-links. It assigns different weights to back-links. Thus, a Web page can have a high rank if it has many back-links (citations) as well as if it has only few but highly rated back-links.

As shown in below formula u is taken as a webpage and B (u) the set of pages that indicate to u. Then v is taken as a webpage and Nv is the number of links from v. Finally, let c be a normalization factor for the total rank of all Web pages to be constant. Then the rank (simplified PageRank) R of u is computed like this:

from the structure of hyperlinks. It is used to identify the

R u = c

R v

(1)

relationship between Web pages linked by information or direct link connection. This connection allows a search engine to pull data relating to a search query directly to the linking Web page from the Web site the content rests upon.

C. WEB USAGE MINING

Web usage mining exploit user access patterns from web usage logs, which involves progression of finding out what users are looking for on the web. Some users might be looking for only textual data, whereas some others are interested in multimedia. In order to figure out the usage pattern.

Further the rest of paper is organized as follows: section II provides a brief technical description of several Page Rank Algorithms. Section III provides theoretical comparison and analysis of those approaches based on the strength and limitations of each. And finally section IV delivers the conclusions.
TECHNICAL DESCRIPTIO OF SEVERAL PAGE RANK ALGORITHMS

PageRank was designed to increase the effectiveness and improve the efficiency of the search engines. It is used to measure the importance of a page and to prioritize the pages returned from a traditional search engine using keyword searching. Google uses this technique. The PageRank value for a page is calculated based on the number of pages that point to it

vB u Nv

The above formula is recursive for computing page rank of any Web page. Therefore, they started with a vector of ranks initialized to some (arbitrary) values, iteratively update those ranks using above formula and wait until they unite. Experiments show that the full version of PageRank converged after around 50 recursive iterations. Figure 3 is an example of a page rank calculation.

Fig 3: Example of a Page Rank Calculation

There is a little problem with so called rank sinks. Those are closed loops of pages that accumulate rank but never distribute it further, such as in Figure 4. To overcome this issue some modifications of the ranking.

Fig 4: Rank Sink

To overcome this trouble some modifications of the ranking was carried out. If a backlink comes from an important webpage than this link is given higher weightage than those which are coming from non-important webpages.

It has two steps:

Fig. 5: Hubs and Authorities

The link from one webpage to another is considered as a vote. Not only the number of votes that a page receives is important but the importance of webpages that casts the vote is also important.

Page and Brin proposed a formula to calculate the PageRank of a page A as stated below-

PR(A)=(1-d)+d(PR(T1)/C(T1)+..+PR(Tn/C(Tn)) (2)

Here PR(Ti) is the PageRank of the Pages Ti which links to page A, C(Ti) is number of outlinks on page Ti and d is damping factor. Damping factor is used to stop other pages having too much impact. The total vote is damped down by multiplying it to 0.85.
1. Sampling Step: In this step a set of relevant pages for the given query are collected.
2. Iterative Step: In this step Hubs and Authorities are found using the output of sampling step. Following equations are used to calculate the weight of Hub and the weight of Authority.
  
  Here Hq is Hub Score of a page, Ap is authority score of a page, I(p) is set of reference pages of page p and B(p) is set of referrer pages of page p, the authority weight of a page is proportional to the sum of hub weights of pages that link to it. Similarly a hub of a page is proportional to the sum of authority weights of pages that it links to.
  
  The score value of hubs and authorities are calculated as follows:
  
  The PageRank forms a probability distribution over the webpages so the sum of PageRanks of all web pages will be one. The PageRank of a webpage can be calculated without knowing the final value of PageRank of other pages. It is an iterative algorithm which follows the principle of normalized
  
  Ap = qB(p) Hq
  
  Hp = qI(p) Aq
  
  (3)
  
  (4)
  
  link matrix of web. PageRank of a webpage depends on the number of pages pointing to a page.
  
  Figure 6 shows an example of the calculation of authority
  
  and hub scores
  
  B. HITS (Hypertext Induced Topic Search) [4] q1 r1 Jon Klienberg gave two forms of web pages, so called as
  
  hubs and authorities.
  
  Hubs are the webpages that act as resource lists. A good quality hub page is a webpage which is pointing to many
  
  authoritative pages on that content. Authorities are pages q2 p r2
  
  having important contents. A good quality authority page is a page which is pointed by many good hub pages on the same content.
  
  A page may be a good quality hub and a good quality q3 r3 authority at the same time. The HITS algorithm treats WWW
  
  as directed graph G(V,E), where V is a set of vertices representing pages and E is set of edges corresponds to link.
  
  Figure 5 shows the hubs and authorities in web.
  
  Ap=Hq1+Hq2+Hq3
  
  Hp=Ar1+Ar2+Ar3
  
  Fig. 6: A small example of HITS calculations
  
  HITS is a simply link-based algorithm. It is used to ranking the pages which is mainly retrieved from Web and based on their textual contents to a given query. First these
  
  pages have been place together in proper way than the HITS algorithm ignores textual content and focuses itself on the structure of the Web only.
  
  Constraints with HITS algorithm
  
  Following are some constraints of HITS algorithm:
  1. Hubs and Authorities: It is difficult to distinguish among hubs and authorities because many sites are hubs as well as authorities.
  2. Topic drift: Sometime HITS may not give the most relevant documents to the user queries because of equivalent weights.
  3. Automatically generated links: HITS gives equal importance for automatically generated links which may not
    
    D. PAGE RANKING BASED ON NUMBER OF VISIT OF LINKS OF WEBPAGE [7]
    Page Ranking Algorithm based on VOL [7] is another extension to the standard Page Ranking Algorithm [1] and also the user perspective is considered as number of visits of inbound links. In the Page Ranking Algorithm based on VOL, author assigns more weights to the outgoing links which are most frequently visited by the user. In this algorithm it is assigned more rank value to the outgoing links which is most visited by users. In this manner a page rank value is calculate based on visits of inbound links.
    
    The modified version based on VOL is given in the following equation
    
    have relevant topics for the user query.
    
    PR u = 1 d + d Lu (PR v )
    
    (8)
  4. Efficiency: It is not efficient in real time.
  In IBM research project, HITS was used in a prototype search engine called Clever. Due to above constraints HITS
  
  Notations are:
  
  vB(u)
  
  TL (v)
  
  could not be implemented in a real time search engine.
  
  C. WEIGHTED PAGERANKALGORITHM [5]
  
  (m ,n)
  
  This algorithm was proposed by Wenpu Xing and Ali Ghorbani which is an improvement of PageRank algorithm. This Algorithm assigns the rank values to webpages according to their importance rather than dividing it evenly. The importance is assigned in terms of weight values to incoming and outgoing links. This is denoted asWin and
  
  Wout respectively. Win is the weight of link(m,n). It is
  
  Lu denotes number of visits of link which is pointing page u from v.TL (v) denotes total number of visits of all links present on v.Other notations are same as in original PageRank equation.
  
  In order to find the number of visit of the links, client side script is used. For every webpage visited, the client side script is loaded to client side from the web server. Through the script the click and keyboard events are handled. And also in the web search crawler a counter is
  
  raised with every webpage retrieved with respect to any
  
  (m ,n) (m ,n)
  
  calculated on the basis of number of incoming links to page n and the number of incoming links to all reference pages of page m.
  
  user query. Every time a webpage is visited the counter is raised or increased by one and later on fetched to compute the ranking of the webpages. The Page Ranking algorithm based on VOL is less complex than standard PR Algorithm
  
  Win = In
  
  (5)
  [1], and advantageous as it includes the user input for
  
  m ,n
  
  p R m Ip
  
  calculating the rank. More complex when one talk about the design issues and is limited to the inbound links for
  
  In is number of incoming links of page n, Ip is number of incoming links of page p, R(m) is the reference page list of page m.
  
  (m ,n)
  
  Wout is the weight of link(m,n). It is calculated on the basis of the number of outgoing links of page n and the number of outgoing links of all the reference pages of page
  
  m. On is number of outgoing links of page n,
  
  computing the page ranking.
  1. ANIMPROVED PAGE RANKING ALGORITHM BASED ON OPTIMIZED NORMALIZAION TECHNIQUE [6]
    It shows a step extension to the standard PR Algorithm [1] & add the normalization feature to the ranking. It reduces the overall complexity of the ranking process by reducing
    
    Wout = On
    
    (6)
    
    the number of iteration to calculate the page rank. Steps of
    
    m ,n
    
    p R m Op
    
    the Page Ranking Algorithm based on optimized normalization technique are:
    
    Op is number of outgoing links of page p, then the weighted PageRank is given by formula:
    
    (m ,n) m ,n
    
    PR n = 1 d + d mB(n) PR(m) Win Wout (7)
    
    The PR and WPR algorithms both provide ranked pages in the sorting order to users based on the given query. So, in the resultant list, the number of relevant pages and their order are very important for users. Relevance rule is applied for calculation of the relevancy value in each page from the list of pages which made Weighted PageRank different from PageRank.
    
    After study of WPR it is clear that WPR produces larger relevancy values than the PR.
    - Initially assume PageRank of all web pages to be any value, let it be1.
    - First Calculate page ranks of all pages by following equation:-
      
      PR(A) = 0.15 + 0.85 (PR(T1)/C(T1) + PR(T2)/C(T2) +.
      
      + PR(Tn)/C(Tn)) (9)
      
      Where T1 through Tn are pages providing incoming links to Page A,PR(T1) is the Page Rank of T1,PR(Tn) is the Page Rank of Tn,C(Tn) is total number of outgoing links on Tn.
    - And calculate mean value of all page ranks by following equation:
      
      Summation of page ranks of all web pages / number of web pages
    - Now normalize page rank of each page by following formula:
      
      Norm PR (A) = PR (A) / mean value Where Norm PR (A) is Normalized Page Rank of page
      
      A and PR (A) is page rank of page A. and Assign PR(A)= Norm PR (A)
    - Repeat until page rank values of two consecutive iterations are same.
      
      The normalized page ranking is less complex than the standard PR Algorithm as the number of iterations is reduced by using the normalized values hence the time complexity is reduced. In this technique of page ranking the number of iterations are removed which reduces the computational complexity of the algorithm. As it is link based algorithm hence the problem of theme drift exists.
  2. AN IMPROVED METHOD FOR THE COMPUTATION OF PAGERANK [8]
    This paper improved the results of Web search by utilizing the link structure of the web and also given some analysis of topic bias, absolute equalization and emphasis on old web about the algorithm, which influences the ranking quality of websites. In order to solve these problems, it is proposed an improved method for the computation of PageRank on the basis of integrated factors. The experiment shown that improved method outperforms the PR [1] algorithm in the quality of the pages returned.
    
    Authors of this paper used various improvements to existing algorithms, taking topic character, distribution of PageRank value and time factor into consideration; at last, they integrated two improvements to proposed method:
    - Introduced topic Character to Eliminate Topic Bias.
    - Introduced time factor to emphasis on new web. So ultimate improved method for the computation was
      
      N
      
      PR u = dTu [ vB u PR V ( m + (1 m)Wv ) +
      
      v
      
      1 d ] (10)
      
      In the formula, Wv denotes the topic weight between page u and page v, Tu denotes the time factor of page u.
      
      There are topic character and time factor taken into account and method for the computation of PageRank shown more excellent search performance compared to the classic PageRank algorithm.
  3. WEIGHTED PAGE RANK ALGORITHM BASED ON NUMBER OF VOL OF WEBPAGE [7]
  This Algorithmis used to get more relevant information according to users query. Hence, this concept is very useful to display most valuable pages on the top of the result list on the basis of user browsing behaviour, which reduce the search space to a large scale.
  
  The modified version based on WPR(VOL) is given in the following equation
  
  TL (v)
  
  (,
  
  vB(u) (11)
  
  WPR u = 1 d + d L u WPR v )
  
  Where u represents a web page, B(u) is the set of pages that point to u, d is the dampening factor. vol
  
  (u) and vol (v) are rank scores of page u and v respectively, Lu denotes number of visits of link which is pointing page u from v.TL(v) denotes total number of visits of all links present on v.
  
  Visits of links are computed as explained in Page ranking based on number of visit of links of webpage method.
  
  The top returned pages in the result list is supposed to be highly relevant to the user information needs, the ordering of pages using WPR (VOL) is more target- oriented and user cant purposefully increase the rank of a page by visiting the page multiple times because the rank of the page depends on the not on the count of visits on back linked pages.

THEORETICAL COMPARISON & ANALYSIS

Parameters / Algorithm	Page Ranking Algo.[1]	HITS Algo.[4]	Weighted Page Rank Algo. [5]	Page Ranking Based on visit of link of pages[7]	Improved Page Ranking Algo.[6]	Improved method for computation of PR[8]	WPR Algorithm Based On Number of VOL of Webpage[9]
Mining Technique used	Web Structure Mining	Web Structure Mining, Web Content Mining	Web Structure Mining	Web Structure Mining, Web usage mining	Web structure mining	Web Structure Mining, Web Content Mining	Web Structure Mining, Web usage mining
Technique	Based on Web graph of the webpages and uses the inbound links to rank the webpages	Hubs and Authority score is used to rank the webpages	Based on the popularity of Inlinks and Outlinks the page ranking is done	Enhance the standard page ranks by considering the VOL of pages	By mean values of the rank of all pages replaces the ranks of all pages than takes another round & carry on until the value of two consecutive pages becomes equal	Takes topic character and time factor into account to rank the webpages	Improve the WPR by considering the VOL of pages
I/P Parameters	Inbound links of the pages	Content, Back and Forward links	Backlinks, Forwardlinks	Inbound links, outbound links, Count of VOL.	Inbound links	Content, Inbound links of other pages	Backlinks, Forwardlink s, Count of VOL.
Strength	Ranking is done on the basis of importance of the pages	Relevancy of the pages is high	Higher relevancy because popularity of the links is considered	User input is considered therefore the relevancy of the pages is higher	Reduces the computational complexity	Higher relevancy, the basis of topic character & time factor taken	User input is considered so relevancy of the pages is higher
Relevancy of the Pages	Medium	Less than PR	Higher than PR	More than PR	More than other and less to PR	More than PR	More than WPR
Limitations	Ranking is at indexing time, Query dependent, Emphasis on old pages	Query dependent, Topic drift and all links are considered of same important	Theme drift and efficiency problem	Query and user visit dependent, Them drift, emphasis on ol pages	Query dependent, Relevancy is low, Them drift, emphasis on old pages	Query dependent, them drift	Query and user visit dependent, Them drift, efficiency problem

CONCLUSIONS

In todays fast growing world, to get accurate information is more important and needful. On accomplishing theoretical analysis and comparison this paper concludes that Page Ranking needed more powerful algorithm which includes maximum relevancy in result pages as per users nature.

ACKNOWLEDGMENTS

Bipin Vekariya would like to wholeheartedly thank Asst. Prof. Ankit Vaishnav for all his diligence, guidance, encouragement, inspiration and motivation throughout. I would like to thank anonymous reviewers for his valuable suggestions.

REFERENCES

S.Brin and L.Page, The Antonomy of a Large Scale Hyper textual Web Search Engine 7th Int.WWW Conf. Proceedings, Australia, April 1998.
S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg, Mining the Webs link structure. Computer 32(8):6067, 1999.
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604632, September 1999.
C. Ding, X. He, H. Zha, P.Husbands and H. Simon Link Analysis: Hubs and Authorities on the World Technical Report: 47847, 2001.
W.Xing and A.Gorbani, Weighted PageRank Algorithm Proceedings of the Second Annual Conference on Communication Networks and Services Research, May 2004, pp. 305-314.
H. Dubey and Prof. B.N. Roy, An Improved Page Rank Algorithm

based on Optimized Normalization Technique International Journal of Computer Science and Information technologies (IJCSIT), 2011, pp.2183-2188.
G.Kumar, N. Duhan and A.K. Sharma, Page Ranking Based on Number of Visits of Web Pages International Conference on Computer & Communication Technology (ICCCT), 2011, pp. 11- 14.
Wei Huang,Bin Li, An Improved Method for the Computationof PageRank International Conference on Mechatronic Science,

Electric Engineering and Computer August 19-22, 2011
N.Tyagi and S. Sharma,Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page,International Journal of Soft Computing and Engineerig(IJSCE),July 2012.

Exploration of Several Page Rank Algorithms for Link Analysis

Leave a Reply