Automated Path Ascend Forum Crawling

Ms. Joycy Joy; Ms. Manju. A

doi:10.17577/IJERTV2IS3641

Volume 02, Issue 03 (March 2013)

Automated Path Ascend Forum Crawling

DOI : 10.17577/IJERTV2IS3641

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 56
Total Downloads : 919
Authors : Ms. Joycy Joy, Ms. Manju. A
Paper ID : IJERTV2IS3641
Volume & Issue : Volume 02, Issue 03 (March 2013)
Published (First Online): 28-03-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Automated Path Ascend Forum Crawling

Ms. Joycy Joy,

PG Scholar Department of CSE, Saveetha Engineering College,Thandalam,

Chennai-602105

Ms. Manju. A,

Assistant Professor, Department of CSE , Saveetha Engineering College,Thandalam, Chennai-602105

AbstractFoCUS (Forum Crawler Under Supervision), is a supervised web-scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with minimal overhead. FoCUS is an automation engine that will dynamically crawl the relevant content in a forum . Forum threads contain information content that is the target of forum crawler. Cleanup of data and moving the contents to the appropriate web pages is the major scope of the project. The content of forum may be the queries asked by the users. After crawling the content, FoCUS will dynamically move the queries in the related forum, which will deal the particular query. Then FoCUS cleanup the unrelated query from the particular forum, and that free space is allocated to new queries posted by user. FoCUS take six path from entry page to thread page. It helps the frequent thread updation in forum. FoCUS makes use the technique called differential content extraction, which helps to maintain a record for already crawled data. In each time FoCUS will not crawl the forum data from the beginning, it will maintain a record of already crawled data and manipulates only the newly posted queries.

Keywords EIT Path, Forum Crawling, ITF Regex, URL Type.

INTRODUCTION

Internet forums [12] (also called web forums) are important services where users can request and exchange information with others. It helps to know users opinion about a product and understand what are their expectations. To harvest knowledge from forums, their content must be downloaded first. A web forum crawler which can collect the forum data automatically according to scheduled time such as once in a week. The collected data will be stored in the database. The data can be used for data mining or social network analysis.

In the existing system iRobot forum crawler is used, which crawl the forum content. It does not deal with the frequent thread updation in forum.iRobots tree like traversal didnt allow more than one path from starting page node to same ending page node. So it takes only one path (first path that is entry-board-thread) from entry to thread page. Here sampling strategy and in formativeness estimation is not robust. Existing system doesnt follow the differential content

extraction. That is it doesnt maintain a record of previously stored data. When new queries are posted by user, the crawler can start the crawling process from the beginning in every time. So it become a time consuming process.The main drawbacks of the existing system are:
- No clear segregation of page identification is carried out
- It takes only one path from entry to thread page.
- It doesnt make use of differential content extraction technique.
FoCUS tried to create an automation engine which will take care of traversing the contents dynamically. Moving towards the hyperlinks related to the forum and cleanup the related links. Integrating the missed out data pages in future were considered as the core proposed approaches included in the system. In our proposed system, we are utilizing the features of differential content extraction instead of an inefficient entire system scanning. This option will enhance the performance of the system very much. The option of differential content is done with the help of page indexes and number of links options or link value. In addition, amend and building the knowledge database enable the system a very efficient one in a longer vision. Scanning the entire web pages through Key match cum KnuthMorrisPratt algorithm is used. The proposed system maintain a record of already crawled data.The six paths from entry to thread page are given below:
1. entry board thread
2. entry list-of-board board thread
3. entry list-of-board & thread thread
4. entry list-of-board & thread board thread
5. entry list-of-board list-of-board & thread thread 6.entry list-of-board list-of-board & threadboard
  
  thread
  
  The main advantages of Focus are given below :
  - Automation web crawling is done with this application.
  - FOCUS takes six paths from entry to thread page.
  - Differential content extraction is used.
The major contributions of this paper are as follows:
1. We create an automatic engine which will crawl the forum pages automatically..
2. Focus make use of the technique called differential content extraction which helps to maintain the record of already crawled data.So the effectiveness becomes increased.
3. Cleanup of data and moving the contents to the appropriate web pages is the major scope of FoCUS.
4. After remove the unrelated links ,FoCUS allocate that space to the newly posted queries.
RELATEDWORK

Vidal Caj.R, Yang. J.M, Lai.W, Wang.Y, and Zhang.L[2]

,iRobot has an intelligence to understand the content and the structure of a forum site, and then decide how to choose traversal paths among different kinds of pages. Furthermore, it also achieve the following advantages: (1) significantly decreases the duplicate and invalid pages;(2) saves substantial network bandwidth and storage as it only fetches informative pages from a forum site;(3) It provides a great help for further indexing and data mining;(4) Effectiveness: it intelligently skip most invalid and duplicate pages, while keep informative and unique ones;(5) Efficiency: iRobot only need a few pages to rebuild the sitemap. It is also have some disadvantages such as; It follow a tree like traversal, so it didnt allow more than one path from starting page to ending page. It doesnt deal how to design a repository for forum archiving.

Wang.Y, Yang.J.-M, Lai.W, Cai.R, Zhang.L, and Ma.W.-Y[5] , Exploring Traversal Strategy is a traversal strategy consists of the identification of the skeleton links and the detection of the page-flipping links. Furthermore, it achieve the following advantages: (1) The skeleton links instruct the crawler to only crawl valuable pages and meanwhile avoid duplicate and uninformative ones;(2) page- flipping links tell the crawler how to completely download a long discussion thread which is usually shown in multiple pages in Web forums. It has some demerits such as it doesnt deal with how to optimize the crawling schedule to incrementally update the archived forum content. And also it doesnt deal how to parse the crawled forum pages to separate replies in each post thread.

Brin.S and Page.L[1] ,Web Search Engine it is a large-scale search engine which makes heavy use of the structure present in hypertext, for example is Google. Google is designed to crawl and index the web efficiently and produce much more satisfying search results than existing systems. It

answers tens of millions of queries every day. This paper provides an in-depth description of large-scale web search engine. It makes heavy use of the structure present in hypertext.But it doesnt deal how to effectively deal with uncontrolled hypertext collections, where anyone can publish anything they want. The technical challenges involved with using the additional information present in hypertext to produce better search results.

Guo.Y, Li. K, and Zhang.K[3] , Board Forum Crawling is a web crawlng method for web forum. This method exploits the organized characteristics of the Web forum sites and simulates human behavior of visiting Web Forums. Board Forum Crawling can crawl most meaningful information of a Web forum site efficiently and simply. Experiments have shown BFC is an efficient and economical method and has been used in a real project. Limiting to the space, the details of the method, such as link clustering based on URL is the main demerit of this paper.
FOCUSA SUPERVISED FORUM CRAWLER
Cleanup of data and moving the contents to the appropriate web pages is the major scope of the project. In this module, the unwanted data will be cleaned up and the forum data will be moved to the web pages according to the technology. The data will be cleaned up and the forums will be moved to the technology in turn, the display of the particular thread should be moved to the appropriate forum. In addition, the knowledge database will get accumulated with lots of knowledgeable information which will be used in future cases of getting more streamlined data processing.
CONCLUSION

FoCUS, a supervised forum crawler was implemented. Focus automatically crawl the forum data and it clean up the unwanted data. FoCUS made use of differential content extraction, which helps to make record of previously crawled data,so it reduce the crawling time for each new crawl. After

cleaning the unwanted data, FoCUS allocated that space to new queries posted by the user.

In future Focus will separate the sparms.Also in future an intimation mail wil send by the user after deleting his irrelevant queries.

REFERENCES

S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, vol. 30, nos. 1-7, pp. 107-117, 1998.
R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang. iRobot: An Intelligent Crawler for Web Forums. Proc. 17th Intl Conf. World Wide Web, pp. 447-456, 2008.
Y. Guo, K. Li, K. Zhang, and G. Zhang. Board Forum Crawling: a Web Crawling Method for Web Forum. Proc. 2006 IEEE/WIC/ACM Intl Conf. Web Intelligence, pp. 475-478, 2006.
C. Gao, L. Wang, C.-Y. Lin, and Y.-I. Song. Finding Question-Answer Pairs from Online Forums. Proc. 31st Ann. Intl ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 467-474, 2008.
Wang.Y, Yang.J.-M, Lai.W, Cai.R, Zhang.L, and Ma.W.-Y. ,Exploring Traversal Strategy for Web Forum Crawling. Proc. 31st Ann. Intl ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 459- 466,2008.
A. Dasgupta, R. Kumar, and A. Sasturkar. De-duping URLs via rewrite rules. Proc. 14th ACM SIGKDD Intl Conf. Knowledge Discovery

and Data Mining, pp. 186-194, 2008.
M. Henzinger. Finding near-duplicate Web pages: a large-scale evaluation of algorithms. Proc. 29th Ann. Intl ACM SIGIR Conf. Research

and Development in Information Retrieval, pp. 284-291, 2006.
H. S. Koppula, K.P. Leela, A. Agarwal, K.P. Chitrapura, S. Garg and

A. Sasturkar. Learning URL Patterns for Webpage De-duplication. Proc. Third ACM Conf. Web Search and Data Mining, pp. 381-390, 2010.
K. Li, X.Q. Cheng, Y. Guo, and K. Zhang. Crawling Dynamic Web

Pages in WWW Forums. Computer Engineering, vol. 33, no. 6, pp. 80- 82,2007.
G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for Web crawling. Proc. 16th Intl Conf. World Wide Web, pp. 141-150, 2007.
M. L. A. Vidal, A. S. Silva, E. S. Moura, and J. M. B. Cavalcanti. Structure- driven Crawler Generation by Example. Proc. 29th Ann. Intl

ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 292-299, 2006.
Internet Forum. http://en.wikipedia.org/wiki/Internet_forum.

Automated Path Ascend Forum Crawling

Leave a Reply