Literature review on Web Crawling

Sarvesha Chodankar; Amanda Michael; Siddharth Walke; Dr. C.H.Patil

doi:10.17577/IJERTCONV8IS05056

ICSITS - 2020 (Volume 8 - Issue 05)

Literature review on Web Crawling

DOI : 10.17577/IJERTCONV8IS05056

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 641
Authors : Sarvesha Chodankar, Amanda Michael, Siddharth Walke, Dr. C.H.Patil
Paper ID : IJERTCONV8IS05056
Volume & Issue : ICSITS – 2020 (Volume 8 – Issue 05)
Published (First Online): 19-03-2020
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Literature review on Web Crawling

Sarvesha Chodankar

School of Computer Science Dr. Vishwanath Karad MITWPU

Pune, Maharashtra

Siddharth Walke

School of Computer Science Dr. Vishwanath Karad MITWPU

Pune, Maharashtra

Amanda Michael

School of Computer Science Dr. Vishwanath Karad MITWPU

Pune, Maharashtra

Dr. C.H.Patil

School of Computer Science Dr. Vishwanath Karad MITWPU

Pune, Maharashtra

Abstract – This work provides a literature review on Web Crawling. A web crawler is a software program which browses the World Wide Web, creates a list of web pages, and indexes it so that the required information can easily be retrievable. This process is called Web Crawling. This paper includes brief details about web crawling.

Keywords- crawler, Wandex, MOMspider, PageRank, Deep web, focused crawling, path-ascending crawling.

INTRODUCTION

Web Crawlers are also called spiders or bots. Web crawling basically maps the internet as a graph where the nodes are the web pages and the edges are links showing how they are related to each other. Hence they are used by search engines to not only index web pages and to keep database up to date but also tests websites vulnerability.

Web Crawlers can crawl through only public pages on websites and not the private pages which are referred to as dark web.[1]
The search engines highly rely on the web

crawlers because the amount of information on internet is increasing day by day as the new pages are added or updated, to learn about this updates or new data the web crawler constantly crawls, keeps on updating the search engine database and provides better results to search engines.[3]
HISTORY

In 1993, the first crawler named World Wide Web Wanderer was introduced .[2]In the same year more 3 crawlers were introduced named Jump Station, World Wide Web Worm and RBSE spider.[3] Initially these crawlers were used to compute the size of the web ,later they were used to retrieve URLs from the first websearch engine named Wandex.[2] In 1994,WebCrawler and MOMspider were introduced. In just one year the number of pages raised from 110,000 to 2 million. Later a few more crawlers such as Lycos, Infoseek, Excite, AltVista and HotBot became available. In 1998 web crawler Google was introduced which reduced the disk access time through compression and indexing techniques and also used an algorithm named PageRank which calculated the probability of a user visiting a page.
WORKING OF WEB CRAWLER

The very first step of web crawling is to begin with seed URLs(a set of URLs). The Crawler moves from one link to another, extracting new links available in the downloaded pages. Then the retrieved pages are stored in the database in a well indexed manner so that they can easily be retrieved in future. It is checked if URLss related documents are downloaded or not and the process of downloading documents is repeated till all URLs are downloaded.[4].

Before starting with the crawling process the crawler is supposed to read a file called robots.txt which instructs crawler which parts of the website to ignore. In a way robots.txt file provides a way to control what crawler can see on a website. Crawler eventually deletes a page from index when it cant find it however some crawlers check for pages twice in such cases.[2]
The working of web crawler can be defined in steps as follows:

Selecting a starting seed URL or URLS

Fetching the web page corresponding to selected URL Parsing the same web page to find new URL links Adding new URLs to database

Go to step 2 and repeat the process until no more URLs are left.[4]
CHARACTERISTICS OF WEB CRAWLER

Distinguishing and removing duplicate data obtainable from various websites.[6]
Capable of executing on various machines simultaneously.[6] Compatibility with static as well as dynamic pages.[6] Continuous crawling [1]
TYPES OF WEB CRAWLER
1. General-Purpose Web Crawler
  
  The crawlers collect and fetches the entire contents of web and store it in a centralized location so they can be indexed in advance.[2]
2. Incremental Web Crawler
  
  The crawlers keep on updating the set of pages and store their recent copy. The crawler keeps track of available pages, checks if they have changed from last time and updates accordingly.[6]
3. Deep web crawler
  
  There are more than 90% of data on the web that are not even accessible by the crawlers, such type of data is known as deep/dark web. To deal with such data deep web crawlers are designed.[6]
4. Distributed Crawler
  
  The crawling process is divided into multiple processes due to an increase in data on the web. The distributed web crawler multiple crawlers download web pages in parallel and resulted pages are sent to a central indexer where the links are extracted and sent to the crawlers via the URL server.[6]
5. Parallel web crawler
  
  Sometimes it is difficult to retrieve a web page using a single process therefore to get the required page, search engines run multiple processes parallelly. This is done by parallel web crawler.[2]
6. Adaptive Crawler
  
  The adaptive crawler uses previous cycles of crawling to check which pages should be updated. This crawler is called as an incremental type of crawler.[2]
7. Breadth First crawler
The crawling process is done in breadth-first fashion.[2]
ALGORITHM FOR WEB CRAWLING Different Algorithms to handle web crawling procedure efficiently:
1. Focused Crawling:
  
  This algorithm downloads pages that are similar to each other, also known as Topical or Focused crawler. The context of the page is an important aspect for this algorithm. The main task of this algorithm is to predict the similarities between pages before downloading them. The focused crawling algorithm's performance depends mostly on the richness of the links in a specific topic being searched.[5]
2. Path – Ascending Crawling:

Path-ascending crawler helps in finding isolated pages which the regular crawler would have not found. [5] The path ascending crawler ascends to every path in each URL and download all possible documents from website

CONCLUSION

The data on the internet is increasing rapidly and will continue to increase in future. Hence a good crawling algorithm is important which will function rapidly, at the same time will give better results. They are often considered

as the basic component of web services. There are already a number of crawling algorithms in the market used by various search engines. Choosing the right strategies and architecture implementation is a main task in building a web crawler.

REFERENCES

Mini Singh Ahuja, Dr Jatinder Singh Bal, Varnica. Web Crawler: Extracting the web data
S.Amudha. Web Crawler for mining web data
Seyed M. Mirtaheri, Mustafe Wmre Dincturk,Salman Hooshman, Gregor V.

Bochmann, Guy-Vincent Jourdan, A Brief History of Web Crawlers
Md. Abu Kausar, V.S. Dhaka, Sanjeev Kumar Singh. Web Crawler:A Review
Yaduvir Singh Review paper on Web Crawler
Vandana Shrivastava A Methodical Study of Web Crawler.

Literature review on Web Crawling

Leave a Reply