Partitioning Based Web Content Mining

DOI : 10.17577/IJERTV1IS3191


Niki R. Kapadia, ME CSE, GEC Modasa

Kanu Patel, ME CSE, GEC Modasa

Mehul C. Parikh, Assistant Professor, GEC Modasa

Abstract: Today the Web has become the largest information source for people. Most information retrieval systems on the Web treat web pages as the smallest, indivisible units, but a web page as a whole may not be an appropriate unit to represent a single semantic. This dissertation work proposes a web content structure analysis based on visual representation. Many web applications, such as information retrieval, information extraction and automatic page adaptation, can benefit from this structure. Furthermore, a web page often contains multiple topics that are not necessarily relevant to each other, so detecting the semantic content structure of a web page could potentially improve the performance of web information retrieval. This dissertation work presents an automatic, top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on visual perception, and information can be mined dynamically. Compared to other existing techniques, our approach is independent of the underlying document representation such as HTML.

Keywords: Data mining, web mining, web content mining, web structure mining, web usage mining, clustering, segmentation.


    Web mining is the data mining technique that automatically discovers/extracts information from web documents. It is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. The term web data mining refers to a technique used to crawl through various web resources to collect required information, which enables an individual or a company to promote business, understand marketing dynamics, track new promotions floating on the Internet, etc. There is a growing trend among companies, organizations and individuals alike to gather information through web data mining and to utilize that information in their best interest.

    This dissertation work discusses web mining techniques in detail, with results and a comparison, in order to extract the necessary information effectively and efficiently.


    The web mining process is shown in figure 1, and its steps are given below:

    Figure 1 web mining process

    Steps of the web mining process:

    1. Resource finding: the task of retrieving the intended web documents.

    2. Information selection and preprocessing: automatically selecting and pre-processing specific information from the retrieved web resources.

    3. Generalization: automatically discovering general patterns at individual web sites or across multiple sites.

    4. Analysis: validation and interpretation of the mined patterns.
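    As an illustrative sketch only (the dissertation work itself uses PHP and WAMP), the four steps above can be expressed as a minimal Python pipeline. The hard-coded HTML snippet and the helper names are invented stand-ins for a real crawl:

```python
import re
from collections import Counter

# Step 1: resource finding -- in practice this would fetch pages over HTTP;
# here a hard-coded document stands in for a retrieved web resource.
def find_resources():
    return ["<html><body><h1>Web Mining</h1><p>Web mining extracts "
            "patterns from web data. Web data grows daily.</p></body></html>"]

# Step 2: information selection and preprocessing -- strip markup, normalize.
def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]

# Step 3: generalization -- discover a simple pattern (term frequencies).
def generalize(tokens):
    return Counter(tokens)

# Step 4: analysis -- validate and interpret the mined pattern.
def analyze(pattern, top=3):
    return pattern.most_common(top)

docs = find_resources()
freqs = generalize([w for d in docs for w in preprocess(d)])
print(analyze(freqs))
```

    Each function maps to one step of the process; a real system would replace the stand-in fetch with an HTTP crawler and the frequency count with a richer pattern-discovery method.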



    Web mining can be categorized as below.

    Figure 2 Web mining categories

    1. Web Content Mining: Web content mining is the process of extracting useful information from the contents of web documents. It is related to data mining, and to text mining, because much of the web's content is text based. Text mining focuses on unstructured texts, whereas web content mining deals with the semi-structured nature of the web. Technologies used in web content mining include NLP and IR.

    2. Web Structure Mining: Web structure mining tries to discover useful knowledge from the structure of hyperlinks. The goal of web structure mining is to generate a structured summary about websites and web pages. It uses a tree-like structure to analyze and describe HTML or XML documents.

    3. Web Usage Mining: Web usage mining is the process by which we identify browsing patterns by analyzing the navigational behavior of users. It focuses on techniques that can be used to predict user behavior while the user interacts with the web, and it uses secondary data on the web. This activity involves the automatic discovery of user access patterns from one or more web servers. It consists of three phases, namely pre-processing, pattern discovery and pattern analysis. Web servers, proxies and client applications can quite easily capture data about web usage.
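    The three phases of web usage mining can be sketched on a few server-log lines; the log format and field names below are invented for illustration, not taken from any particular server:

```python
from collections import Counter

# Hypothetical access-log lines: "client_ip requested_url".
RAW_LOG = """\
10.0.0.1 /index.html
10.0.0.1 /products.html
10.0.0.2 /index.html
bot      /robots.txt
10.0.0.2 /products.html
10.0.0.1 /products.html
"""

# Phase 1: pre-processing -- parse the fields and drop robot traffic.
def preprocess(raw):
    rows = [line.split() for line in raw.splitlines() if line.strip()]
    return [(ip, url) for ip, url in rows if ip != "bot"]

# Phase 2: pattern discovery -- count accesses per URL.
def discover(entries):
    return Counter(url for _, url in entries)

# Phase 3: pattern analysis -- interpret the discovered pattern.
def analyze(patterns):
    return patterns.most_common(1)[0]

entries = preprocess(RAW_LOG)
patterns = discover(entries)
print(analyze(patterns))   # the most visited page and its access count
```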


    The vision-based content structure of a page is obtained by combining the DOM structure and visual cues. It is a combination of hierarchical and partitioning based methods. The segmentation process is illustrated in figure 3. It has three main steps: (a) block extraction, (b) separator detection and (c) content structure construction. These three steps as a whole are regarded as a round. The algorithm is top-down: the web page is first segmented into several big blocks and the hierarchical structure of this level is recorded; for each big block, the same segmentation process is carried out recursively until we get sufficiently small blocks.

    Figure 3 The vision-based page segmentation algorithm

    1. Visual Block Extraction.

      In this step, we aim at finding all appropriate visual blocks contained in the current sub-page [16]. In general, every node in the DOM tree can represent a visual block. However, some huge nodes such as <TABLE> and <P> are used only for organization purposes and are not appropriate to represent a single visual block. In these cases, the current node should be further divided and replaced by its children. Due to the flexibility of HTML grammar, many web pages do not fully obey the W3C HTML specification, so the DOM tree cannot always reflect the true relationship between different DOM nodes.

      For each extracted node that represents a visual block, its DoC (degree of coherence) value is set according to its intra-block visual difference. This process is iterated until all appropriate nodes are found to represent the visual blocks in the current sub-page.

      Algorithm DivideDomtree(pNode, nLevel)
      {
          IF (Dividable(pNode, nLevel) == TRUE) {
              FOR EACH child OF pNode {
                  DivideDomtree(child, nLevel);
              }
          } ELSE {
              Put the sub-tree (pNode) into the pool as a block;
          }
      }

      Figure 4 The visual block extraction algorithm
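      The recursion of figure 4 can be sketched in Python on a toy DOM tree. The Dividable() test here (recurse into any node with more than one child) is a deliberate simplification: the real VIPS check uses visual cues such as size, background color and tag type:

```python
# A toy DOM node: a tag name plus child nodes.
class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

# Simplified stand-in for VIPS's Dividable() check.
def dividable(node, level):
    return len(node.children) > 1

def divide_dom_tree(node, level, pool):
    if dividable(node, level):
        for child in node.children:
            divide_dom_tree(child, level + 1, pool)
    else:
        # Leaf of the segmentation: keep this sub-tree as one visual block.
        pool.append(node)

# A <BODY> containing a <TABLE> with two cells and a lone <P>.
page = Node("BODY", [Node("TABLE", [Node("TD"), Node("TD")]), Node("P")])
pool = []
divide_dom_tree(page, 1, pool)
print([n.tag for n in pool])   # tags of the blocks extracted into the pool
```

      The organizational <BODY> and <TABLE> nodes are divided into their children, while the indivisible sub-trees end up in the pool as blocks.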

    2. Visual Separator Detection.

      After all blocks are extracted, they are put into a pool for visual separator detection. Separators are horizontal or vertical lines in a web page that visually cross no blocks in the pool. From a visual perspective, separators are good indicators for discriminating different semantics within the page. A visual separator is represented by a start pixel and an end pixel; the width of the separator is calculated as the difference between these two values.

    3. Separator Detection

      The visual separator detection algorithm works as follows. First, the separator list is initialized; it starts with only one separator, whose start pixel and end pixel correspond to the borders of the pool. For every block in the pool, the relation of the block to each separator is evaluated. Finally, the separators that stand at the border of the pool are removed.

      Figure 5 A sample page and the separator detection process [16]

      Take figure 5(a) as an example, in which the black blocks represent the visual blocks in the page. For simplicity we only show the process of detecting the horizontal separators. At first we have only one separator, which spans the whole pool. As shown in figure 5(b), when we put the first block into the pool, it splits the separator into S1 and S2. The same happens with the second and third blocks. When the fourth block is put into the pool, it crosses the separator S2 and covers the separator S3, so the parameters of S2 are updated and S3 is removed. At the end of this process, the separators that stand at the border of the pool are removed.
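      The splitting, updating and removal described above can be sketched for the horizontal case with one-dimensional intervals. The pixel coordinates below are invented purely to reproduce the kind of situation shown in figure 5:

```python
# Horizontal separators and blocks, each given as a (top, bottom) pixel
# interval; coordinates are invented for illustration.
def detect_separators(pool, blocks):
    separators = [pool]                          # start: one separator = the whole pool
    for bt, bb in blocks:
        nxt = []
        for st, sb in separators:
            if bt <= st and bb >= sb:
                continue                         # block covers the separator: remove it
            elif st < bt and bb < sb:
                nxt += [(st, bt), (bb, sb)]      # block inside: split into two separators
            elif st < bt < sb <= bb:
                nxt.append((st, bt))             # block crosses the lower end: shrink
            elif bt <= st < bb < sb:
                nxt.append((bb, sb))             # block crosses the upper end: shrink
            else:
                nxt.append((st, sb))             # no overlap: keep unchanged
        separators = nxt
    # finally remove separators standing at the border of the pool
    return [(st, sb) for st, sb in separators if st != pool[0] and sb != pool[1]]

# Three stacked blocks, then a fourth that crosses one separator and
# covers another, as in the figure 5 walkthrough.
print(detect_separators((0, 100), [(10, 20), (30, 40), (50, 60), (25, 55)]))
```

      After the border separators are dropped, only the separator that genuinely divides two groups of blocks survives.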

    4. Content Structure Construction

    When the separators are detected and their weights are set, the content structure can be constructed accordingly [16]. The construction process starts from the separators with the lowest weight, and the blocks beside these separators are merged to form new blocks. This merging process iterates until the separators with the maximum weight are met.
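    The merging order can be sketched as follows; the block names and separator weights are invented, and real VIPS derives the weights from visual cues rather than taking them as input:

```python
# Blocks laid out top-to-bottom, with a weighted separator between each
# adjacent pair. Merging starts at the lowest-weight separators and stops
# once only the maximum-weight separators remain.
def construct_structure(blocks, weights):
    blocks, weights = list(blocks), list(weights)
    while weights and min(weights) < max(weights):
        i = weights.index(min(weights))
        # merge the two blocks beside the lowest-weight separator
        blocks[i:i + 2] = [(blocks[i], blocks[i + 1])]
        del weights[i]
    return blocks

# Four blocks; the middle separator is the strongest visual divider,
# so the result is a two-level hierarchy split at that separator.
print(construct_structure(["A", "B", "C", "D"], [1, 5, 2]))
```

      The nested tuples play the role of the hierarchical content structure: weak separators disappear inside merged blocks, while the strongest separator remains as the top-level division of the page.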

    After that, each leaf node is checked to see whether it meets the granularity requirement. For every node that fails, we go to the visual block extraction step again to further construct the sub content structure within that node. If all the nodes meet the requirement, the iterative process stops and the vision-based content structure for the whole page is obtained. In summary, the proposed VIPS algorithm takes advantage of visual cues to obtain the vision-based content structure of a web page and thus successfully bridges the gap between the DOM structure and the semantic structure. The page is partitioned based on visual separators and structured as a hierarchy. This semantic hierarchy is consistent with human perception to some extent. VIPS is also very efficient: since we trace down the DOM structure for visual block extraction and do not analyze every basic DOM node, the algorithm is totally top-down.


    We illustrate an example of the vision-based content structure for the web page actually used for the project, covering the IDOM technique, dataset, coding language, database software, etc. We use a web source for mining the web data. The entire process, from extracting the web data to generating the content structure, is discussed in detail.


    1. Web Server and Database

      WAMP, used for the project, is a Windows based program that installs and configures the Apache web server, MySQL database server, PHP scripting language, phpMyAdmin (to manage MySQL databases) and SQLiteManager (to manage SQLite databases). WAMP is designed to offer an easy way to install the Apache, PHP and MySQL package with an easy-to-use installation program, instead of having to install and configure everything yourself.

      MySQL is currently the most popular open source database server in existence. On top of that, it is very commonly used in conjunction with PHP scripts to create powerful and dynamic server-side applications. MySQL has been criticized in the past for not supporting all the features of other popular and more expensive database management systems, yet it has become widely popular with individuals and businesses of many different sizes.

    2. Programming Language-PHP

      PHP is a general purpose server-side scripting language originally designed for web development to produce dynamic web pages [15]. It is one of the first server-side scripting languages to be embedded into an HTML source document, rather than calling an external file to process data. Ultimately, the code is interpreted by a web server with a PHP processor module, which generates the resulting web page. PHP can be deployed on most web servers, and also as a standalone shell on almost every operating system and platform, free of charge. A competitor to Microsoft's Active Server Pages (ASP) server-side script engine and similar languages, PHP is installed on more than 20 million websites and 1 million web servers.

      PHP is an HTML-embedded scripting language. Much of its syntax is borrowed from C, Java and Perl, with a couple of unique PHP-specific features thrown in. The goal of the language is to allow web developers to write dynamically generated pages quickly. When someone visits your PHP webpage, your web server processes the PHP code. It then sees which parts it needs to show to visitors (content and pictures), hides the other parts (file operations, math calculations, etc.) and translates your PHP into HTML. After the translation into HTML, it sends the webpage to your visitor's web browser.

      PHP allows you to:

      Reduce the time needed to create large websites.

      Create a customized user experience for visitors based on information that you have gathered from them.

      Open up thousands of possibilities for online tools. Check out PHP HotScripts for examples of the great things that are possible with PHP.

      Allow the creation of shopping carts for e-commerce websites.

    3. Document Object Model (DOM)

      The Document Object Model is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents. The document can be further processed, and the results of that processing can be incorporated back into the presented page. "Dynamic HTML" is a term used by some vendors to describe the combination of HTML, style sheets and scripts that allows documents to be animated.
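      Because the DOM interface is language neutral, the same access-and-update pattern is available in many languages; as a small stdlib illustration (using Python's xml.dom.minidom rather than the browser or IDOM interface used in the work):

```python
from xml.dom.minidom import parseString

# Parse a small document into a DOM tree.
doc = parseString("<html><body><p>old text</p></body></html>")

# Access: navigate to the first <p> element and read its content.
p = doc.getElementsByTagName("p")[0]
print(p.firstChild.data)

# Update: the same interface lets a program change the content dynamically,
# and the modified tree can be serialized back out.
p.firstChild.data = "new text"
print(doc.toxml())
```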

    4. VIPS Algorithm

      Input: a set of web pages (W) from a given website (screenshot as shown in figure 6) and the maximum number of blocks to output for each web page (N).

      Output: a data set containing plain text as well as tabulated data from the web pages.


      1. Apply the vision based partitioning algorithm using the IDOM technique to segment the web pages (W) into blocks.

      2. Extract the data and store the data set in the database.

      3. Generate cleaned file records.
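      Steps 2 and 3 can be sketched with Python's built-in sqlite3 standing in for the MySQL server used in the actual work; the table name, column names and block contents below are invented for illustration:

```python
import sqlite3

# Blocks as produced by the segmentation step (step 1); contents invented.
blocks = [("B1", "  Product list  "), ("B2", ""), ("B3", "Contact info")]

# Step 2: store the extracted block data in a database
# (an in-memory sqlite3 database stands in for MySQL here).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blocks (id TEXT, content TEXT)")
con.executemany("INSERT INTO blocks VALUES (?, ?)", blocks)

# Step 3: generate cleaned records -- trim whitespace, drop empty blocks.
cleaned = [(bid, text.strip())
           for bid, text in con.execute("SELECT id, content FROM blocks")
           if text.strip()]
print(cleaned)
```

      The cleaned records are what would be exported (for example as *.csv) for later analysis.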




        Figure 6 (a) Website Screenshot (b) Vision-based Content Structure.

    5. Resultant Data Set


    For the analysis of the results, the WEKA tool has been used.

    The Weka Knowledge Explorer is an easy-to-use graphical user interface that harnesses the power of the Weka software. Each of the major Weka packages (Filters, Classifiers, Clusterers, Associations and Attribute Selection) is represented in the Explorer, along with a Visualization tool which allows datasets and the predictions of classifiers and clusterers to be visualized in two dimensions.

    In the WEKA tool, different clustering algorithms, such as hierarchical, K-Means and density based, are analyzed. The times taken by these algorithms are measured and shown in the graphs below.

    Graph-1 Execution time (in seconds) for the various cluster algorithms (hierarchical, K-Means, density based)
    It is easy to mine website data using the PHP scripting language and the WAMP server, and these have been used for the dissertation work. The desired clean data is mined from the website and downloaded in *.csv format from the WAMP server for analysis. The WEKA tool is used for the analysis, applying the different cluster methods. The generated instances and execution times for the different cluster methods show that the hierarchical cluster method takes a higher execution time and produces more instances than the K-Means and density based cluster algorithms.


An approach for extracting web content structure based on visual representation has been proposed. The resulting web content structure is very helpful for applications such as web adaptation, information retrieval and information extraction. By identifying the logical relationships of web content based on visual layout information, the web content structure can effectively represent the semantic structure of the web page. An automatic, top-down, tag-tree independent and scalable algorithm to detect web content structure is presented. It simulates how a user understands the layout structure of a web page based on its visual representation. Compared with traditional DOM based segmentation methods, our scheme utilizes useful visual cues to obtain a better partition of a page at the semantic level. The algorithm has been evaluated manually on a large data set, and also used for selecting good expansion terms in a pseudo-relevance feedback process in web information retrieval, both of which achieve very satisfactory performance.

At present, the developed algorithm is implemented for websites with horizontal separation. Further work will focus on improving the code to handle vertical separation as well. In addition, experiments with the developed code will be carried out on websites from different domains, such as education, shopping and social networking, to check the accuracy of the code and to judge which considerations give better results.

This work will be useful to the computer engineering fraternity for understanding the VIPS algorithm, web content mining through partition based segmentation, and further research directions in this area.

Graph-2 Result of cluster instances (percentage of instances for the various cluster algorithms)


  1. Free encyclopedia. (2012, May 13). Data Mining [Online]. Available: http://en.wikipedia.org/wiki

  2. Bing Liu, Web Data Mining, 1st Edition, Springer, July 2011.

  3. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers, March 2006.

  4. Alex Berson, Stephen Smith, and Kurt Thearling. (2010). An Overview of Data Mining Techniques [Online]. Available: m

  5. Web Mining Techniques, Tools and Data Warehousing. Available:

  6. Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Publishers, Part II.

  7. Margaret H. Dunham, Data Mining, Pearson Education, 2nd Edition, 2007.

  8. m/web-data-mining

  9. Markus Schedl, Peter Knees, Tim Pohle, and Gerhard Widmer, Towards an Automatically Generated Music Information System via Web Content Mining, Information Processing & Management 47 (2011), pp. 426-439.

  10. Lihui Chen, Wai Lian Chue, Using Web structure and summarization techniques for Web content mining, Information Processing & Management 41 (2005), pp. 1225-1242.

  11. Bing Liu, Kevin Chen-Chuan Chang, Editorial: Special issue on Web Content Mining, SIGKDD Explorations, Volume 6, Issue 2.

  12. Jing Li and C.I. Ezeife, Cleaning Web Pages for Effective Web Content Mining, in Proceedings: DEXA, 2006.

  13. Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann Publishers, March 1999.

  14. Ji-Rong Wen, Wei-Ying Ma. (2003). VIPS: a Vision-based Page Segmentation Algorithm [Online]. Available FTP: ftp:// 79.pdf

  15. The PHP Group. (2001). What is PHP? [Online]. Available:

  16. Deng Cai, Shipeng Yu, et al., VIPS: A Vision-based Page Segmentation Algorithm, Redmond, WA 98052, Nov. 1, 2003.
