Comprehensive Review on Web Content Mining

DOI: 10.17577/IJERTV2IS60701




Abstract

The Web is composed of huge and diverse information. Information exists in the form of hyperlinks, structured tables, semi-structured and unstructured text, and multimedia (audio, video, images). The aim of Web mining is to extract useful information from hyperlinks, page contents and usage data. Web mining is categorized into three areas: Web structure mining, Web content mining and Web usage mining. Web content mining extracts useful information from Web page contents and discovers patterns in Web pages. It is based on text mining and information retrieval (IR) techniques and is used to discover what a Web page is about and how to uncover new knowledge from it. Data mining on the Web is broadly divided into Web mining and text mining, because most of the Web is unstructured data, i.e. text, with some structured data, i.e. database tables. Web content mining is therefore also referred to as Web text mining. The technologies used in Web content mining are natural language processing and information retrieval.

  1. Introduction

    The combination of text mining, data mining, information retrieval and machine learning provides the framework for Web content mining. Its basic purpose is to filter information, improve information finding and integrate data on the Web so that more sophisticated queries can be performed on the Internet. In simple terms, it is the mining, extraction and integration of useful data, information and knowledge from Web page contents. It is related to data mining and text mining because the Web consists largely of free text to which data mining techniques can be applied. Content data corresponds to the facts that a Web page was designed to convey to its users.

    It is the process of information or resource discovery from the contents of millions of sources across the World Wide Web. Web pages are written in Hypertext Markup Language (HTML). The processes involved in Web mining are as follows:

    • Retrieval of Web Documents

    • Pre-Processing

    • Discovery of Patterns

    • Analysis

        1. Retrieval

          The process of retrieving the data available on the Internet, such as newsletters, newsgroups, the text content of HTML documents (obtained by removing their tags) and text databases.

        2. Pre-processing

          It is the process of transforming the original data retrieved in the previous stage. Its aim is to convert the information into some logical form. Typical transformations include stemming and stop-word removal.

        3. Pattern Discovery

          Patterns are discovered at individual Web sites using machine learning and data mining techniques, e.g. sequential patterns and association rules. This stage provides the input for the interpretation and validation of the discovered patterns.

        4. Analysis

      It is the interpretation and validation of the mined patterns. The goal of this process is to eliminate irrelevant rules and extract interesting patterns from the pattern discovery output.

      1.1 Web Content Mining View of Data

      Various techniques are applied for Web page representation and content mining. These are as follows:

      1. Pre-processing

      2. Document Representation

      3. Web Mining Operations.

  2. Pre-Processing

    Pre-processing filters the data retrieved from the Web through navigation and browsing and makes it suitable for mining.

    2.1) Pre-processing for Text

    Most of the Web consists of unstructured data, and several techniques are used to transform the original text:

    • Perceptual grouping
    • Tokenization
    • POS tagging
    • Parsing and other NLP tasks
    • Stemming
    • Stop-word elimination and filtering
    • Latent semantic indexing

    1. Perceptual Grouping extracts document-level fields such as author and title and labels text zones. It is also used to extract internal text zones such as paragraphs, columns or tables.

    2. Bag of words takes single words and discards their order. The result for each word can be Boolean, i.e. whether it is present or not. Each document is treated as a bag of words or terms, whose semantics help to capture the document's themes.

    3. Tokenization means breaking the text into sentences and words. This is done at different levels, and the resulting units are referred to as tokens, i.e. the character stream is divided into meaningful constituents.

    4. Part of Speech Tagging divides words into categories based on their role in the sentence. It provides information about the semantic content of a word by determining the part-of-speech tag, e.g. noun, verb or adjective, for each term.

    5. Stop word elimination and filtering means removing words that bear little or no content information, like articles, prepositions and conjunctions. In filtering, common words or words that occur too often are eliminated.

    6. Stemming reduces words to a common root, e.g. analysis, analyze, analyzing -> analy. In stemming we remove suffixes that appear to be identical, and a database of the words and their relationships can be created.

    7. Parsing produces a parse tree of a sentence, which helps in defining the relation of one word to another and its function in the sentence.

    8. Latent Semantic Indexing represents a document by its terms and associated weights:

    D = {(t1, w1), (t2, w2), ..., (tn, wn)}

    where t (term) is a keyword or content descriptor and w (weight) is a measure of the importance of the term.

    Term frequency reflects how strongly repeated words are related to the content, and on this basis an index over the terms is maintained, known as latent semantic indexing. This is an efficient approach, and low-frequency words can be dropped.
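    The sketch below is a minimal illustration of these text pre-processing steps (tokenization, stop-word removal, a crude suffix-stripping stemmer and term-frequency weighting); the stop-word list and suffix list are small illustrative assumptions, not a standard resource.

    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "is", "are", "to", "from"}  # illustrative subset
    SUFFIXES = ("zing", "sis", "ze", "ing", "es", "s")  # crude suffix list, for illustration only

    def tokenize(text):
        # Break the character stream into lower-case word tokens.
        return re.findall(r"[a-z]+", text.lower())

    def remove_stop_words(tokens):
        # Drop words that bear little or no content information.
        return [t for t in tokens if t not in STOP_WORDS]

    def stem(token):
        # Reduce a word to a common root by stripping a known suffix.
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(document):
        # Return term -> frequency weights for one document.
        terms = [stem(t) for t in remove_stop_words(tokenize(document))]
        return Counter(terms)

    print(preprocess("Web content mining is the mining of useful content from web pages."))
    # e.g. Counter({'web': 2, 'content': 2, 'min': 2, 'useful': 1, 'pag': 1})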

      2.2) Pre-processing for Semi-structured Data

        Besides text, Web pages also contain hyperlinks and anchor texts, which do not exist in free text. Some additional pre-processing techniques are therefore followed for HTML documents.

        • Identification of Different Text Fields – In HTML we have different text fields, e.g. title, metadata and body. The title has a higher weight than the other fields because it is a short description of the page. Other fields also have weights, and the terms in each field are treated differently.

        • Identification of Anchor Text – Anchor text associated with a hyperlink is treated specially in search engines because the anchor text often represents a more accurate description of the information contained in the page pointed to by its link.

    • Removal of HTML tags- The removal of HTML tags can be dealt with similarly to punctuation. One issue needs careful consideration, which affects proximity queries and phrase queries.

    • Identification of the Main Content Block – Web pages contain information such as advertisements and banners that is not main content. Two techniques are used to find the main content block.

    1. Partitioning based on visuals: This method uses visual information to help find main content blocks in a page. Visual or rendering information for each HTML element in a page can be obtained from the Web browser. For example, Internet Explorer provides an API that can output the X and Y coordinates of each element. A machine learning model can then be built based on the location and appearance features for identifying the main content blocks of pages.

    2. Tree matching: HTML has a nested structure, so a tag tree is built for each page. With this technique we can find hidden templates; main content blocks are the parts that differ across Web pages built from the same template.

  3. Web Page Document Representation

      1. Graph Representation of a Document – Web document content is represented in the form of a graph. The most frequently occurring terms are represented as nodes of the graph, and edges represent the links between terms. Stop-word removal and stemming are applied before the vertices are created, so each vertex represents a unique word.

        Figure 3.1: Graph Representation of a Document

        |G| = |V| + |E|

        where |G| is the size of the graph, V is the set of vertices and E is the set of edges.
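        The following small sketch illustrates this representation under simple assumptions: vertices are the unique terms left after pre-processing, an edge links each term to the term that follows it in the text, and the graph size is computed as |G| = |V| + |E|.

        def document_graph(tokens):
            # Unique terms are vertices; an edge joins each term to the term
            # that immediately follows it in the document.
            vertices = set(tokens)
            edges = set()
            for current_term, next_term in zip(tokens, tokens[1:]):
                if current_term != next_term:
                    edges.add((current_term, next_term))
            return vertices, edges

        tokens = ["web", "content", "mining", "discovers", "patterns", "web", "pages"]
        V, E = document_graph(tokens)
        print("|V| =", len(V), "|E| =", len(E), "|G| =", len(V) + len(E))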

        2. Traditional Representation using the Vector Space Model – Relevance ranking of documents is obtained with this model. A document is represented by a weight vector in which a weight is associated with each term.

          Term frequency TFi denotes the number of times term ti appears in document D.

          Normalized term frequency: NFi = TFi / max(TF1, TF2, ..., TFV)

          where V is the size of the vocabulary, i.e. the number of distinct terms in the collection.

          As we know, Web pages also contain hyperlinks and other tags, so their document representations differ from those of free-text documents.

        3. Standard Graph Representation of a Web Page – The sections defined for representing an HTML document are:

          • Title, which contains the text related to the document's title and any provided keywords (meta-data)

          • Link, which is the text that appears in clickable hyperlinks in the document

          • Text, which comprises any of the readable text in the document (this includes link text but not title and keyword text).

          In the figure, ovals indicate terms and edges are labeled Title (TI), Link (L) or Text (TX).

          Figure 3.2: Standard Graph Representation

          Web Document Data Sets – HTML documents are organized into data sets, which are useful for representing them as graphs; multiple classes represent the various groupings that help in identifying the content of the documents.

      4. DOM Tree Representation – A well-formed HTML document is a properly nested hierarchy of regions that is represented by a tree-structured Document Object Model, or DOM. In a DOM tree, internal nodes are elements and some of the leaf nodes are segments of text. Other leaf nodes are hyperlinks to other Web pages, which can in turn be represented by DOM trees. In a DOM tree the BODY element of the HTML document is the root of the tree. In the figure, the rectangular boxes represent tag nodes and the shaded rectangles show the actual content. For example:

        <BODY bgcolor=RED>

        <TABLE width=800 height=200>

        .

        </TABLE>

        <IMG src="sun.gif" width=800>

        <TABLE bgcolor=YELLOW>

        </TABLE>

        </BODY>

      Figure: DOM Tree Representation
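      A minimal sketch of building such a tag tree with Python's standard html.parser follows; the Node structure is a simplified stand-in for a full DOM implementation.

      from html.parser import HTMLParser

      class Node:
          def __init__(self, tag, attrs=None):
              self.tag = tag                 # element name, e.g. "body", "table"
              self.attrs = dict(attrs or [])
              self.children = []             # child elements and text segments

      class TagTreeBuilder(HTMLParser):
          # Builds a simplified DOM-like tag tree for a page.
          def __init__(self):
              super().__init__()
              self.root = Node("document")
              self.stack = [self.root]

          def handle_starttag(self, tag, attrs):
              node = Node(tag, attrs)
              self.stack[-1].children.append(node)
              self.stack.append(node)

          def handle_endtag(self, tag):
              if len(self.stack) > 1:
                  self.stack.pop()

          def handle_data(self, data):
              text = data.strip()
              if text:                       # leaf node holding a text segment
                  self.stack[-1].children.append(text)

      def show(node, depth=0):
          # Print the tree with indentation reflecting nesting depth.
          if isinstance(node, str):
              print("  " * depth + repr(node))
          else:
              print("  " * depth + node.tag)
              for child in node.children:
                  show(child, depth + 1)

      builder = TagTreeBuilder()
      builder.feed("<BODY bgcolor=RED><TABLE><TR><TD>Main content</TD></TR></TABLE></BODY>")
      show(builder.root)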

    If we view a website as a database, then we have one more representation that simplifies its overall structure and gives a precise overview of it: a database with various relationships among various objects and interlinks.

      5. Object Exchange Model (OEM) – This model was proposed for modeling semi-structured data. It represents semi-structured data as a labeled graph with objects as vertices and labels on the edges. Each object in OEM has the following structure:

        Label – what the object represents

        Type – the data type of the object's value; it can be atomic, i.e. string, number, etc.

        Object-id – a unique identifier

        The unique identifier identifies each object; labels connect objects to their sub-objects and describe the relationships among them. OEM can also be treated as a graph where the nodes are objects and the labels are on the edges. For the extraction of structures this model can produce the most specific schema. Schematic information is embedded in the labels, making it a self-describing model. Every object has a value that is either atomic, e.g. string, html or gif, or complex, in the form of references to other objects.

      Figure: Object Exchange Model
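      A small sketch of how an OEM object might be modeled is given below, assuming a minimal structure of object-id, type, value and labeled edges to sub-objects; the field names follow the description above and are not a standard API.

      from dataclasses import dataclass, field
      from typing import Dict, Optional, Union

      @dataclass
      class OEMObject:
          oid: str                                   # unique object identifier
          type: str                                  # "string", "number", "html", "gif", "complex", ...
          value: Optional[Union[str, int]] = None    # atomic value, or None for a complex object
          children: Dict[str, "OEMObject"] = field(default_factory=dict)  # label -> sub-object

          def add(self, label, child):
              # Attach a sub-object via a labeled edge describing the relationship.
              self.children[label] = child
              return child

      site = OEMObject("o1", "complex")
      page = site.add("homepage", OEMObject("o2", "complex"))
      page.add("title", OEMObject("o3", "string", "Comprehensive Review on Web Content Mining"))
      page.add("logo", OEMObject("o4", "gif", "logo.gif"))
      print(site.children["homepage"].children["title"].value)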


  4. Web Mining Operations

These operations are used to discover patterns using various predictive and descriptive algorithms and techniques. They consist of various mechanisms for discovering patterns of concept occurrence within a given document collection or a subset of a document collection.

    4.1 Classification or Text Categorization – Given a set of categories and a collection of text documents, this is the process of finding the correct topic for each document. It is also known as supervised learning or inductive learning in machine learning.

      A four-step scenario is applied to unstructured documents in supervised learning:

      1. Data Collection

      2. Building the Model

      3. Testing and Evaluating the Model

      4. Using the model to classify new documents

      There are several types of supervised learning tasks. This section discusses supervised learning techniques for text and hypertext.

          1. Nearest Neighbor Algorithm – A non-parametric method that decides the class of a document D from its similarity to previously labeled training documents. The proportion of neighbors having the same class may be taken as an estimator of the probability of that class, and the class with the largest proportion is assigned to document D.
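          A minimal sketch of such a k-nearest-neighbor text classifier follows, under common assumptions (term-frequency vectors, cosine similarity, majority vote among the k most similar training documents); the training examples are invented for illustration.

          import math
          from collections import Counter

          def cosine(a, b):
              # Cosine similarity between two term-frequency dictionaries.
              dot = sum(a[t] * b[t] for t in set(a) & set(b))
              norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
              return dot / norm if norm else 0.0

          def knn_classify(doc, training, k=3):
              # Assign the class with the largest proportion among the k nearest neighbors.
              vector = Counter(doc.lower().split())
              neighbors = sorted(training,
                                 key=lambda ex: cosine(vector, Counter(ex[0].lower().split())),
                                 reverse=True)[:k]
              votes = Counter(label for _, label in neighbors)
              return votes.most_common(1)[0][0]

          training = [("web mining extracts patterns from pages", "web"),
                      ("hyperlink structure of web pages", "web"),
                      ("clustering groups unlabeled documents", "ml"),
                      ("supervised learning needs labeled training data", "ml")]
          print(knn_classify("mining patterns from web documents", training, k=3))   # expected: web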

          2. Naïve Bayes Algorithm – This approach assumes that we can compute or estimate the distribution of terms within the documents assigned to each category. First, the probability of term occurrence given a category is transformed into the probability of a category given a term occurrence:

            P(Ci | D), the probability of category Ci given the terms of document D, is obtained by Bayes' rule:

            P(Ci | D) = P(D | Ci) P(Ci) / P(D)
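            A minimal sketch of a Naïve Bayes text categorizer follows, assuming scikit-learn is available; the tiny training corpus and category names are invented for illustration.

            from sklearn.feature_extraction.text import CountVectorizer
            from sklearn.naive_bayes import MultinomialNB

            train_docs = ["web mining extracts patterns from web pages",
                          "hyperlinks and anchor text describe web pages",
                          "image and video features support multimedia retrieval",
                          "audio pitch and loudness are perceptual features"]
            train_labels = ["web", "web", "multimedia", "multimedia"]

            vectorizer = CountVectorizer()            # bag-of-words term counts
            X_train = vectorizer.fit_transform(train_docs)

            model = MultinomialNB()                   # estimates P(term | category) and P(category)
            model.fit(X_train, train_labels)

            X_new = vectorizer.transform(["pitch features of an audio clip"])
            print(model.predict(X_new)[0])            # expected: multimedia
            print(dict(zip(model.classes_, model.predict_proba(X_new)[0])))   # P(Ci | D) estimates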

          3. Decision Trees construct a tree that incorporates the feature tests needed to discriminate between objects of different classes. Leaf nodes represent the categories non-uniquely, because more than one leaf node can represent the same category. Decision trees work best with large data sets and are one of the most popular machine learning techniques currently used. A decision tree is a probabilistic classifier: its per-class confidence represents a probability distribution. Decision trees are easy to interpret; however, they require a number of model parameters that are usually hard to find, and error estimates are difficult. A decision tree consists of a series of simple decision rules, often presented in the form of a graph. In the representation below, the decision tree tells us that if the term 'mining' is present more than once then the term 'unstructured' should also be present, and if it is present only once then the term 'structured' should be present.

            Figure: Decision Tree Representation

          4. Decision lists are strictly ordered and contain only Boolean values. A decision list for a document, D, with respect to a category, C, is essentially a list of rules of the form

      if w1 ∈ D and ... and wn ∈ D then D ∈ C
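      A small sketch of evaluating such a decision list follows: rules are checked in order, and the first rule whose terms all occur in the document assigns the category; the rules and categories are invented for illustration.

      def classify_with_decision_list(document, rules, default="other"):
          # rules: ordered list of (required_terms, category); the first matching rule wins.
          words = set(document.lower().split())
          for required_terms, category in rules:
              if all(term in words for term in required_terms):   # w1 in D and ... and wn in D
                  return category
          return default

      rules = [({"mining", "unstructured"}, "web content mining"),
               ({"mining", "structured"}, "data mining"),
               ({"hyperlink"}, "web structure mining")]
      print(classify_with_decision_list("mining unstructured text on the web", rules))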

      Finally, we turn to a discussion of techniques to address supervised learning for hypertext. Apart from plain text, HTML, the dominant form of hypertext on the Web, contains many different kinds of features.

      Web Page Classification – Here, hierarchical Web page categorization is used for the automatic classification of Web pages. This is useful because search can be restricted to the pages covering particular topics. It constrains the number of documents belonging to a particular category; if that number is exceeded, the category is further divided into sub-categories. Another feature of the problem is the hypertextual nature of the documents. Web documents contain links, which may be important sources of information for the classifier because linked documents often share semantics. This can be handled by using a separate classifier at every branching point of the hierarchy.

      Relational Learning – In traditional methods the HTML structure and interlinking are ignored. Hyperlink-based classification can be implemented by using graph models (e.g., subgraph isomorphism). It is also known as first-order learning or inductive logic programming, which uses the language of logic programming (Prolog) as a representation language for learning.

      Suppose we have documents D and terms t. If term t occurs in document D, then contains(D, t) is true.

      Background knowledge is used by relational learning to derive clauses.

      classA(D) :- contains(D, Art)
      classB(D) :- contains(D, Biology)

      It means if Document D contains Art then it belongs to class A. If D contains Biology then it belongs to class B.

      Rule Induction – This is based on the algorithmic approach of relational learning and is also known as a first-order logic algorithm. The background knowledge contains positive examples, from which the algorithm generates negative examples using the closed-world assumption. The positive and negative examples covered by a candidate rule are counted and used for its evaluation; an information-gain evaluation selects the best literal.

      Hyperlink Ensembles – The pages to be classified are predecessor pages of target pages, and they may contain better information. In this approach, separate predictions are made from the various predecessor pages so that the feature sets are treated differently. A page is represented by a set of instances, where each instance consists of the features of a predecessor page and the class label of the target page. A classifier is then able to predict a class for the links between pages, and all predictions involving the same target page are combined so that a single class is predicted for it.

    4.2 Clustering is a process through which objects are classified into groups called clusters. It is also known as unsupervised learning. The problem is to group an unlabeled collection into meaningful clusters. Clustering tasks in text analysis include improving search recall, improving search precision and query-specific clustering. The utility of clustering for text and hypertext information retrieval lies in the so-called cluster hypothesis.

      1. K-means Algorithm – Also known as a partitional clustering algorithm, it proceeds as follows (see the sketch after these steps):

        k seeds form the cores of k clusters (C1, C2, ..., Ck) over a collection of vectors (x1, x2, ..., xn).

        Every vector is assigned to the cluster of its closest seed, and the centroid of each cluster is recomputed.

        The algorithm converges when no more changes occur.
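        A compact sketch of these steps over plain numeric vectors follows (seed selection, assignment of every vector to the closest seed, centroid recomputation, stop when nothing changes); the sample points are invented for illustration.

        import math
        import random

        def distance(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

        def centroid(vectors):
            return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))

        def k_means(vectors, k, seed=0):
            random.seed(seed)
            centers = random.sample(vectors, k)          # k seeds form the cores of k clusters
            while True:
                clusters = [[] for _ in range(k)]
                for v in vectors:                        # assign each vector to its closest seed
                    clusters[min(range(k), key=lambda i: distance(v, centers[i]))].append(v)
                new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
                if new_centers == centers:               # converged: no more changes
                    return clusters
                centers = new_centers

        points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
        for cluster in k_means(points, k=2):
            print(cluster)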

      2. Probabilistic Algorithm – The basic idea is to assign probabilities for the membership of a document in a cluster. Each document can belong to more than one cluster according to its probability of belonging to each cluster. The probability or likelihood of each object x is computed, and then all objects are relabeled according to the computed probabilities.

      3. Hierarchical Clustering – Trees of clusters, known as dendrograms, are formed. To merge or split subsets of points rather than individual points, the distance between individual points has to be generalized to the distance between subsets. Such a derived proximity measure is called a linkage metric; it describes the concepts of closeness and connectivity. Linear algebra methods based on singular value decomposition (SVD) are used for this purpose in collaborative filtering and information retrieval. Applying SVD to hierarchical divisive clustering of document collections resulted in the PDDP (Principal Direction Divisive Partitioning) algorithm.

      4. Partitioning Relocation Clustering – Relocation schemes iteratively reassign points between k clusters. One approach to data partitioning is to take a conceptual point of view that identifies a cluster with a certain model whose unknown parameters have to be found. More specifically, probabilistic models assume that the data comes from a mixture of several populations whose distributions and priors we want to find.

      5. Fuzzy Clustering – Fuzzy clustering approaches, on the other hand, are non-exclusive, in the sense that each document can belong to more than one cluster. Fuzzy algorithms usually try to find the best clustering by optimizing a certain criterion function. The fact that a document can belong to more than one cluster is described by a membership function, which calculates for each document a membership vector in which the i-th element indicates the degree of membership of the document in the i-th cluster.

        Other approaches to clustering for hypertext or semi-structured data are described below.

      6. Suffix Tree Clustering – Suffix Tree Clustering (STC) is a linear-time clustering algorithm based on identifying the phrases that are common to groups of documents. A phrase is an ordered sequence of one or more words, and a base cluster is a set of documents that share a common phrase. The algorithm was developed by Zamir and Etzioni in 1998. It has three steps: document cleaning, identification of base clusters using a suffix tree, and combination of the base clusters into clusters. First, HTML tags are marked and stripped. The suffix tree is then generated in time linear in the size of the document collection; each node of the tree represents a base cluster of documents that share a common phrase. Finally, base clusters are merged to avoid overlap.

      7. Clustering based on Structure – The syntactic structure of Web pages is represented by abstract syntax trees. The distance between two trees is determined by the tree edit distance, and pages at minimum distance are clustered using an agglomerative algorithm.

      8. Clustering based on Connectivity – The hyperlinks connecting the Web pages are used as entity descriptions. When a group of pages is highly connected by hyperlinks, the related cohesion metric is given a high score. Clusters are determined so as to maximize the quality of the clustering metric.

      9. Clustering based on Keywords -The content of the Web pages is analyzed in order to determine a set of keywords that characterize each of them. Two pages are clustered together when they share a large number of keywords

      10. C-LINK Clustering – Similar pairs are generated as edges in a graph whose nodes represent URLs. This approach forms clusters in such a way that each cluster has a centre node and all other nodes are close enough to it. It is used to form flat clusters.

    4.3 Associations

Association mining finds information from a collection of indexed documents using association rules. Its objective is to find all co-occurrence relationships, called associations, among data items. Let Wi and Wk be sets of keywords.

Support – the percentage of transactions (documents) that include the keywords; it can be seen as an estimate of a probability:

Support(Wi -> Wk) = count(Wi ∪ Wk) / (number of documents)

Confidence – the percentage of transactions that contain Wi and also contain Wk:

Confidence(Wi -> Wk) = count(Wi ∪ Wk) / count(Wi)

The association rule mining problem is broken into two steps:

  1. Generate all keyword sets whose support is greater than the user-specified minimum support (called minsup). Such sets are called frequent keyword sets.

  2. Use the identified frequent keyword sets to generate the rules that satisfy a user-specified minimum confidence (called minconf). Frequent keyword set generation requires more effort, while rule generation is straightforward.

These rules for hypertext identify groupings between sets of items with some minimum specified confidence and support, as in the sketch below.
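A minimal sketch of the two steps for keyword pairs only (frequent keyword sets by minimum support, then rules filtered by minimum confidence); the thresholds and the tiny document collection are illustrative assumptions.

from itertools import combinations

def mine_keyword_rules(documents, minsup=0.5, minconf=0.6):
    # documents: list of keyword sets; returns rules (Wi -> Wk, support, confidence).
    n = len(documents)
    count = lambda itemset: sum(1 for d in documents if itemset <= d)

    # Step 1: frequent keyword pairs (support >= minsup).
    keywords = {w for d in documents for w in d}
    frequent_pairs = [frozenset(p) for p in combinations(sorted(keywords), 2)
                      if count(frozenset(p)) / n >= minsup]

    # Step 2: generate rules Wi -> Wk that satisfy minconf.
    rules = []
    for pair in frequent_pairs:
        for wi in pair:
            wk = (pair - {wi}).pop()
            support = count(pair) / n
            confidence = count(pair) / count(frozenset({wi}))
            if confidence >= minconf:
                rules.append((wi, wk, support, confidence))
    return rules

docs = [{"web", "mining", "content"}, {"web", "mining"}, {"web", "usage"}, {"text", "mining"}]
for wi, wk, sup, conf in mine_keyword_rules(docs):
    print(wi, "->", wk, " support =", round(sup, 2), " confidence =", round(conf, 2))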

(4.3.1) Extended Concept Hierarchy – Concept hierarchies define a sequence of mappings from low-level concepts to their higher-level counterparts. A hierarchy can be represented as a linked list, as a lattice or as an arbitrary graph structure. We generate rules that relate structured attributes with concepts in the ECH. The user provides a structured component and a minimum confidence, and rules are generated from the neighbors that have the minimum support and the given structured attribute values. Once the ECH is created, attribute values of different attributes in the database can be associated with different subsets of concepts in the ECH. The approach has been applied to maintain parent, child and sibling relationships between concepts so that the generated rules can create concepts in the ECH.

(4.3.2) Direct Association Rules in the Web

Let d be an independent web page and D the website content. A set of pages X is called a page set, and the number of pages in it is its length. The page set X is the body and Y is the head of the rule X -> Y, which represents regularities discovered from a large data set.

  5. Database View of Web Pages

    A database consists of several interlinked relations, each representing certain objects and their relationships. Real-life data is interlinked and stored in relational databases, XML files and other data repositories. Many algorithms can be applied to a single, independent table, but in multi-relational databases this is not so easy: it is hard to model a table that is interlinked with many objects and to find a hypothesis that fits the data. The database view tries to convert a website into a kind of database so that sophisticated query retrieval becomes possible.

    The database view is an abstract layer that hides the Web under a virtual database. It is mainly concerned with two objectives:

    1. Schema Extraction from semi-structured view

    2. Query languages for semi-structured View

      5.1 Schema Discovery – Its basic purpose is to find all of the patterns. It explores an object by navigating from one object to another and keeps track of the object references traversed, using the Object Exchange Model. It can be explained as follows:

        Suppose we have a transaction database, and let T be a tree expression over a transaction. The support of T is the number of transactions that contain it; T is a schema pattern if it is maximal frequent, i.e. it is not weaker than any other tree expression.

        Patterns are computed over tree expressions of increasing size, starting from those containing no schema information, i.e. paths: sequences of alternating objects and labels.

      5.2 Schema Extraction – This is the task of automatically discovering implicit structural information from semi-structured data. Web data belongs to the semi-structured data class, i.e. data with a self-contained schema.

        There are two approaches:

        We can extract a schema from the occurrence of each object in the graph, where each object is related to some concept name and concept term. The number of concepts can be reduced by grouping together equivalent concepts, but even after this the schema remains large and specific.

        We can extract frequent substructures from the graph. The frequency of a substructure, i.e. the number of objects that satisfy it, is compared against a threshold value, which yields the relevant structures.

      5.3 Query Languages for the Web – Query languages give querying power to databases. Standard query languages, such as SQL for relational databases and OQL for object databases, are too constraining for querying semi-structured data, because they require data to conform to a fixed schema before any data is stored in the database.

        1. WebSQL proposes to model the web as a relational database composed of two (virtual) relations: Documents and Anchor. The document relation has one tuple for each document in the web and the Anchor relation has one tuple for each anchor in each document in the web.

        2. WebOQL – The main data structure provided by WebOQL is the hypertree. Hypertrees are ordered arc-labeled trees with two types of arcs, internal and external. Internal arcs are used to represent structured objects on a page, and external arcs are used to represent hyperlinks. Sets of related hypertrees are collected into structures called webs. Both hypertrees and webs can be manipulated through WebOQL.

        3. WebML. A weakness of both WebSQL and WebOQL is that both languages do not provide any means of knowledge discovery. WebML allows resource discovery as well as knowledge discovery from a subset or from web as a whole. WebML takes advantage of the MLDB (Multi Level Databases) model in which each layer is obtained by successive transformation and generalizations of the lower layers.

        4. W3QL combines structure queries, based on the organization of hypertext documents, and content queries, based on information retrieval techniques.

        5. WebLog – A logic-based query language for restructuring information extracted from Web information sources.

    5.3.5) Lorel and UnQL query heterogeneous and semi-structured information on the Web using a labeled graph data model.

      5.4 Multilevel Databases – Several researchers have proposed a multilevel database approach to organizing Web-based information. The main idea behind these proposals is that the lowest level of the database contains primitive semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), meta-data or generalizations are extracted from the lower levels and organized in structured collections such as relational or object-oriented databases.

        5.4.1) Multilevel Databases

        Level 0 – Unstructured and massive information base.

        Level 1 – Relatively structured; obtained by transformation and generalization.

        Level 2 – Further generalizations for better-structured data.

        In a multilayered database each layer is obtained through generalization and transformation operations performed on the lower layers. Summarization of large data sets into a general description is known as generalization.

        1. Data Generalization, which abstracts large sets of relevant data in a database from lower-level concepts to higher-level ones.

        2. Data Classification, which finds the common properties among a set of objects in a database and classifies them into different classes. Suppose we have training data in which each tuple consists of the same set of attributes; the classification task is to analyze the training data and develop an accurate description for each class.

    Figure 5.4.2: Architecture of MLDB

  6. Multimedia Mining

    Multimedia data refers to data such as text, numeric, image, video, audio, graphical, temporal, relational and categorical data. Multimedia mining is the discovery of knowledge from large amounts of multimedia data of different types. It involves the extraction of implicit knowledge, multimedia data relationships, or other patterns not explicitly stored in multimedia databases. Extracting relevant information and analyzing it are key aspects of this field. The recent growth of image databases has created a need for effective and efficient retrieval from large databases.

    Generally, multimedia database systems store and manage large collections of multimedia objects, such as image, video, audio and hypertext data. In multimedia documents, knowledge discovery therefore deals with non-structured information. For this reason, we need tools for discovering relationships between objects or segments within multimedia document components, such as classifying images based on their content, extracting patterns in sound, categorizing speech and music, and recognizing and tracking objects in video streams. Multimedia files undergo various operations to extract important features from them.


    6.1 Multimedia Mining Process

      Data collection is the starting point of a learning system, as the quality of raw data determines the overall achievable performance. Then, the goal of data pre-processing is to discover important features from raw data. Data pre-processing includes data cleaning, normalization, transformation, feature selection, etc. Learning can be straightforward, if informative features can be identified at pre-processing stage. The product of data pre-processing is the training set.

      Figure 6.1.1: Multimedia Mining Process

    6.2 Feature Extraction

      There are two kinds of features: description-based and content-based. The former uses metadata, such as keywords, caption, size and time of creation. The latter is based on the content of the object itself. Image categorization classifies images into semantic databases that are manually pre-categorized; within the same semantic database, images may show large variations with dissimilar visual descriptions. Three types of feature vectors are used for image description:

      1. pixel level features

      2. region level features

      3. tile level features

      Perceptual features such as loudness, brightness and pitch are used to represent sound clips. Apart from these, total energy, bandwidth and pitch period are also used for audio classification. The first stage in mining raw video data is grouping input frames into a set of basic units that are relevant to the structure of the video. In produced videos, the most widely used basic unit is a shot, defined as a collection of frames recorded from a single camera operation. Shot detection methods can be classified into several categories: pixel based, statistics based, transform based, feature based and histogram based.
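      A small sketch of a histogram-based shot detector follows, under simple assumptions: frames are grayscale intensity arrays, a normalized histogram is computed per frame, and a shot boundary is declared when the histogram difference between consecutive frames exceeds a threshold; the synthetic frames and the threshold are invented for illustration.

      import numpy as np

      def frame_histogram(frame, bins=16):
          # Normalized intensity histogram of one grayscale frame (values 0-255).
          hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
          return hist / hist.sum()

      def detect_shot_boundaries(frames, threshold=0.5):
          # Return the indices where a new shot starts, based on histogram difference.
          boundaries = []
          previous = frame_histogram(frames[0])
          for i, frame in enumerate(frames[1:], start=1):
              current = frame_histogram(frame)
              if np.abs(current - previous).sum() > threshold:   # large change -> cut between shots
                  boundaries.append(i)
              previous = current
          return boundaries

      rng = np.random.default_rng(0)
      dark = [rng.integers(0, 60, size=(32, 32)) for _ in range(5)]       # first "shot"
      bright = [rng.integers(180, 255, size=(32, 32)) for _ in range(5)]  # second "shot"
      print(detect_shot_boundaries(dark + bright))                        # expected: [5]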

    6.3 Data Pre-processing

      In a multimedia database, there are numerous objects that have many different dimensions of interests. Selecting a subset of features is a method for reducing the problem size [21]. This reduces the dimensionality of the data and enables learning algorithms to operate faster and more effectively. The problem of feature interaction can also be addressed by constructing new features from the basic features set. This technique is called feature construction/transformation

  7. Conclusion

In this paper we reviewed the representation issues and the processes involved in web content mining. Changes in the nature of the Web pose new challenges for web mining tasks and for how the algorithms must be adapted to those changes. A web page is now also seen as a database, not only as text. This paper also describes multimedia content and how it can be pre-processed and mined. Mining is no longer limited to text; its application areas are increasing day by day. Security, sensor networks, artificial intelligence and weather forecasting are upcoming areas for applying web mining.

8. References

[1] Arvind Arasu, Hector Garcia-Molina (2003) Extracting Structured Data from Web Pages, Proceedings of the 19th International Conference on Data Engineering.

[2] Bernard J. Jansen and Amanda Spink (2003) An Analysis of Web Documents Retrieved and Viewed, The 4th International Conference on Internet Computing, pp. 65-69.

[3] Abdallah Alashqur (2008) Mining Association Rules: A Database Perspective, International Journal of Computer Science and Network Security 8(12): 69-73.

[4] Hany Mahgoub, Dietmar Rösner, Nabil Ismail and Fawzy Torkey (2007) A Text Mining Technique Using Association Rules Extraction, International Journal of Computational Intelligence 4(1): 21-27.

[5] Manisha Marathe, S.H. Patil, G.V. Garje, M.S. Bewoor (2009) Extracting Content Blocks from Web Pages, Journal of Recent Trends in Engineering 2(4): 62-64.

[6] Mirela Pater, Daniela E. Popescu and Daniela Matei (2008) Pattern Discovery Techniques in Web Mining, Journal of Computer Science and Control Systems 1(1): 77-81.

[7] Przemyslaw Kazienko (2009) Mining Indirect Association Rules for Web, J. Appl. Math. Comput. Sci. 19(1): 165-186.

[8] Raymond Kosala and Hendrik Blockeel (July 2000) Web Mining Research: A Survey, SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, ACM 2(1): 2-9.

[9] Samia Jones and Omprakash K. Gupta (2006) Web Data Mining: A Case Study, Communications of the IIMA 6(4): 59-62.

[10] Zdravko Markov, Data Mining the Web, Wiley Series.

[11] David L. Olson, Dursun Delen, Advanced Data Mining Techniques, Springer.

[12] Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Series in Data Management Systems.

[13] Peter Jackson, Isabelle Moulinier, Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization (Natural Language Processing, ISSN 1567-8202, Vol. 5).

[14] Michael W. Berry, Survey of Text Mining, Springer.

[15] Adam Schenker, Horst Bunke, Mark Last, Abraham Kandel, Graph-Theoretic Techniques for Web Content Mining, Series in Machine Perception and Artificial Intelligence, Vol. 62.

[16] Bing Liu, Web Data Mining, Springer.

[17] Min Song and Yi-Fang Wu, Handbook of Research on Text and Web Mining Technologies, Information Science Reference.

[18] Hsinchun Chen and Michael Chau, Web Mining: Machine Learning for Web Applications, Ch. 6, Annual Review of Information Science and Technology.

[19] Xindong Wu, Chengqi Zhang, Shichao Zhang (2005) Database Classification for Multi-database Mining, Information Systems 30(1): 71-88.

[20] Zhao Li, Wee Keong Ng, Aixin Sun (2005) Web Data Extraction Based on Structural Similarity, Knowledge and Information Systems 8(4): 438-461.
