Knowledge Discovery in Text Mining Using Effective Pattern Techniques

DOI : 10.17577/IJERTV3IS10280


Miss. G. Sivagami
Assistant Professor, Department of Computer Science and Engineering,
SRM University, Tamil Nadu, India - 603203

Srivatsasa Janaswamy
Department of Computer Science and Engineering,
SRM University, Tamil Nadu, India - 603203

Abstract

Knowledge discovery is a process of non-trivial extraction of information from large databases: information that is implicitly present in the data, previously unknown, and useful for users. Mining the data is a necessary step in the process of discovering this unknown information. Text mining is the discovery of interesting knowledge in text documents, and it is a challenging issue to find accurate knowledge in text documents to help users find what they want. Many data mining techniques have been proposed for mining useful patterns in text documents. The current approach evaluates term supports and discovers closed patterns, frequent patterns, and closed sequential patterns by implementing techniques such as pre-processing, the pattern taxonomy model, pattern deploying, pattern evolving, and shuffling. This paper presents a few effective data mining techniques to extract terms from a data set by searching in frequent patterns, closed patterns, and closed sequential patterns, and evaluates the time taken to extract the terms using time complexities.

Keywords- Text Mining, Pattern Evolving, Pattern Deploying, Term Extraction, Time Evaluation.

  1. Introduction

Knowledge discovery can be viewed as the process of non-trivial extraction of information from large databases: information that is implicitly present in the data, previously unknown, and potentially useful for users. Data mining is therefore an essential step in the process of knowledge discovery in databases.

Data mining can be more fully characterized as the extraction of implicit, previously unknown, and potentially useful information from data. With text mining, however, the information to be extracted is clearly and explicitly stated in the text. This paper deals with data mining techniques such as the pattern taxonomy model, pattern deploying, pattern evolving, shuffling, term extraction, and time evaluation, which can extract the information a user requires and calculate how much time is taken to extract a term. With the large number of patterns generated by data mining approaches, how to effectively use and update these patterns is still an open research issue, and it remains challenging to find accurate knowledge (or features) in text documents to help users find what they want.

In the beginning, Information Retrieval (IR) provided many term-based methods to address this challenge, such as the Rocchio and probabilistic models, rough set models, BM25, and support vector machine (SVM) based filtering models. Term-based approaches have been followed in order to discover knowledge useful to the user. However, term-based approaches take more time, since term-to-term comparisons must be performed, and the semantic meaning of many discovered terms is uncertain for answering what users want. Over the years, people have often held the hypothesis that phrase-based approaches could perform better than term-based ones, as phrases may carry more semantic information. Nevertheless, the performance of phrase-based approaches has been discouraging, even though phrases are less ambiguous and more discriminative than individual terms. The likely reasons for the discouraging performance include: 1) phrases have inferior statistical properties to terms, 2) they have a low frequency of occurrence, and 3) there are large numbers of redundant and noisy phrases among them.

To overcome the disadvantages of phrase-based approaches, pattern-mining-based approaches (or pattern taxonomy models) have been proposed, which adopt the concept of closed sequential patterns and prune non-closed patterns. There are two fundamental issues regarding the effectiveness of pattern-based approaches: low frequency and misinterpretation. Given a specified topic, a highly frequent pattern (normally a short pattern with large support) is usually a general pattern, while a specific pattern usually has low frequency. If we decrease the minimum support, a lot of noisy patterns are discovered. Misinterpretation means that the measures used in pattern mining (e.g., support and confidence) turn out to be unsuitable for using discovered patterns to answer what users want. Many terms with larger weights (e.g., under the term frequency and inverse document frequency (tf*idf) weighting scheme) are general terms, because they are frequently used in both relevant and irrelevant information. For example, the term LIB may have a larger weight than JDK in a certain data collection, but we believe the term JDK is more specific than LIB for describing the Java programming language, and LIB is more general than JDK because LIB is also frequently used in C and C++.

Therefore, it is not adequate to evaluate the weights of terms based on their distributions in documents for a given topic, although this evaluation method has been frequently used in developing IR models. Hence, the specificities of the discovered patterns are calculated, and term weights are then evaluated according to the distribution of terms in the discovered patterns, rather than their distribution in documents, in order to solve the misinterpretation problem. The approach also considers the influence of patterns from the negative training examples to find ambiguous (noisy) patterns, and tries to reduce their influence to address the low-frequency problem. The process of updating ambiguous patterns is referred to as pattern evolution.

The process carried out in the current scenario starts with a pre-processing technique that splits documents into terms; the terms are stemmed and stop words are removed. Terms are analyzed to discriminate between meaningful and non-meaningful terms. Both positive and negative documents undergo the pattern taxonomy model technique for discovering closed patterns, frequent patterns, and closed sequential patterns. The discovered patterns are then deployed and summarized as d-patterns in order to find the term support based upon term occurrence in patterns rather than in documents. With the searching process performed over frequent patterns, closed patterns, and closed sequential patterns, the required information is discovered in the form of terms. The time taken to extract the terms is then calculated based on the time complexity of each of the above-mentioned techniques.

  2. Related Works

Rose, T., Stevenson, M., and Whitehead, M. introduced the RCV1 data set, the motivations behind its creation, and how it differs from previous corpora. RCV1, maintained by Reuters, is an archive of 806,791 English-language news stories. Reuters, the global information, news, and technology group, has for the first time made large quantities of archived Reuters news stories available free of charge for use by research communities around the world. In addition, category coding has been done, whereby each story is annotated for topic, region, and industry sector. The authors discussed the process by which these codes were applied and examined the issues involved in maintaining quality and consistency of coding in an operational, commercial environment.

Porter, M.F. describes the automatic removal of suffixes from words in English, which is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. It effectively works by treating complex suffixes as compounds made up of simple suffixes and removing them in a number of steps. In each step, the removal of a suffix depends on the form of the remaining stem, which usually involves a measure of its syllable length. In the current paper, this stemming process is used in the text pre-processing technique to reduce the number of terms in the data set.

Wu, S.-T., Li, Y., and Xu, Y. proposed methods which adopt the sequential pattern mining technique to find semantic patterns in text documents and then deploy these patterns using proposed deploying algorithms. In addition, sequential patterns are discovered using the SPMining algorithm. The study also indicates that pattern refinement is the key to improving effectiveness for pattern-based methods. The performance of the pattern deploying algorithms for text mining is investigated on the Reuters data set RCV1, and the results show that effectiveness is improved by using pattern refinement approaches.

Li, Y. and Zhong, N. proposed a novel approach to discover ontologies from data sets in order to build complete concept models for web user information needs. This approach was developed because it is not easy to obtain the right information from the web for a particular user or group of users, owing to the difficulty of automatically acquiring web user profiles. Composition relations are used to combine patterns and to find the support of terms in d-patterns. The work also proposes a method for capturing evolving patterns to refine discovered ontologies. In addition, a process for assessing relevance in an ontology is established.

  3. Proposed Work

Whereas the existing system evaluated term support based upon term occurrence in documents, the proposed system extracts the information a user wants in the form of terms or patterns. The data set is loaded into the database and undergoes a text pre-processing phase, which includes tokenization, part-of-speech tagging, word stemming, and stop-word removal. The terms present in the data set are then analyzed in order to determine which documents are positive and which are negative. In the next phase, the pattern taxonomy model and pattern deploying techniques are applied to the positive data set, one after the other, to discover patterns. Simultaneously, the negative set undergoes the pattern evolving and shuffling techniques, and the term supports are thereby evaluated. Terms are extracted in the term extraction module, and time is evaluated in the time evaluation module, which displays how much time the whole process above takes.

    Figure 1. Proposed Architecture Model Flow

    1. Pre-Processing

The preprocessing phase of the study converts the original textual data into a data-mining-ready structure, in which the most significant text features that serve to differentiate between text categories are identified; that is, unstructured text documents are processed using natural language processing techniques to extract keywords labeling the items in those documents. An effective preprocessor represents the document efficiently in terms of both space (for storing the document) and time (for processing retrieval requests) while maintaining good retrieval performance. The main objective of preprocessing is to obtain the key features or key terms from text documents and to enhance the relevancy between word and document and between word and category. After this phase, classical data mining techniques are applied to the extracted data (keywords) to discover interesting patterns. The preprocessing technique comprises:

      • Tokenization

      • Parts of speech

      • Word stemming

      • Stop word removal

            1. Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. These tokens are often loosely referred to as terms or words. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
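To make this concrete, the following minimal Python sketch (an illustration, not the paper's implementation) chops a character sequence into lower-case word tokens and discards punctuation; the regular expression is an assumed, simplistic choice:

import re

def tokenize(text):
    # Split the lower-cased text into alphanumeric tokens,
    # throwing away punctuation and whitespace.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Knowledge discovery in text mining!"))
# ['knowledge', 'discovery', 'in', 'text', 'mining']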

            2. Parts of Speech

A POS tagger marks the words in a text with labels corresponding to the part of speech of each word in that context, dividing them into nouns, verbs, and so on, according to the grammatical context. A few typical part-of-speech tags are NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., and, or), PRP (pronoun), and MD (modal auxiliary, e.g., can, will).
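As an illustration, a tagger such as the one shipped with the NLTK library produces tags of exactly this kind; this sketch assumes the standard NLTK models are available and is not necessarily the tagger used in this work:

import nltk

# One-time downloads of NLTK's tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("pattern mining discovers useful knowledge")
print(nltk.pos_tag(tokens))
# e.g. [('pattern', 'NN'), ('mining', 'NN'), ('discovers', 'VBZ'),
#       ('useful', 'JJ'), ('knowledge', 'NN')] -- tags may vary by tagger version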

            3. Word Stemming

Stemming techniques are used to find the root/stem of a word. Stemming converts words to their stems and incorporates a great deal of language-dependent linguistic knowledge. The hypothesis behind stemming is that words with the same stem or word root mostly describe the same or closely related concepts in text, so such words can be conflated by using their stems. The advantages of using a stemming procedure are that stemming improves the effectiveness of IR and text mining by matching similar words, mainly improving recall, and that it reduces the index size: combining words with the same root may reduce the indexing size by as much as 40-50%.
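For example, the Porter stemmer (reference 2) is available in NLTK; a minimal sketch:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["mining", "mined", "connections", "studies"]:
    print(word, "->", stemmer.stem(word))
# mining -> mine, mined -> mine, connections -> connect, studies -> studi

Note that a stem such as 'studi' need not be a dictionary word; it only has to conflate related word forms consistently.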

4. Stop Word Removal

Many of the most frequently used words in English are useless in Information Retrieval (IR) and text mining. These words are called 'stop words'. Stop words, which are language-specific functional words, are frequent words that carry little information (e.g., pronouns, prepositions, conjunctions). Removing stop words reduces the indexing (or data) file size and improves efficiency and effectiveness.
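A minimal sketch of stop-word removal using NLTK's built-in English stop-word list (the word list is NLTK's, not one defined in this paper):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the word list
stop = set(stopwords.words("english"))
tokens = ["knowledge", "is", "discovered", "in", "the", "text", "documents"]
print([t for t in tokens if t not in stop])
# ['knowledge', 'discovered', 'text', 'documents']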

          1. Term Analysis

The term analysis phase was presented to bridge the gap between NLP and text mining; it analyzes terms at the sentence and document levels. The advantage of the term analysis model is that it can effectively discriminate between non-important terms and meaningful terms which describe a sentence's meaning. A term analysis model usually relies upon the NLP techniques it employs.

          2. Pattern Taxonomy Model

In this phase, frequent patterns, sequential patterns, and closed sequential patterns are extracted. To improve the efficiency of pattern taxonomy mining, an algorithm, SPMining, was proposed to find all closed sequential patterns; it uses the well-known Apriori property to reduce the searching space. All documents are split into paragraphs, so a given document d yields a set of paragraphs PS(d). Let D be a training set of documents, consisting of a set of positive documents D+ and a set of negative documents D-. Let T = {t1, t2, ..., tm} be the set of terms (or keywords) that can be extracted from the set of positive documents D+. Given a termset X in document d, [X] is used to denote the covering set of X for d, which includes all paragraphs dp ∈ PS(d) such that X ⊆ dp, i.e.,

[X] = {dp | dp ∈ PS(d), X ⊆ dp}.

Frequent patterns are discovered in this phase using the Apriori algorithm in order to reduce the searching space for the user; they are discovered based on absolute support and relative support. The absolute support is the number of occurrences of X in PS(d), denoted supa(X) = |[X]|. The relative support is the fraction of the paragraphs that contain the pattern, denoted supr(X) = |[X]| / |PS(d)|. A termset X is called a frequent pattern if its supr (or supa) is no less than min_sup, a minimum support threshold. Table 1 lists a set of paragraphs for a given document d, where PS(d) = {dp1, dp2, ..., dp6}; duplicate terms within a paragraph have been removed.

Table 1. A Set of Paragraphs

Paragraph   Terms
dp1         t1 t2
dp2         t3 t4 t6
dp3         t3 t4 t5 t6
dp4         t3 t4 t5 t6
dp5         t1 t2 t6 t7
dp6         t1 t2 t6 t7

Let min_sup = 50%; ten frequent patterns are then obtained from Table 1 using the above definitions. Table 2 illustrates the ten frequent patterns and their covering sets, as computed in the sketch that follows the table.

Table 2. Frequent Patterns and Covering Sets

Frequent Pattern   Covering Set
{t3, t4, t6}       {dp2, dp3, dp4}
{t3, t4}           {dp2, dp3, dp4}
{t3, t6}           {dp2, dp3, dp4}
{t4, t6}           {dp2, dp3, dp4}
{t3}               {dp2, dp3, dp4}
{t4}               {dp2, dp3, dp4}
{t1, t2}           {dp1, dp5, dp6}
{t1}               {dp1, dp5, dp6}
{t2}               {dp1, dp5, dp6}
{t6}               {dp2, dp3, dp4, dp5, dp6}
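The ten patterns in Table 2 can be reproduced with a brute-force Python sketch (for illustration only; it enumerates every termset rather than using the Apriori pruning or the SPMining algorithm described above, and the names ps, covering_set, and frequent are assumptions of this sketch):

from itertools import combinations

# Paragraph set PS(d) for the document d of Table 1.
ps = {
    "dp1": {"t1", "t2"},
    "dp2": {"t3", "t4", "t6"},
    "dp3": {"t3", "t4", "t5", "t6"},
    "dp4": {"t3", "t4", "t5", "t6"},
    "dp5": {"t1", "t2", "t6", "t7"},
    "dp6": {"t1", "t2", "t6", "t7"},
}

def covering_set(x):
    # [X] = {dp | dp in PS(d), X subset of dp}
    return {name for name, terms in ps.items() if x <= terms}

min_sup = 0.5
all_terms = sorted(set().union(*ps.values()))
frequent = {}
for size in range(1, len(all_terms) + 1):
    for x in combinations(all_terms, size):
        cover = covering_set(set(x))
        if len(cover) / len(ps) >= min_sup:  # relative support supr(X)
            frequent[x] = cover

for pattern, cover in frequent.items():
    print(pattern, "->", sorted(cover))
# Prints the ten frequent patterns of Table 2, e.g.
# ('t3', 't4', 't6') -> ['dp2', 'dp3', 'dp4']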

Not all frequent patterns in Table 2 are useful. For example, the pattern {t3, t4} always occurs together with the term t6 in paragraphs; i.e., the shorter pattern {t3, t4} is always part of the larger pattern {t3, t4, t6} in all of the paragraphs. Hence, the shorter pattern {t3, t4} is a noise pattern, and we expect to keep only the larger pattern {t3, t4, t6}.

Given a termset X, its covering set [X] is a subset of paragraphs. Similarly, given a set of paragraphs Y ⊆ PS(d), its termset can be defined as

termset(Y) = {t | ∀ dp ∈ Y, t ∈ dp}.

The closure of X is defined as follows:

Cls(X) = termset([X]).

A pattern X (also a termset) is called closed if and only if X = Cls(X).

Let X be a closed pattern. We can prove that

supa(X1) < supa(X)

for all patterns X1 ⊃ X; otherwise, if supa(X1) = supa(X), we would have X1 = X, where supa(X1) and supa(X) denote the absolute supports of the patterns X1 and X, respectively.

The pattern taxonomy is evaluated to discover closed patterns. Patterns can be structured into a taxonomy by using the is-a (or subset) relation. Consider the example of Table 1, where a set of paragraphs of a document is illustrated, and the ten frequent patterns discovered in Table 2, assuming min_sup = 50%. There are, however, only three closed patterns in this example: {t3, t4, t6}, {t1, t2}, and {t6}.
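Continuing the sketch after Table 2, the closure definition can be applied directly to verify these three closed patterns (again a plain illustration, not the pruning performed inside SPMining; termset and closure are names assumed by this sketch):

def termset(paragraph_names):
    # termset(Y) = {t | t appears in every dp in Y}
    sets = [ps[name] for name in paragraph_names]
    return set.intersection(*sets) if sets else set()

def closure(x):
    # Cls(X) = termset([X])
    return termset(covering_set(set(x)))

closed = [p for p in frequent if set(p) == closure(p)]
print(closed)
# [('t6',), ('t1', 't2'), ('t3', 't4', 't6')] -- the three closed patterns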


Figure 2 illustrates an example of the pattern taxonomy for the frequent patterns in Table 2, where the nodes represent frequent patterns and their covering sets, non-closed patterns can be pruned, and the edges represent the is-a relation. After pruning, some direct is-a relations may change; for example, the pattern {t6} would become a direct subpattern of {t3, t4, t6} after pruning the non-closed patterns. From the frequent and closed patterns, closed sequential patterns are then discovered using the SPMining algorithm.

        Figure 2. Pattern Taxonomy

          1. Pattern Deployment Method

In order to use the semantic information in the pattern taxonomy to improve the performance of closed patterns in text mining, the discovered patterns must be interpreted by summarizing them as d-patterns, so that term weights (supports) can be accurately evaluated. The rationale behind this is that d-patterns include more semantic meaning than terms selected by a term-based technique (e.g., tf*idf). The evaluation of term weights (supports) differs from that of the usual term-based approaches: in term-based approaches, term weights are evaluated based on the distribution of terms in documents, whereas in this research, terms are weighted according to their appearances in the discovered closed patterns.

The result of the composition is still a set of term-number pairs. Formally, for every positive document di ∈ D+, its closed patterns are deployed on a common set of terms T in order to obtain the following d-patterns (deployed patterns; non-sequential weighted patterns):

[di] = {(ti1, ni1), (ti2, ni2), ..., (tim, nim)},

where tij in the pair (tij, nij) denotes a single term and nij is its support in di, which is the total absolute support given by the closed patterns that contain tij; alternatively, nij is the total number of closed patterns that contain tij. The process of calculating d-patterns can be described using the pattern taxonomy model algorithm.
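A minimal sketch of the second variant (counting, for each term, the closed patterns that contain it); the function name deploy and the data layout are illustrative assumptions, not the paper's algorithm:

from collections import defaultdict

def deploy(closed_patterns):
    # d-pattern: map each term to the number of discovered
    # closed patterns that contain it.
    d_pattern = defaultdict(int)
    for pattern in closed_patterns:
        for term in pattern:
            d_pattern[term] += 1
    return dict(d_pattern)

print(deploy([("t6",), ("t1", "t2"), ("t3", "t4", "t6")]))
# {'t6': 2, 't1': 1, 't2': 1, 't3': 1, 't4': 1}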

          2. Inner Pattern Evolution

This method defines how to reshuffle the supports of terms within the normal forms of d-patterns, based on the negative documents in the training set. The technique is useful for reducing the side effects of noisy patterns caused by the low-frequency problem. It is called inner pattern evolution here, because it only changes a pattern's term supports within the pattern itself.

A noise negative document nd in D- is a negative document that the system falsely identified as positive. In order to reduce the noise, the d-patterns that give rise to such an error are tracked down. These are called the offenders of nd. An offender of nd is a d-pattern that has at least one term in nd.

The basic idea of updating patterns is as follows: complete-conflict offenders are removed from the d-patterns first. For partial-conflict offenders, their term supports are reshuffled in order to reduce the effects of noise documents.
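The following schematic sketch shows this idea under simplifying assumptions: a d-pattern is a term-to-support dictionary, a noise document is a set of terms, and the shuffling factor mu is an assumed parameter of this sketch rather than the paper's reshuffling formula:

def update_for_noise(d_patterns, nd_terms, mu=0.5):
    # nd_terms: the terms of a noise negative document nd.
    updated = []
    for dp in d_patterns:
        overlap = set(dp) & nd_terms
        if not overlap:
            updated.append(dp)   # not an offender: keep unchanged
        elif overlap == set(dp):
            continue             # complete-conflict offender: remove
        else:
            # Partial-conflict offender: move a fraction mu of the
            # conflicting terms' support onto the remaining terms.
            shifted = sum(dp[t] * mu for t in overlap)
            keep = set(dp) - overlap
            dp = {t: dp[t] * (1 - mu) if t in overlap
                     else dp[t] + shifted / len(keep) for t in dp}
            updated.append(dp)
    return updated

d_patterns = [{"t1": 2, "t2": 1}, {"t3": 1, "t4": 1, "t6": 2}]
print(update_for_noise(d_patterns, {"t1", "t2", "t7"}))
# [{'t3': 1, 't4': 1, 't6': 2}] -- the first d-pattern was a
# complete-conflict offender and has been removed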

          3. Shuffling

The d-patterns affected by negative documents are identified by the inner pattern evolution method, so the patterns are updated once again: they are reshuffled so that the term supports change accordingly.

          1. Term Extraction

The searching process is carried out over the frequent patterns, closed patterns, and closed sequential patterns in order to extract the information a user wants in the form of terms. Along with the terms, the paragraph (from D+ or D-) that contains them is extracted. If the required term is found during the searching process in any of the extracted patterns, the search is aborted.

          2. Time Evaluation

In the term-based approach, the required term must be compared with each and every term in the document, which takes a lot of time, since data mining is performed on large volumes of data. Hence, information extraction is done using data mining techniques so that terms are extracted in much less time. To evaluate the time taken for extraction, the time complexity is calculated for extracting frequent patterns, closed patterns, and closed sequential patterns. The total time complexity gives the time taken to extract the term.
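Alongside the complexity analysis, the elapsed time can also be measured directly; a minimal sketch using Python's standard timer, reusing the deploy() sketch from the Pattern Deployment section above:

import time

start = time.perf_counter()
result = deploy([("t6",), ("t1", "t2"), ("t3", "t4", "t6")])
elapsed = time.perf_counter() - start
print("terms extracted:", result)
print("elapsed seconds:", elapsed)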

  4. Experimental Method

In the experimental procedure, a few users are allowed to choose the particular document they want. In the search option of the developed application, users enter a keyword of interest; the selected document then undergoes preprocessing, the terms are analyzed, and frequent patterns, closed patterns, and closed sequential patterns are extracted. If the required keyword occurs in any of the extracted patterns, the searching process is aborted. These results are analyzed, along with feedback taken from every user, to assess the accuracy of the process. The time taken to obtain the result is compared with the time taken by the term-based approach; hence it can be shown that the proposed approach is more accurate than the term-based approach.

  5. Conclusion

This paper is intended to extract information useful to users; the extracted information is in the form of terms. In preprocessing, the document is tokenized, words are stemmed, and stop words are removed. The terms are then analyzed and undergo pattern taxonomy, pattern deployment, pattern evolving, shuffling, and term extraction to obtain an accurate result.

  6. References

  1. T. Rose, M. Stevenson, and M. Whitehead, "The Reuters Corpus Volume 1 - From Yesterday's News to Today's Language Resources," Proc. Third Int'l Conf. Language Resources and Evaluation, pp. 29-31, 2002.

  2. M.F. Porter, "An Algorithm for Suffix Stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.

  3. S.-T. Wu, Y. Li, and Y. Xu, "Deploying Approaches for Pattern Refinement in Text Mining," Proc. IEEE Sixth Int'l Conf. Data Mining (ICDM '06), pp. 1157-1161, 2006.

  4. Y. Li and N. Zhong, "Mining Ontology for Automatically Acquiring Web User Information Needs," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 4, pp. 554-568, Apr. 2006.

  5. N. Zhong, Y. Li, and S.-T. Wu, "Effective Pattern Discovery for Text Mining," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 1, Jan. 2012.

  6. V.S. Baghela and S. Tripathi, "Text Mining Approaches to Extract Interesting Association Rules from Text Documents."

  7. V. Srividhya and R. Anitha, "Evaluating Preprocessing Techniques in Text Categorization."
