Applying Data Mining Techniques for Phrase Extraction in Document Collections

DOI : 10.17577/IJERTCONV8IS03017
Download Full-Text PDF Cite this Publication
Text Only Version

 

Applying Data Mining Techniques for Phrase Extraction in Document Collections

R. Prema1

Research Scholar1,

PG and Research Department of Computer Science and Applications, Vivekanandha College of Arts and Sciences for Women [Autonomous],

Tiruchengode.

Mr. V. P. Muthukumar2

Assistant Professor2,

PG and Research Department of Computer Science and Applications, Vivekanandha College of Arts and Sciences for Women [Autonomous],

Tiruchengode.

Abstract:- Generally, writings have been broke down utilizing different data recovery related techniques, for example, full- content examination, and common language handling. Be that as it may, just few instances of information mining in content, especially in full content, are accessible. In this paper, general information mining techniques are pertinent to content examination assignments, for example, spellbinding expression extraction. Also, present a general system for text mining. The system follows the general information revelation process, in this manner containing steps from preprocessing to the use of the outcomes. The information mining technique that applies depends on summed up episodes and episode rules. It gives solid instances of how to preprocess writings in light of the proposed utilization of the found outcomes and present a weighting plan those aides in pruning out repetitive or non-clear expressions. Likewise present outcomes from genuine information tests.

Keywords: Text Mining, Preprocess, Episode and Episode Rules.

  1. INTRODUCTION

    As of late, have seen the overflowing appearance of exceptionally enormous heterogeneous full- content record assortments, accessible for any end client. The assortment of clients’ desires is wide. The client may require a general perspective on the archive assortment: what themes are secured, what sort of reports exist, are the records some way or another related, and so on. Then again, the client might need to locate a particular snippet of data content. At the other extraordinary, a few clients might be keen on the language itself, e.g., in word uses or semantic structures. A typical element for every one of the assignments referenced is that the client doesn’t know precisely what he/she is searching for. Thus, an information mining approach ought to be proper, in light of the fact that by definition it is finding intriguing regularities or then again exemptions from the information, perhaps without an exact core interest.

  2. GENERAL FRAMEWORK FOR TEXT MINING

    In this approach, consider message as successive data in numerous regards like the information gathered by sensors or other perception frameworks. The general information revelation process adjusted to the errand of content processing is explained in fig 1.

    The beginning point is printed data and the final result is data portraying marvels that are visit in the data,

    e.g. phrases or co-occurring terms. In our approachthis data is displayed as episodes and episode rules.

    Notwithstanding depicting the revelation methods, it clarify the key choices of the preprocessing and post processing stages that are important to center our revelation procedure.

    Select/Filter
    Convert

    Convert

    Select/Filter
    Convert

    Convert

    Purne
    Group
    Order
    Purne
    Group
    Order

     

    Morphological Analysis Program

    Episode Rule Alogorithm

    Episode

    Rules

    Preprocessing

    Discovery

    Post preprocessing

    Preprocessing

    Discovery

    Post preprocessing

    Fig1: Know ledge discovery from textual representation into episodes and episode rules

  3. APPLICATIONS
    1. Information retrieval tasks

      In data recovery or all the more explicitly message recovery watchwords and key expressions are normally utilized to support inquiry handling. Consider a typical data recovery task: The client communicates his/her data needs, e.g., by giving an inquiry, and the framework executes the hunt by coordinating the question with the records. With huge assortments just checking the content isn’t achievable. Subsequently, a lot of agent watchwords must be chosen and appended to the records. For this reason, single-term watchwords might be too wide to ever be utilized alone. Expressions comprising of arrangements of related words convey a more explicit importance than the single terms remembered for the expressions.

      A lot of expressions can be viewed as a substance descriptor that ought to recognize the report from different records in the assortment. Notwithstanding basic inquiries, content descriptors can be utilized for different content grouping undertakings. For example, reports can be grouped by their closeness, e.g., to imagine a huge record assortment .

      Despite the fact that ordering and choosing watchwords are well studied inside data recovery; new difficulties have been as of late set by the abrupt appearance of enormous heterogeneous full content archive assortments. Lewis and Spärck Jones consider compound key terms as one fundamental probability to improve the nature of content recovery in this new circumstance. They likewise underline the need of comprehensive testing to figure out what the exact structure of these compound

      terms ought to be, and how they ought to be chosen and weighted comparative with their constituents.

      The preprocessing required for finding key expressions is genuinely clear. The syntactic highlights of the words are not utilized, and normally they are intrigued either in the first word or in its base structure (e.g., handling or on the other hand process). The auxiliary data and accentuation marks are normally dropped, however they may influence the holes in the ordering plan, e.g., it is frequently wanted that the words in an expression happen in a similar sentence. Basic capacity words (relational words, articles, and so on.) are pruned.

    2. Natural language processing

Another application territory is breaking down the etymological highlights of the content. It has considered three regular language preparing applications:

  1. Finding syntactic principles
  2. Finding collocations and
  3. Developing summed up concordances

The linguistic guidelines that consider here are any rules portraying conditions between semantic highlights connected to words. For example, it might need to contemplate the structure of sentences by finding the arranged successions of grammatical features. The preprocessing requires leaving just the chose morphological highlights while the 5 real word structures are pruned. Contingent upon the focal point of the examination, whole element vectors of certain words and accentuation imprints might be pruned too. Postprocessing may incorporate arranging and gathering of rules as indicated by a few highlights.

Collocations are intermittent blends of words comparing to subjective word utilizations. In contrast to common xpressions utilized in data recovery, collocations regularly contain relational words and curved words. The distincts three kinds of collocations:

  1. Predicative relations
  2. Inflexible thing phrases and
  3. Phrasal layouts that may contain void spaces Notwithstanding the etymological intrigue,

    collocations might be helpful in recovery errands. It has been indicated that a few sorts of collocations are space subordinate and, subsequently, great markers of the points secured by the archive. What sort of collocations are viewed as intriguing relies upon the proposed application. On the off chance that they are utilized as substance descriptors in data recovery, their separating capacity is one measure.

    A generally utilized device for looking at word uses in a few assortments of writings is developing concordances: all the events of a given word in the assortment are recorded together with the unique circumstance, i.e., the words showing up preceding and after the word. In the event that the assortment is enormous, in any case, concordances can give a lot of information. One approach to gather distinctive word utilizes or to rank them in request of significance is to sort concordance lines as per the collocations encompassing the word, think about a much further developed methodology, purported summed up concordances: visit designs that may contain both words and syntactic highlights. Preprocessing may drop all sentences that don’t contain the given word. Plausibility is to expel the scenes or scene rules not containing the word in the postprocessing stage. As the highlights of an element vector are taken care of as one substance, the current revelation strategy doesn’t deliver concordances counting both word structures and linguistic highlights. For example, concentrating the word discover, the pattern

    knowledge /discovery/ in [N]

    cannot be produced. However, the instances of this pattern, e.g., knowledge /discovery/ in databases

    can be discovered utilizing some grep-like instrument, given that unique word structures and grammatical forms are remembered for the component vector.

  4. CONCULSION

    In this paper, indicated that general information mining techniques are appropriate to content examination assignments. Moreover exhibited a general structure for content mining. The system follows the general KDD processin this way containing steps from preprocessing to the usage of the results gave solid instances of how to pre and postprocess writings dependent on the planned utilization of the found outcomes. In additionally introduced model applications from data recovery and common language handling and exhibited the relevance of our approach with probes real life information. In further analysis, intend to review whether the outcomes can be utilized in improving the generally availability of documents and which instruments are expected to make the examination of a apparently enormous assortment of scenes and scene governs more efficient.

  5. REFERENCES
  1. F. Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143177, 1993.
  2. R. Feldman, I. Dagan, and W. Klösgen. Efficient algorithms for mining and manipulating associations in texts. In Cybernetics and Systems, Volume II, The Thirteenth European Meeting on Cybernetics and Systems Research, Vienna, Austria, Apr. 1996.
  3. R. Feldman, W. Kloesgen, and A. Zilberstein. Document explorer: Discovering knowledge in document collections. In Z.

    W. Ras and A. Skowron, editors, Proceedings of Tenth International Symposium on Methodologies for Intelligent Systems (ISMIS97), number 1325 in Lecture Notes in Artificial Intelligence, pages 137146, Charlotte, North Carolina, USA, Oct. 1997. Springer-Verlag.

  4. D. R. Cutting, D. Karger, J. Pedersen, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In N. Belkin, P. Ingwersen, and A. Mark Pejtersen, editors, Proceedings of the 15th Annual International ACM/SIGIR Conference (SIGIR92), pages 318329, Copenhagen, Denmark, June 1992.
  5. D. D. Lewis and K. Spärck Jones. Natural language processing for information retrieval. Communications of the ACM, 39(1):92101, 1996..
  6. D. Biber. Co-occurrence patterns among collocations: a tool for corpus-based lexical knowledge. Computational Linguistics, 19(3):531538, 1993.

Leave a Reply