Data Mining and Knowledge Discovery

Dr. T. Swarnalatha; V. Sireesha

doi:10.17577/IJERTCONV2IS15022

NCDMA - 2014 (Volume 2 - Issue 15)


Open Access
Paper ID : IJERTCONV2IS15022
Volume & Issue : NCDMA – 2014 (Volume 2 – Issue 15)
Published (First Online): 30-07-2018
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Dr. T. Swarnalatha, Professor & HOD, Dept of MCA

V. Sireesha, Assistant Professor, Dept of MCA

Narayana Engineering College, Nellore

Abstract: Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. The first and simplest analytical step in data mining is to describe the data: summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look for potentially meaningful links among variables. In the data mining process, collecting, exploring and selecting the right data are critically important. Knowledge discovery demonstrates intelligent computing at its best, and is the most desirable and interesting end-product of information technology. To be able to discover and extract knowledge from data is a task that many researchers and practitioners are endeavoring to accomplish. There is a lot of hidden knowledge waiting to be discovered; this is the challenge created by today's abundance of data. Knowledge Discovery in Databases (KDD) is the process of identifying valid, novel, useful, and understandable patterns from large datasets.

Keywords: Data mining, knowledge discovery, machine learning, datasets

I. INTRODUCTION:

Data Mining (DM) is the mathematical core of the KDD process, involving the inferring algorithms that explore the data, develop mathematical models and discover significant patterns (implicit or explicit), which are the essence of useful knowledge. Advances in data gathering, storage and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modeling, data warehousing and OLAP, optimization, and high performance computing. KDD is concerned with issues of scalability, the multi-step knowledge discovery process for extracting useful patterns and models from raw data stores (including data cleaning and noise modeling), and issues of making discovered patterns understandable.

Knowledge Discovery includes:

Theory and Foundational Issues: data and knowledge representation; modeling of structured, textual and multimedia data; uncertainty management; metrics of interestingness and utility of discovered knowledge; algorithmic complexity, efficiency and scalability issues in data mining; statistics over massive data sets.

Data Mining Methods: classification, clustering, probabilistic modeling, prediction and estimation, dependency analysis, search and optimization.

Algorithms for data mining, including spatial, textual and multimedia data (e.g. the Web); scalability to large databases; parallel and distributed data mining techniques; and automated discovery agents.

THE KDD PROCESS:

The knowledge discovery process is iterative and interactive, consisting of several steps. The process starts with determining the KDD goals and ends with the implementation of the discovered knowledge. As a result, changes would have to be made in the application domain (such as offering different features to mobile phone users in order to reduce churn). This closes the loop: the effects are then measured on the new data repositories, and the KDD process is launched again. The following are the steps that are used:

The following figure presents a summary corresponding to the relative effort spent on each of the DMKD steps.

DATA MINING METHODOLOGY:

It should be clear from the above that data mining is not a single technique; any method that will help to get more information out of data is useful. Different methods serve different purposes, each method offering its own advantages and disadvantages. However, most methods commonly used for data mining can be classified into the following groups.

Statistical Methods: Historically, statistical work has focused mainly on testing of preconceived hypotheses and on fitting models to data. Statistical approaches usually rely on an explicit underlying probability model. In addition, it is generally assumed that these methods will be used by statisticians, and hence human intervention is required for the generation of candidate hypotheses and models.

Case-Based Reasoning: Case-based reasoning (CBR) is a technology that tries to solve a given problem by making direct use of past experiences and solutions. A case is usually a specific problem that has been previously encountered and solved. Given a particular new problem, case-based reasoning examines the set of stored cases and finds similar ones. If similar cases exist, their solution is applied to the new problem, and the problem is added to the case base for future reference.
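The retrieve, reuse and retain cycle described above can be sketched in a few lines. This is only an illustrative sketch: the case base, the Euclidean similarity measure, the `max_distance` threshold and the function name `solve` are all assumptions made for the example, not part of any particular CBR system.

```python
from math import dist

# Hypothetical case base: each case pairs a feature vector with its known solution.
case_base = [
    ((1.0, 1.0), "restart service"),
    ((5.0, 5.0), "replace disk"),
    ((9.0, 1.0), "add memory"),
]

def solve(problem, cases, max_distance=2.0):
    """Reuse the solution of the most similar stored case, then retain the new case."""
    # Retrieve: find the stored case closest to the new problem.
    nearest_features, nearest_solution = min(cases, key=lambda c: dist(c[0], problem))
    if dist(nearest_features, problem) > max_distance:
        return None  # no sufficiently similar case exists
    # Retain: add the solved problem to the case base for future reference.
    cases.append((problem, nearest_solution))
    return nearest_solution
```

Calling `solve((1.5, 1.2), case_base)` reuses the solution of the first stored case and grows the case base by one entry; a problem far from every stored case returns `None` and leaves the case base unchanged.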

Neural Networks: Neural networks (NN) are a class of systems modeled after the human brain. As the human brain consists of billions of neurons that are interconnected by synapses, neural networks are formed from large numbers of simulated neurons, connected to each other in a manner similar to brain neurons. As in the human brain, the strength of neuron interconnections may change (or be changed by the learning algorithm) in response to a presented stimulus or an obtained output, which enables the network to learn.
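How connection strengths change in response to an obtained output can be illustrated with a single simulated neuron. The sketch below uses the classic perceptron learning rule on the logical AND function; the choice of rule, the training data and the names `train_perceptron` and `predict` are assumptions made for illustration, and real networks use many neurons and more elaborate learning algorithms.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs exceeds zero, else stay silent (0)."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, epochs=20, lr=1.0):
    """Adjust connection weights whenever the obtained output differs from the target."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Strengthen or weaken each connection in proportion to its input.
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

# Hypothetical training data: the logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
```

After training, `predict(weights, bias, inputs)` reproduces the target for each of the four samples, because AND is linearly separable and the perceptron rule converges on such data.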

Decision Trees: A decision tree is a tree where each non-terminal node represents a test or decision on the considered data item. Depending on the outcome of the test, one chooses a certain branch. To classify a particular data item, we start at the root node and follow the assertions down until we reach a terminal node (or leaf). When a terminal node is reached, a decision is made. Decision trees can also be interpreted as a special form of a rule set, characterized by their hierarchical organization of rules.
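The root-to-leaf traversal can be sketched as follows. The tree itself (a weather example with `outlook` and `humidity` attributes) and the dictionary layout are hypothetical; the sketch covers only classification with a given tree, not the induction of a tree from data.

```python
# Hypothetical tree: non-terminal nodes are dicts holding a test and its branches;
# terminal nodes (leaves) are plain strings holding the decision.
tree = {
    "test": lambda item: item["outlook"],
    "branches": {
        "sunny": {
            "test": lambda item: "high" if item["humidity"] > 70 else "normal",
            "branches": {"high": "stay in", "normal": "play"},
        },
        "overcast": "play",
        "rainy": "stay in",
    },
}

def classify(node, item):
    """Start at the root and follow test outcomes until a leaf is reached."""
    while isinstance(node, dict):
        outcome = node["test"](item)
        node = node["branches"][outcome]
    return node
```

Each root-to-leaf path corresponds to one rule, for example "if outlook is sunny and humidity is high, then stay in", which is the rule-set reading of a tree mentioned above.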

Rule Induction: Rules state a statistical correlation between the occurrence of certain attributes in a data item, or between certain data items in a data set. The general form of an association rule is X1 ^ … ^ Xn => Y [C, S], meaning that the attributes X1, …, Xn predict Y with a confidence C and a significance S.
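The two numbers attached to a rule can be computed directly from counts. The sketch below assumes that the significance S corresponds to what is commonly called support (the fraction of all data items containing both the antecedent and Y), and that the confidence C is the fraction of items containing the antecedent that also contain Y. The transactions and the function name `rule_stats` are made up for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (confidence, support) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Count the items that contain all antecedent attributes.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Count the items that contain the antecedent and the consequent together.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return confidence, support

# Hypothetical market-basket data.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]
confidence, support = rule_stats(transactions, {"bread"}, {"butter"})
```

Here three of the four baskets contain bread and two of those also contain butter, so the rule bread => butter has a confidence of 2/3 and a support of 2/4.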

Bayesian Belief Networks: Bayesian belief networks (BBN) are graphical representations of probability distributions, derived from co-occurrence counts in the set of data items. Specifically, a BBN is a directed, acyclic graph, where the nodes represent attribute variables and the edges represent probabilistic dependencies between the attribute variables. Associated with each node are conditional probability distributions that describe the relationships between the node and its parents.
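A three-node network is enough to show how the per-node conditional distributions combine into a joint distribution. The structure (rain and sprinkler are both parents of wet grass) and all probabilities below are invented for the example; in practice they would be estimated from co-occurrence counts in the data.

```python
from itertools import product

# Hypothetical network: rain -> wet <- sprinkler. Root nodes have no parents.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.1, False: 0.9}
# Conditional distribution of the child node: P(wet = True | rain, sprinkler).
p_wet_given = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's distribution given its parents."""
    p_wet = p_wet_given[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (p_wet if wet else 1 - p_wet)

def prob_rain_given_wet():
    """P(rain | wet) by summing the joint over the unobserved sprinkler variable."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / denominator
```

With these numbers, observing wet grass raises the probability of rain from the prior 0.2 to 0.1818 / 0.2458, roughly 0.74. Enumeration like this is only practical for very small networks.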

Genetic Algorithms / Evolutionary Programming: Genetic algorithms and evolutionary programming are algorithmic optimization strategies that are inspired by the principles observed in natural evolution. From a collection of potential problem solutions that compete with each other, the best solutions are selected and combined with each other. In doing so, one expects that the overall goodness of the solution set will become better and better, similar to the process of evolution of a population of organisms. Genetic algorithms and evolutionary programming are used in data mining to formulate hypotheses about dependencies between variables, in the form of association rules or some other internal formalism.
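The select-and-combine loop can be sketched on a toy problem. The example below maximizes the number of 1-bits in a bit string (often called OneMax); the problem, the parameter values and the function names are assumptions chosen for brevity, and a data mining application would instead encode candidate hypotheses such as rules as the individuals.

```python
import random

def fitness(bits):
    """Toy goodness measure: the number of 1-bits in the candidate solution."""
    return sum(bits)

def evolve(pop_size=20, length=16, generations=30, mutation_rate=0.02, seed=0):
    """Return the best fitness in the population after each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = [max(map(fitness, population))]
    for _ in range(generations):
        # Selection: keep the better half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Elitism: the best solution survives unchanged.
        children = [parents[0][:]]
        while len(children) < pop_size:
            # Combination: one-point crossover of two distinct parents.
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally flip a bit.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
        history.append(max(map(fitness, population)))
    return history
```

Because the best individual is always carried over unchanged, the values returned by `evolve()` never decrease from one generation to the next, which mirrors the expectation that the solution set becomes better and better.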

Fuzzy Sets: Fuzzy sets form a key methodology for representing and processing uncertainty. Uncertainty arises in many forms in today's databases: imprecision, non-specificity, inconsistency, vagueness, etc. Fuzzy sets exploit uncertainty in an attempt to make system complexity manageable. As such, fuzzy sets constitute a powerful approach to deal not only with incomplete, noisy or imprecise data, but may also be helpful in developing uncertain models of the data that provide smarter and smoother performance than traditional systems.

Since fuzzy systems can tolerate uncertainty and can even utilize language-like vagueness to smooth data lags, they may offer robust, noise-tolerant models or predictions in situations where precise input is unavailable or too expensive.
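Language-like vagueness can be made concrete with membership functions that return a degree between 0 and 1 rather than a yes/no answer. The sketch below uses triangular membership functions and the common min/max operators for fuzzy AND and OR; the terms `warm` and `hot` and their temperature ranges are hypothetical.

```python
def triangular(x, left, peak, right):
    """Degree of membership rising linearly from left to peak, then falling to right."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical linguistic terms over temperature in degrees Celsius.
def warm(temp_c):
    return triangular(temp_c, 15.0, 22.0, 30.0)

def hot(temp_c):
    return triangular(temp_c, 25.0, 35.0, 45.0)

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)
```

A reading of 27 degrees is warm to degree 0.375 and hot to degree 0.2 at the same time, so a small measurement error shifts both degrees slightly instead of flipping a crisp category.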

Rough Sets: A rough set is defined by a lower and upper bound of a set. Every member of the lower bound is a certain member of the set. Every non-member of the upper bound is a certain non-member of the set. The upper bound of a rough set is the union of the lower bound and the so-called boundary region. A member of the boundary region is possibly (but not certainly) a member of the set. Therefore, rough sets may be viewed as fuzzy sets with a three-valued membership function (yes, no, perhaps). Like fuzzy sets, rough sets are a mathematical concept dealing with uncertainty in data ([42]). Also like fuzzy sets, rough sets are seldom used as a stand-alone solution; they are usually combined with other methods such as rule induction, classification, or clustering methods.
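The lower bound, upper bound and boundary region can be computed once it is decided which data items are indistinguishable. The sketch below assumes the usual construction in which items with identical attribute values form one group: a group lies in the lower bound if all of its members belong to the target set, and in the upper bound if at least one does. The patient data and the function name `approximations` are hypothetical.

```python
from collections import defaultdict

def approximations(objects, target):
    """Return (lower bound, upper bound, boundary region) of the target set.

    objects maps each item name to its tuple of attribute values;
    target is the set of item names known to belong to the concept.
    """
    # Group items that cannot be told apart by their attribute values.
    groups = defaultdict(set)
    for name, attributes in objects.items():
        groups[attributes].add(name)
    lower, upper = set(), set()
    for members in groups.values():
        if members <= target:  # every member is in the set: certain members
            lower |= members
        if members & target:   # at least one member is in the set: possible members
            upper |= members
    return lower, upper, upper - lower

# Hypothetical patients described by (temperature, cough); the target set is "has flu".
patients = {
    "p1": ("high", "yes"),
    "p2": ("high", "yes"),
    "p3": ("low", "no"),
    "p4": ("low", "yes"),
    "p5": ("low", "yes"),
}
lower, upper, boundary = approximations(patients, {"p1", "p2", "p4"})
```

Patients p1 and p2 are certain members (yes), p3 is a certain non-member (no), and p4 and p5 fall into the boundary region (perhaps) because they share the same attribute values but only p4 is in the target set.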

INVESTIGATION OF KNOWLEDGE DISCOVERY AND DATA MINING TOOLS USING A FEATURE CLASSIFICATION SCHEME:

In this section we first provide a feature classification scheme to study knowledge discovery and data mining tools. We then apply this scheme to review existing tools that are currently available, either as a research prototype or as a commercial product. Although not exhaustive, we believe that the reviewed products are representative of the current status of technology.

As discussed in Section 2, knowledge discovery and data mining tools require a tight integration with database systems or data warehouses for data selection, preprocessing, integration, transformation, etc. Not all tools have the same database characteristics in terms of data model, database size, queries supported, etc. Different tools may perform different data mining tasks and employ different methods to achieve their goals. Some may require or support more interaction with the user than others. Some may work on a stand-alone architecture while others may work on a client/server architecture. To capture all these differences, we propose a feature classification scheme that can be used to study knowledge discovery and data mining tools.

In this scheme, the tools' features are classified into three groups called general characteristics, database connectivity, and data mining characteristics, which are described below.

CONCLUSIONS AND FUTURE RESEARCH

Knowledge discovery can be broadly defined as the automated discovery of novel and useful information from commercial databases. Data mining is one step at the core of the knowledge discovery process, dealing with the extraction of patterns and relationships from large amounts of data.

Today, most enterprises are actively collecting and storing large databases. Many of them have recognized the potential value of these data as an information source for making business decisions. The dramatically increasing demand for better decision support is answered by an expanding availability of knowledge discovery and data mining products, in the form of research prototypes developed at various universities as well as software products from commercial vendors. In this paper, we provide an overview of common knowledge discovery tasks, approaches to solve these tasks, and available software tools employing these approaches.

However, despite its rapid growth, KDD is still an emerging field. The development of successful data mining applications remains a tedious process ([21]). The following is a (naturally incomplete) list of issues that are unexplored or at least not satisfactorily solved yet:

REFERENCES

Arbel, R. and Rokach, L., Classifier evaluation under limited resources, Pattern Recognition Letters, 27(14): 1619-1631, 2006.
Cohen, S., Rokach, L. and Maimon, O., Decision Tree Instance Space Decomposition with Grouped Gain-Ratio, Information Sciences, 177(17): 3592-3612, 2007.
Hastie, T., Tibshirani, R., Friedman, J. and Franklin, J., The elements of statistical learning: data mining, inference and prediction, The Mathematical Intelligencer, 27(2): 83-85, 2005.
Han, J. and Kamber, M., Data mining: concepts and techniques, Morgan Kaufmann, 2006.
Kriegel, H.-P., Borgwardt, K. M., Kröger, P., Pryakhin, A., Schubert, M. and Zimek, A., Future trends in data mining, Data Mining and Knowledge Discovery, 15(1): 87-97, 2007.
Larose, D.T., Discovering knowledge in data: an introduction to data mining, John Wiley and Sons, 2005.
Maimon, O. and Rokach, L., Data Mining by Attribute Decomposition with semiconductors manufacturing case study, in Data Mining for Design and Manufacturing: Methods and Applications, D. Braha (ed.), Kluwer Academic Publishers, pp. 311-336, 2001.
Maimon, O. and Rokach, L., Improving supervised learning by feature decomposition, Proceedings of the Second International Symposium on Foundations of Information and Knowledge Systems, Lecture Notes in Computer Science, Springer, pp. 178-196, 2002.
Maimon, O. and Rokach, L., Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications, Series in Machine Perception and Artificial Intelligence, Vol. 61, World Scientific Publishing, ISBN: 981-256-079-3, 2005.
