- Open Access
- Authors : Shivali, Joni Birla, Gurpreet
- Paper ID : IJERTCONV3IS10051
- Volume & Issue : NCETEMS – 2015 (Volume 3 – Issue 10)
- Published (First Online): 24-04-2018
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Knowledge Discovery in Data-Mining
Shivali1, Joni Birla2, Gurpreet3
1,2,3Department of Computer Science & Engineering,
Ganga Institute of Technology and Management, Kablana, Jhajjar, Haryana, India
Abstract- Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
A warehouse is a commercial building for storage of goods. It is used by manufacturers, importers, exporters, wholesalers, transport businesses, customs, etc. Warehouses are usually large plain buildings in industrial areas of cities, towns, and villages.
Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing. KDD is concerned with issues of scalability, the multi-step knowledge discovery process for extracting useful patterns and models from raw data stores (including data cleaning and noise modelling), and issues of making discovered patterns understandable.
Data Mining and Knowledge Discovery is intended to be the premier technical publication in the field, providing a resource collecting relevant common methods and techniques and a forum for unifying the diverse constituent research communities. The journal publishes original technical papers in both the research and practice of DMKD, surveys and tutorials of important areas and techniques, and detailed descriptions of significant applications. Short application summaries are published in a special section. The journal accepts paper submissions of any work relevant to DMKD. A summary of the scope of Data Mining and Knowledge Discovery includes:
Theory and Foundational Issues: Data and knowledge representation; modelling of structured, textual, and multimedia data; uncertainty management; metrics of interestingness and utility of discovered knowledge; algorithmic complexity, efficiency, and scalability issues in data mining; statistics over massive data sets.
Data Mining Methods: Classification, clustering, probabilistic modelling, prediction and estimation, dependency analysis, search, and optimization; algorithms for data mining, including spatial, textual, and multimedia data (e.g. the Web); scalability to large databases; parallel and distributed data mining techniques; and automated discovery agents.
Knowledge Discovery Process: Data pre-processing for data mining, including data cleaning, selection, efficient sampling, and data reduction methods; evaluating, consolidating, and explaining discovered knowledge; data and knowledge visualization; interactive data exploration and discovery.
Application Issues: Application case studies; data mining systems and tools; details of successes and failures of KDD; resource/knowledge discovery on the Web; privacy and security issues.
WHAT DOES KNOWLEDGE DISCOVERY IN DATABASES (KDD) MEAN?
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleansing, incorporating prior knowledge on data sets and interpreting accurate solutions from the observed results. Major KDD application areas include marketing, fraud detection, telecommunication and manufacturing.
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING
The book Advances in Knowledge Discovery and Data Mining brings together the latest research in statistics, databases, machine learning, and artificial intelligence that is part of the exciting and rapidly growing field of Knowledge Discovery and Data Mining. Topics covered include fundamental issues, classification and clustering, trend and deviation analysis, dependency modeling, integrated discovery systems, next generation database systems, and application case studies. The contributors include leading researchers and practitioners from academia, government laboratories, and private industry.
The last decade has seen an explosive growth in the generation and collection of data. Advances in data collection, widespread use of bar codes for most commercial products, and the computerization of many business and government transactions have flooded us with data and generated an urgent need for new techniques and tools that can intelligently and automatically assist in transforming this data into useful knowledge. This book is a timely and comprehensive overview of the new generation of techniques and tools for knowledge discovery in data.
Fig 1. Knowledge Discovery Process
WHAT IS THE KNOWLEDGE DISCOVERY PROCESS?
Because there is some confusion about the terms data mining (DM), knowledge discovery, and knowledge discovery in databases, we first define them. Note, however, that although many researchers and practitioners use DM as a synonym for knowledge discovery, DM is also just one step of the knowledge discovery process (KDP).
Data mining is also known under many other names, including knowledge extraction, information discovery, information harvesting, data archeology, and data pattern processing.
The knowledge discovery process (KDP), also called knowledge discovery in databases, seeks new knowledge in some application domain. It is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The process generalizes to non-database sources of data, although it emphasizes databases as a primary source of data. It consists of many steps (one of them is DM), each attempting to complete a particular discovery task and each accomplished by the application of a discovery method. Knowledge discovery concerns the entire knowledge extraction process, including how data are stored and accessed, how to use efficient and scalable algorithms to analyze massive datasets, how to interpret and visualize the results, and how to model and support the interaction between human and machine. It also concerns support for learning and analyzing the application domain.
Here the term knowledge extraction is used in a narrow sense. Although extracting knowledge from data can be accomplished through a variety of methods, some not even requiring the use of a computer, the term is used here to refer to knowledge obtained from a database or from textual data via the knowledge discovery process.
STEPS OF THE KNOWLEDGE DISCOVERY IN DATABASES PROCESS
Data mining is the core step in the Knowledge Discovery in Databases (KDD) process. Although KDD is often used as a synonym for data mining, the two are actually different. Some preprocessing steps before data mining and post-processing steps after it must be completed to transform the raw data into useful knowledge. Thus, data mining alone might not give you what you are actually looking for.
Fig 2. Data Mining process
KDD is an iterative process that transforms raw data into useful information. The different steps of Knowledge Discovery in Databases are:
Understanding
The first step is understanding the requirements. You need a clear understanding of the application domain and of your objectives, whether that is to improve your sales, predict the stock market, etc. You should also know whether you are going to describe your data or predict information.
Selection of data set
Data mining is done on your current or past records. Thus, you should select a data set or a subset of data (in other words, data samples) on which to perform data analysis and obtain useful knowledge. You should have enough data to perform data mining.
Data cleaning is the step where noise and irrelevant data are removed from the large data set. This is a very important preprocessing step because your outcome would be dependent on the quality of selected data. As part of data cleaning, you might have to remove duplicate records, enter logically correct values for missing records, remove unnecessary data fields, standardize data format, update data in a timely manner and so on.
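The cleaning operations listed above can be illustrated with a small sketch. It is only one possible approach under stated assumptions: the records, field names, and date formats are hypothetical, and filling a missing amount with the mean of the known amounts is just one way of entering a "logically correct" value.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"id": 1, "date": "2015-03-01", "amount": 120.0, "note": "x"},
    {"id": 1, "date": "2015-03-01", "amount": 120.0, "note": "x"},  # duplicate
    {"id": 2, "date": "02/03/2015", "amount": None, "note": "y"},   # missing value
    {"id": 3, "date": "2015-03-03", "amount": 80.0, "note": "z"},
]


def standardize_date(text):
    """Return the date as ISO 'YYYY-MM-DD', trying each assumed input format."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(records, keep_fields=("id", "date", "amount")):
    # 1. Remove unnecessary fields and standardize the date format.
    trimmed = [
        {k: (standardize_date(r[k]) if k == "date" else r[k]) for k in keep_fields}
        for r in records
    ]
    # 2. Remove duplicate records while preserving the original order.
    seen, unique = set(), []
    for r in trimmed:
        key = tuple(r[k] for k in keep_fields)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # 3. Fill missing amounts with the mean of the known amounts (an assumption).
    known = [r["amount"] for r in unique if r["amount"] is not None]
    mean_amount = sum(known) / len(known)
    for r in unique:
        if r["amount"] is None:
            r["amount"] = mean_amount
    return unique


cleaned = clean(raw_records)
for record in cleaned:
    print(record)
```

In practice the choice of fill value and of which fields count as unnecessary depends on the objectives fixed in the understanding step.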
Data transformation
With the help of dimensionality reduction or transformation methods, the number of effective variables is reduced and only useful features are selected to depict the data more efficiently, based on the goal of the task. In short, the data is transformed into an appropriate form, making it ready for the data mining step.
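As a minimal sketch of reducing the number of variables, the example below drops columns whose values never vary, since a constant column cannot help distinguish records. Variance thresholding is only one simple selection method, chosen here for illustration; the column names, data, and the `select_by_variance` helper are hypothetical.

```python
from statistics import pvariance

# Hypothetical data set: each row is a record, each column a numeric variable.
columns = ["age", "country_code", "income"]
rows = [
    [25, 1, 30000.0],
    [40, 1, 52000.0],
    [31, 1, 41000.0],
    [58, 1, 75000.0],
]


def select_by_variance(rows, columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    keep = [
        j for j in range(len(columns))
        if pvariance([row[j] for row in rows]) > threshold
    ]
    reduced_rows = [[row[j] for j in keep] for row in rows]
    return [columns[j] for j in keep], reduced_rows


kept_columns, reduced = select_by_variance(rows, columns)
print(kept_columns)
print(reduced)
```

Here `country_code` is constant and is removed, leaving two variables instead of three.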
Selection of data mining task
Based on the objective of data mining, an appropriate task is selected. Some common data mining tasks are classification, clustering, association rule discovery, sequential pattern discovery, regression, and deviation detection. We can choose any of these tasks based on whether we need to predict information or describe information.
Selection of data mining algorithm
One or more appropriate methods are selected for searching for patterns in the data. You need to decide which models and parameters might be appropriate for the method. Some popular data mining methods are decision trees and rules, relational learning models, example-based methods, etc.
Data mining is the actual search for patterns from the data available using the selected data mining method.
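To make the idea of a "search for patterns" concrete, the sketch below fits a one-level decision tree (a decision stump) by trying every candidate threshold on a single feature and keeping the one with the fewest misclassified examples. The data and function names are hypothetical, and the sketch assumes exactly two class labels; real decision tree learners are considerably more elaborate.

```python
# Hypothetical labelled examples: (feature value, class label).
examples = [
    (1.0, "low"), (2.0, "low"), (3.0, "low"),
    (6.0, "high"), (7.0, "high"), (8.0, "high"),
]


def fit_stump(examples):
    """Search the midpoints between sorted values for the rule with fewest errors.

    Assumes exactly two distinct class labels.
    Returns (errors, threshold, label_below, label_above).
    """
    values = sorted({x for x, _ in examples})
    labels = sorted({y for _, y in examples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        # Try both assignments of the two labels to the two sides of the split.
        for below, above in ((labels[0], labels[1]), (labels[1], labels[0])):
            errors = sum(
                1 for x, y in examples
                if (below if x <= threshold else above) != y
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best


errors, threshold, below, above = fit_stump(examples)


def predict(x):
    """Apply the mined rule to a new feature value."""
    return below if x <= threshold else above


print(f"if x <= {threshold}: {below}, else: {above} ({errors} training errors)")
```

The mined pattern is the rule itself, which is then passed on to the evaluation step.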
Pattern evaluation is a post-processing step in KDD that interprets mined patterns and relationships. If the evaluated pattern is not useful, the process may start again from any of the previous steps, thus making KDD an iterative process.
Knowledge presentation is the final step in Knowledge Discovery in Databases (KDD). The knowledge discovered is consolidated and presented to the user in a simple and easy-to-understand format. Visualization techniques are mostly used to help users understand and interpret the information.
Although these are the main steps in any KDD process, some of the steps can be combined during the actual process. For example, for convenience, data selection and data transformation can be combined. Even after knowledge has been presented to the user, new data can be added to the data set, the mining can be further refined, or a different data mining method can be chosen to get more accurate results. Thus, KDD is a thoroughly iterative process.
When we analyze the different steps of the KDD process, we can see that we are mining data to get useful information or knowledge. Thus, knowledge mining would be a more appropriate term than data mining.
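The iterative character of the process can be sketched as a loop that mines, evaluates, and, when the result is not good enough, goes back and tries a different method. Everything in this example is a hypothetical stand-in: the two "methods" simply summarize a list of numbers, and the evaluation score is the fraction of data points lying within 10 units of the summary.

```python
from statistics import mean, median

# Hypothetical data with one outlier.
data = [10, 11, 12, 13, 200]


def evaluate(pattern, data, tolerance=10):
    """Score a pattern as the fraction of points within `tolerance` of it."""
    return sum(1 for x in data if abs(x - pattern) <= tolerance) / len(data)


def kdd_loop(data, candidate_methods, min_score):
    """Try each mining method in turn until one yields a useful pattern."""
    history = []
    for name, mine in candidate_methods:
        pattern = mine(data)              # data mining step
        score = evaluate(pattern, data)   # pattern evaluation step
        history.append((name, score))
        if score >= min_score:            # useful: stop and present the result
            return name, pattern, history
        # Otherwise loop back and try a different method.
    return None, None, history


chosen, pattern, history = kdd_loop(
    data, [("mean", mean), ("median", median)], min_score=0.75
)
print(chosen, pattern, history)
```

The first method is rejected because the outlier distorts it, and the loop returns to the method selection step, which mirrors the choice of "a different data mining method" described above.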
KNOWLEDGE DISCOVERY PROCESS
Although knowledge discovery process (KDP) models usually emphasize independence from specific applications and tools, they can be broadly divided into those that take into account industrial issues and those that do not. However, the academic models, which usually are not concerned with industrial issues, can be made applicable relatively easily in the industrial setting, and vice versa. We restrict our discussion to those models that have been popularized in the literature and have been used in real knowledge discovery projects.
ACADEMIC RESEARCH MODELS
The efforts to establish a KDP model were initiated in academia. In the mid-1990s, when the DM field was being shaped, researchers started defining multistep procedures to guide users of DM tools in the complex knowledge discovery world. The main emphasis was to provide a sequence of activities that would help to execute a KDP in an arbitrary domain. The two process models developed in 1996 and 1998 are the nine-step model by Fayyad et al. and the eight-step model by Anand and Buchner. Below we introduce the first of these, which is perceived as the leading research model. The second model is summarized
KNOWLEDGE DISCOVERY PROCESS
The Fayyad et al. KDP model consists of nine steps, which are outlined as follows:
1. Developing and understanding the application domain. This step includes learning the relevant prior knowledge and the goals of the end user of the discovered knowledge.
2. Creating a target data set. Here the data miner selects a subset of variables (attributes) and data points (examples) that will be used to perform discovery tasks. This step usually includes querying the existing data to select the desired subset.
3. Data cleaning and preprocessing. This step consists of removing outliers, dealing with noise and missing values in the data, and accounting for time sequence information and known changes.
4. Data reduction and projection. This step consists of finding useful attributes by applying dimension reduction and transformation methods, and finding invariant representation of the data.
5. Choosing the data mining task. Here the data miner matches the goals defined in Step 1 with a particular DM method, such as classification, regression, clustering, etc.
6. Choosing the data mining algorithm. The data miner selects methods to search for patterns in the data and decides which models and parameters of the methods used may be appropriate.
7. Data mining. This step generates patterns in a particular representational form, such as classification rules, decision trees, regression models, trends, etc.
8. Interpreting mined patterns. Here the analyst performs visualization of the extracted patterns and models, and visualization of the data based on the extracted models.
9. Consolidating discovered knowledge. The final step consists of incorporating the discovered knowledge into the performance system, and documenting and reporting it to the interested parties. This step may also include checking and resolving potential conflicts with previously believed knowledge.
Notes: This process is iterative. The authors of this model declare that a number of loops between any two steps are usually executed, but they give no specific details. The model provides a detailed technical description with respect to data analysis but lacks a description of business aspects. This model has become a cornerstone of later models.
Major Applications: The nine-step model has been incorporated into a commercial knowledge discovery system called MineSet (for details, see Purple Insight Ltd). The model has been used in a number of different domains, including engineering, medicine, production, e-business, and software development.
INDUSTRIAL MODELS
Industrial models quickly followed academic efforts. Several different approaches were undertaken, ranging from models proposed by individuals with extensive industrial experience to models proposed by large industrial consortiums. Two representative industrial models are the five-step model by Cabena et al., with support from IBM, and the industrial six-step CRISP-DM model, developed by a large consortium of European companies. The latter has become the leading industrial model, and is described in detail next.
The CRISP-DM (Cross-Industry Standard Process for Data Mining) was first established in the late 1990s by four companies: Integral Solutions Ltd. (a provider of commercial data mining solutions), NCR (a database provider), DaimlerChrysler (an automobile manufacturer), and OHRA (an insurance company). The last two companies served as data and case study sources.
The development of this process model enjoys strong industrial support. It has also been supported by the ESPRIT program funded by the European Commission. The CRISP-DM Special Interest Group was created with the goal of supporting the developed process model. Currently, it includes over 300 users and tool and service providers.
THE KNOWLEDGE DISCOVERY
The CRISP-DM KDP model consists of six steps, which are summarized below:
1. Business understanding. This step focuses on the understanding of objectives and requirements from a business perspective. It also converts these into a DM problem definition, and designs a preliminary project plan to achieve the objectives. It is further broken into several substeps, namely,
determination of business objectives,
assessment of the situation,
determination of DM goals, and
generation of a project plan.
2. Data understanding. This step starts with initial data collection and familiarization with the data. Specific aims include identification of data quality problems, initial insights into the data, and detection of interesting data subsets. Data understanding is further broken down into
collection of initial data,
description of data,
exploration of data, and
verification of data quality.
3. Data preparation. This step covers all activities needed to construct the final dataset, which constitutes the data that will be fed into DM tool(s) in the next step. It includes table, record, and attribute selection; data cleaning; construction of new attributes; and transformation of data. It is divided into
selection of data,
cleansing of data,
construction of data,
integration of data, and
formatting of data substeps.
4. Modeling. At this point, various modeling techniques are selected and applied. Modeling usually involves the use of several methods for the same DM problem type and the calibration of their parameters to optimal values. Since some methods may require a specific format for input data, reiteration into the previous step is often necessary. This step is subdivided into
selection of modeling technique(s),
generation of test design,
creation of models, and
assessment of generated models.
5. Evaluation. After one or more models have been built that have high quality from a data analysis perspective, the model is evaluated from a business objective perspective. A review of the steps executed to construct the model is also performed. A key objective is to determine whether any important business issues have not been sufficiently considered. At the end of this phase, a decision about the use of the DM results should be reached. The key substeps in this step include
evaluation of the results,
process review, and
determination of the next step.
6. Deployment. Now the discovered knowledge must be organized and presented in a way that the customer can use. Depending on the requirements, this step can be as simple as generating a report or as complex as implementing a repeatable KDP. This step is further divided into
plan deployment,
plan monitoring and maintenance,
generation of final report, and
review of the process substeps.
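The six steps and their substeps can also be written down as a simple data structure, which makes it easy to generate a project checklist. The step and substep names below are taken from the summary above; the `CRISP_DM` constant, the `checklist` helper, and the numbering scheme are illustrative assumptions and are not part of CRISP-DM itself.

```python
# Steps and substeps of the CRISP-DM model as summarized in the text.
CRISP_DM = [
    ("Business understanding", [
        "determination of business objectives",
        "assessment of the situation",
        "determination of DM goals",
        "generation of a project plan",
    ]),
    ("Data understanding", [
        "collection of initial data",
        "description of data",
        "exploration of data",
        "verification of data quality",
    ]),
    ("Data preparation", [
        "selection of data",
        "cleansing of data",
        "construction of data",
        "integration of data",
        "formatting of data",
    ]),
    ("Modeling", [
        "selection of modeling technique(s)",
        "generation of test design",
        "creation of models",
        "assessment of generated models",
    ]),
    ("Evaluation", [
        "evaluation of the results",
        "process review",
        "determination of the next step",
    ]),
    ("Deployment", [
        "plan deployment",
        "plan monitoring and maintenance",
        "generation of final report",
        "review of the process",
    ]),
]


def checklist(steps):
    """Return one numbered line per substep, e.g. '3.2 cleansing of data'."""
    return [
        f"{i}.{j} {substep}"
        for i, (_, substeps) in enumerate(steps, start=1)
        for j, substep in enumerate(substeps, start=1)
    ]


lines = checklist(CRISP_DM)
for line in lines:
    print(line)
```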
CONCLUSION
In this paper, the characteristics of knowledge discovery in data mining were studied. We have concentrated on different aspects of the topic: the meaning of KDD, the KDD process, the steps of knowledge discovery in databases, academic research models, industrial models, and the knowledge discovery process.