An Analytical Review of Data Mining Tools

DOI : 10.17577/IJERTV4IS040611

Igiri Chinwe Peace

Department of Computer Engineering, Rivers State College of Arts and Science, Port Harcourt, Nigeria

Abstract: Data mining plays a vital role in contemporary society and in the corporate world. Young researchers often face the challenge of choosing a data mining tool with which to carry out their research. This paper reviews five data mining tools: Waikato Environment for Knowledge Analysis (WEKA), Konstanz Information Miner (KNIME), GhostMiner, R Analytical Tool To Learn Easily (Rattle) and RapidMiner. It evaluates their capabilities, attributes and sources, and explores their strengths and weaknesses. The review establishes that WEKA, KNIME, Rattle and RapidMiner are open source data mining tools provided under GNU GPL licenses, while GhostMiner is commercial.

Keywords: Classification; Clustering; Data mining; Open source.

  1. INTRODUCTION

    The late 20th century and the 21st century have witnessed the growing importance of information technology in both the social and the corporate environment. Different aspects of information technology are needed in decision support, business performance management, query and reporting, analytical processing and predictive analysis, among others. One area of information technology that has become increasingly important in decision support and predictive analysis is data mining. Research indicates that data mining plays a significant role in helping businesses evaluate data and so make informed decisions regarding different aspects of their processes and operations [4]. From this perspective, this paper evaluates a number of data mining tools: Waikato Environment for Knowledge Analysis (WEKA), Konstanz Information Miner (KNIME), GhostMiner, R Analytical Tool To Learn Easily (Rattle) and RapidMiner.

  2. CRITICAL REVIEW OF DATA MINING TOOLS

    1. Waikato Environment for Knowledge Analysis (WEKA)

      WEKA is one of the most recognized data mining and machine learning software packages. According to [3], WEKA was established in 1992 with funding from the New Zealand government. The software is renowned for its data mining capabilities and is based on the Java platform. One of its main capabilities is data pre-processing, that is, transforming raw data into a format that the mining algorithms can work with. According to [3], WEKA has a wide array of data pre-processing tools, or filters, that enable users to perform different functions on data. These filters include Add classification, Add ID, Add values, Attribute reorder, Interquartile range, Kernel filter, Numeric cleaner, Numeric to nominal, Partitioned multi-filter, Propositional to multi-instance and vice versa, Random subset, RELAGGS, Reservoir sample, Subset by expression and Wavelet. The software also has data classification capabilities. It classifies data in four main steps: preparing the data, selecting the Classify option and applying an algorithm, generating trees, and analyzing the output or results [11]. In addition, WEKA clusters data through unsupervised algorithms; the methods it uses include K-means clustering, hierarchical clustering and density based clustering. WEKA supports several data file formats including ARFF, CSV, LibSVM's format and C4.5's format, while WEKA 3.6 supports import of PMML regression models [3]. Moreover, it can read data from files, URLs and databases that provide a JDBC driver. With regard to its user interface, WEKA has four main components, namely the Explorer, the Experimenter, the Knowledge Flow and the Simple CLI. WEKA is also extensible, and this extensibility is implemented using three plugins. Importantly, WEKA is open source, is distributed freely under the GNU General Public License, and is fully portable.
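      As an illustration of the K-means method mentioned above, the following is a minimal Python sketch of the standard assign-and-update loop. It is not WEKA's implementation: the function names `squared_distance` and `kmeans`, the sample points, and the choice to seed the centroids with the first k points are all assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption: seed centroids with the first k points so the run is
    # reproducible; practical tools usually randomise this step.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the loop has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

      The sketch is unsupervised in the sense used above: no class labels are supplied, and the grouping emerges only from the distances between points.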

    2. RapidMiner

      RapidMiner is a data mining tool of the Data Mining Suite (DMS) type. While most DMS tools are commercial, RapidMiner is open source software distributed under the GNU Affero General Public License (AGPL). RapidMiner is Java-based and has capabilities for web mining, data mining and text mining [10]. It has an array of features. According to [9], these include multiple easy-to-use user interfaces, easy access to and management of data, model import and export, scoring (an operator for applying models to data sets), data partitioning, data replacement, graphs and visualization, attribute generation and Bayesian modelling, among others. Similarly, [10] notes that its text mining operators include tokenization, extraction, stemming, transformation and utility operators, as well as operators for creating, writing, reading and processing documents. RapidMiner therefore has both text mining and text processing capabilities. According to [7], RapidMiner has more than 400 operators for data mining in Java. In addition, the software has a user friendly graphical user interface (GUI) with drag-and-drop features, and an application wizard that helps process data automatically depending on the project objectives. For example, the wizard can process data for churn analysis, direct marketing and sentiment analysis. Importantly, while RapidMiner is open source, it also has a number of commercially available versions.
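      To make the tokenization and stemming operators described above more concrete, the following minimal Python sketch shows what such steps do to a short text. It does not use or reproduce RapidMiner's operators: the function names, the sample sentence and the four-entry suffix list are hypothetical and deliberately simplistic.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix list; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_counts(text):
    """Count stemmed tokens, a common input to later text mining steps."""
    return Counter(stem(t) for t in tokenize(text))


counts = term_counts("Mining texts, mined text!")
print(counts)
```

      In this sketch "mining" and "mined" collapse to one term, and so do "texts" and "text", which is the purpose of stemming: different surface forms of a word are counted together.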

    3. Konstanz Information Miner (KNIME)

      Konstanz Information Miner (KNIME) is an open source data mining, data analytics, integration and reporting platform. It is based on Eclipse, written in Java, and released under the GPLv3 license [2]. For users who need dedicated support, there is a commercial version of KNIME, which is more appropriate for large businesses and organizations. Nonetheless, most users would still find the free version useful in their data mining activities. Importantly, the software has a graphical user interface that helps users to quickly and easily assemble nodes for data processing, that is, extraction, transformation and loading (ETL), in order to model, pre-process and visualize data [2]. KNIME provides modules for loading, processing, analysing, transforming and visually exploring data, and these are not tied to any particular application area. Its main strengths include frequent updates, versatility and adaptability. According to [6], KNIME exhibits its adaptability through its ability to work with data prepared in other software such as Python, R and Java. Nevertheless, it is vital to note that KNIME's capability to read data from sources is limited to ARFF files (that is, WEKA files), databases and text-delimited files, which include CSV files [1]. KNIME cannot read data from Microsoft Excel files, LIBSVM files or SVMlight files, and it does not read SPSS files. Consequently, individuals who already use the LIBSVM and SVMlight file formats will experience difficulties in using this software.
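      Since ARFF and text-delimited files recur throughout this review, a small Python sketch of how a CSV file maps onto the ARFF layout may be helpful. ARFF is a plain text format with an `@relation` line, one `@attribute` line per column and an `@data` section. The function `csv_to_arff` and the sample data below are assumptions for illustration only; the sketch ignores quoting of names, missing values and date or string attributes, all of which a real converter would need to handle.

```python
import csv
import io


def is_number(value):
    """Return True when the text can be parsed as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="data"):
    """Convert CSV text with a header row into simplified ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        column = [record[i] for record in records]
        if all(is_number(v) for v in column):
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the distinct values in sorted order.
            values = ",".join(sorted(set(column)))
            lines.append(f"@attribute {name} {{{values}}}")
    lines += ["", "@data"]
    lines += [",".join(record) for record in records]
    return "\n".join(lines) + "\n"


# Hypothetical sample data.
sample = "outlook,temperature,play\nsunny,85,no\nrainy,70,yes\n"
arff = csv_to_arff(sample, relation="weather")
print(arff)
```

      The main difference from CSV is that ARFF declares each column's type up front, which is why a tool that reads ARFF does not have to guess whether a column is numeric or nominal.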

    4. GhostMiner

      GhostMiner is a data mining software that was developed by Fujitsu and is available commercially. According to [8], GhostMiner supports an array of data formats and databases (including spreadsheets) as well as mature machine learning algorithms. The software plays a vital role in data selection and preparation, visualization, and validation of models and multimodels. It has a number of key features. To begin with, GhostMiner can process data from a number of file formats including spreadsheets, text and ASCII files. Importantly, the software has attributes that enable its users to develop simple user interfaces depending on their data mining needs [12]. Its main capabilities also include data visualization and data pre-processing. The data pre-processing capabilities enable GhostMiner to perform most of the statistical functions used in preliminary data analysis, such as calculating the mean and median across the whole data set, on text, ASCII and database files [12]. With regard to visualization, GhostMiner can show how one set of data is related to another. In addition to these capabilities, the software has a range of statistical tools, data modelling tools and methods such as cross-validation and Xtest to evaluate the accuracy of data under study. However, this tool is limited when dealing with complex data mining processes, and it has limited support compared to similar products from companies such as IBM and Oracle. Therefore, the software is recommended for small and midsize users rather than large companies.

    5. R Analytical Tool To Learn Easily (Rattle)

      Rattle is an open source data mining software that is written in the R programming language and provides a link into R. According to [13], Rattle utilizes a Gnome graphical user interface, which is implemented through the RGtk2 package. Importantly, this software's graphical user interface was created using an interactive interface builder, Glade, and as such, the software's interface is independent of the XML description. Rattle supports a number of data formats such as CSV, TXT, ARFF and R datasets, as well as ODBC database connections to a number of data sources such as MySQL, Oracle, Teradata, SQLite, SQL Server, IBM DB2, Microsoft Access, Microsoft Excel and PostgreSQL, among others [13]. It is also essential to note that Rattle has two refined tools, Latticist and GGobi, which support interactive graphical analysis of data. On one hand, the GGobi tool plays a significant role in exploring high-dimensional data using its highly interactive and dynamic graphics such as bar charts, tours, parallel coordinate plots and scatterplots [13]. Importantly, GGobi needs to be installed separately on the system and supports environments such as Windows, OS/X and Linux. On the other hand, the Latticist package uses a graphical interface with lattice graphics to analyze data, and the tool supports data selection, annotation, plots, brushing and subsetting, among others [13]. In addition, Rattle is able to perform a number of functions including selection, exploration, graphics, transformation, clustering, association, modelling and evaluation of data.

  3. DISCUSSION

    The capabilities and features of these tools are as follows:

    1. Programming language

    • WEKA – Java

    • RapidMiner – Java

    • KNIME – Java

    • GhostMiner – Not specified

    • Rattle – R

    2. Capabilities

    • WEKA – Data pre-processing, classification, data clustering, etc.

    • RapidMiner – Pre-processing, classification, text mining, visualization, application wizard.

    • KNIME – Data transformation, visualization, frequent data update, versatility, etc.

    • GhostMiner – Pre-processing, visualization, model validation, etc.

    • Rattle – Data transformation, association, classification, clustering, interactive graphical analysis, etc.

    3. Supported file format

    • WEKA – ARFF, CSV, C4.5, etc.

    • RapidMiner – CSV, DB, Spreadsheet, SPSS, etc.

    • KNIME – ARFF

    • GhostMiner – Spreadsheet, text and ASCII files.

    • Rattle – CSV, ARFF, R dataset, ODBC database.

    4. Supported data source

    • WEKA – URLs, database, JDBC, etc.

    • RapidMiner – URLs, database, SPSS, etc.

    • KNIME – None specified

    • GhostMiner – None specified

    • Rattle – MySQL, Oracle, MS Access, MS Excel, etc.

    5. Available sources

    • WEKA – Free open source

    • RapidMiner – Free open source, with commercial versions

    • KNIME – Free open source

    • GhostMiner – Commercial

    • Rattle – Free open source

    Table 1: Summary of strengths and weaknesses of the data mining tools.

    Data mining tool | Strengths | Weaknesses
    WEKA | Robust data mining techniques; supports web services | Lack of colourful visualization
    RapidMiner | User friendly; application wizard; high level visualization support; supports web services | Need for programming knowledge to use the software effectively
    KNIME | Adaptability; no need for programming knowledge | Can only support ARFF file format; no support for web services
    GhostMiner | Wide range of statistical and data modeling techniques | Cannot handle complex data; only for small or medium business
    Rattle | Active user group | Not meant for novices or learners

  4. CONCLUSION

In conclusion, data mining plays a critical role in modern decision making in the business environment. Businesses can use a number of data mining software packages to process and analyze data, including Rattle, GhostMiner, KNIME, RapidMiner and WEKA. While Rattle, KNIME, RapidMiner and WEKA are open source software provided for free under GNU GPL licenses, GhostMiner is commercial and can only be obtained for a fee. In addition, while Rattle, KNIME, RapidMiner and WEKA can perform high level data mining functions, GhostMiner is limited to low level data mining functions. A summary of the strengths and weaknesses of these data mining tools is shown in Table 1.

REFERENCES

  1. Barcaroli, G., Bergamasco, S., Jouvenal, M., Pieraccini, G. and Tininini, L. Generalised software for statistical cooperation, 2008, [Online]. Available from: www.researchgate.net/profile/Giulio_Barcaroli/publication/263923216_Generalised_software_for_statistical_cooperation/links/0f31753c52fc97f19d000000.pdf [Accessed 29 March 2015].

    Akhtar, F. and Hahne, C. Rapid Miner 5 Operator Reference. Rapid-I GmbH, 2012. Retrieved 13:15, February 13, 2015 from: http://rapidminer.com/wpcontent/uploads/2013/10/RapidMiner_OperatorReference_en.pdf

  2. Bulusu, L. Open source data warehousing and business intelligence. Boca Raton, FL: CRC Press, 2012.

    Peterson, Leif E. "K-nearest neighbor." Scholarpedia 4(2), 1883, 2009.

  3. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I. H. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18, 2009.

  4. Joseph, M. V. Significance of data warehousing and data mining in business applications. International Journal of Soft Computing and Engineering, 3(1), 329-333, 2013.

  5. Jovic, A., Brkic, K. and Bogunovic, N. An overview of free software tools for general data mining. Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention, 1112-1117, 2014.

  6. Lin, N. Applied business analytics: Integrating business process, big data, and advanced analytics. New Jersey: Pearson Education Ltd., 2014.

  7. Melin, P., Kacprzyk, J. and Pedrycz, W. Soft computing for recognition based on biometrics. Berlin Heidelberg: Springer Science & Business Media, 2010.

  8. Pierre, T. F. Software applications: Concepts, methodologies, tools, and applications. Hershey, PA: IGI Global, 2009.

  9. Rapid-I GmbH. (n.d.) Fact Sheet: RapidMiner and RapidAnalytics. Business analytics fast and powerful [Online]. Available from: http://www.rapidi.com/downloads/brochures/RapidMiner_Fact_Sheet.pdf [Accessed 29 March 2015].

  10. Shterev, J. Demo: Using RapidMiner for Text Mining. Digital Presentation and Preservation of Cultural and Scientific Heritage, 3, 254-256, 2013.

  11. Singhal, S. and Jena, M. A study on WEKA Tool for data pre-processing, classification and clustering. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 2(6), 250-253, 2013.

  12. Wang, J., Hu, X., Hollister, K. and Zhu, D. A comparison and scenario analysis of leading data mining software. In Selected readings on information technology management: Contemporary issues, ed. by Kelley, G. Hershey, PA: IGI Global, 2008.

  13. Williams, G. J. Rattle: A data mining GUI for R. The R Journal, 1(2), 45-55, 2009.
