Comparative Analysis of Different Clustering Techniques for Data Analytics

DOI : 10.17577/IJERTV4IS041394


Dr. Gagandeep

Assistant Professor, Department of Computer Science,

Punjabi University, Patiala (PUNJAB)

Navneet Kaur

M.Phil Scholar, Department of Computer Science,

Punjabi University, Patiala (PUNJAB)

Abstract- The basic principle of data mining is to analyze data and extract useful information from it. Data mining is the core stage of the Knowledge Discovery in Databases (KDD) process, and clustering, or cluster analysis, is one of its main tasks. Data mining software allows users to analyze data. This paper introduces the key clustering facilities of the WEKA tool. WEKA is a data mining tool used to analyze data; it supports clustering of data through various algorithms.

Keywords: Cluster Analysis, Clustering, Data Mining.

  1. INTRODUCTION

    We live in a world of data. A large number of users generate data every day through various sources, including cell phones, e-mails, file attachments, online chat, uploading pictures or videos, online shopping, e-banking, etc. Big data refers to a collection of large data sets that grows day by day and cannot be handled by traditional data management techniques such as DBMS and RDBMS. Big data has five characteristics, known as the 5 Vs, which are used to understand its nature: volume, variety, velocity, veracity, and value. Along with these characteristics, big data also faces several challenges. The first is privacy and security, i.e., the challenge of keeping the data secure. The next is information sharing. The third is the analytical challenge: how can the data be analyzed if it grows continuously, and what techniques and tools should be used to analyze it? A further challenge is human resources and manpower. Finally, the technical challenge includes data quality, fault tolerance, and heterogeneous data.

    Given the analytical challenge, it is essential to analyze the data so that useful information can be retrieved from it. Various analysis techniques are available, such as data mining, data clustering, machine learning, text analytics, association rule learning, and classification; clustering in particular helps divide structured data into groups. WEKA is a data analysis tool that supports classification, clustering, and visualization of data, and it is regarded as one of the best data analysis tools.

  2. CLUSTERING ANALYSIS

    Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters). In other words, clustering is a technique for finding similarity groups in data, called clusters: objects with similar features are collected in one group, so that objects are more similar to one another within the same cluster and dissimilar to objects in other clusters. It is a main task of statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, and information retrieval.

    Cluster analysis itself is not one specific algorithm but the general task to be solved. By using clustering techniques we can identify dense and sparse regions in the object space, and we can also discover overall distribution patterns and correlations between data attributes.

    A simple example of clustering is a library system, in which books relate to different subjects. Each subject has its own section (cluster), and each subject is further categorized into subgroups. For example, the computer applications subject is subdivided into networking, software engineering, programming languages, etc., and the books are arranged according to these subgroups.

  3. TYPES OF CLUSTERING

    Clustering methods may be categorized into the following types:

    1. Centroid-Based Clustering: In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the dataset. It is assumed that each cluster has at least one object and each object belongs to only one cluster. The best-known example is K-Means clustering.

    2. Distribution-Based Clustering: The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects most likely belonging to the same distribution. Example: Expectation Maximization.

    3. Density-Based Methods: In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that separate clusters are usually considered noise or border points.

      The most popular density-based clustering method is DBSCAN.

    4. Hierarchical Clustering: Hierarchical methods obtain a nested partition of the objects, resulting in a tree of clusters. These methods either start with one cluster and then split it into smaller clusters, or start with each object in an individual cluster and try to merge similar clusters into larger and larger clusters. The two types of hierarchical methods are divisive and agglomerative.

    5. Grid-Based Clustering: Here the object space, rather than the data, is divided into a grid. Grid partitioning is based on characteristics of the data, and such methods can deal with non-numeric data more easily and are not affected by data ordering.

    6. Model-Based Clustering: This model is based on a probability distribution. The algorithm tries to build clusters with a high level of similarity within them and a low level of similarity between them. The similarity measurement is based on mean values, and the algorithm tries to minimize the squared-error function.

  4. WEKA : DATA MINING TOOL

    WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. The weka is a flightless bird endemic to New Zealand. The timeline of Weka's development history is as follows:

    • Late 1992: funding applied for by Ian Witten

    • 1993: development of the interface and infrastructure

    • Sometime in 1994: first internal release of Weka

    • October 1996: first public release of Weka (v2.1)

    • Early 1997: decision made to rewrite Weka in Java

      • Originated from code written by Eibe Frank for his Ph.D.

      • Originally codenamed JAWS (Java Weka System)

    • July 1997: Weka 2.2

    • May 1998: Weka 2.3

    • Mid 1999: Weka 3 (100% Java) released

      The main reason for selecting WEKA is its accessibility: it is open source and freely available. WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from the user's own Java code.
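      As a small illustration of the second option, the sketch below applies a clusterer from user Java code. It assumes the standard WEKA 3 API with weka.jar on the classpath; the file name diabetes.arff is illustrative.

```java
// Minimal sketch: calling a WEKA clusterer from user code.
// Assumes weka.jar on the classpath; "diabetes.arff" is an illustrative file name.
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaFromJava {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);      // request two clusters
        km.buildClusterer(data);   // learn the clustering
        System.out.println(km);    // print centroids and cluster sizes
    }
}
```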

      Fig.1 Interface of WEKA Tool

      Weka's four application interfaces:

    • Explorer: An environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail). Its functions are pre-processing, attribute selection, visualization, classification, and clustering.

    • Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.

    • Knowledge Flow: This environment supports essentially the same functions as the Explorer, but with a drag-and-drop interface. One benefit is that it supports incremental learning. It enables visual design of the data flow.

    • Simple CLI: A command-line interface that allows direct execution of Weka commands, for operating systems that do not provide their own command line (an example command is shown below).
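    For example, the following command runs SimpleKMeans on a training file from the Simple CLI (or any shell with weka.jar on the classpath); the dataset file name is an illustrative assumption:

```
java weka.clusterers.SimpleKMeans -t diabetes.arff -N 2
```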

      Features of Weka Tool:

      Weka provides a GUI through which its various applications are available. Some of the features of the Weka tool are:

    • Data Pre-processing

    • Data Classification

    • Data Clustering

    • Attribute Selection

    • Data Visualization

    Advantages of Weka

    • It is freely available under the GNU General Public License.

    • It is portable: being fully implemented in the Java programming language, it runs on almost any architecture.

    • It is easy to use thanks to its graphical user interface.

    • It offers a large collection of data pre-processing and modelling techniques.

    • It is usable by people who are not data mining specialists.

  5. DATASET

    Performing the comparative analysis requires a benchmark dataset. For this research, the Pima_diabetes dataset was taken from the official WEKA website. The dataset includes 768 instances and 8 attributes plus a class attribute. The class has two values, class 0 and class 1, and there are no missing values in the dataset. Three clustering methods have been applied to this dataset, which can be loaded directly into the data mining tool. Each instance belongs to a class, and on the basis of this class the clusters generated by the algorithms are evaluated through the WEKA interface. WEKA is a landmark system in the history of the data mining and machine learning research communities.

    Pima_diabetes dataset: the diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes (1 is interpreted as tested positive, 0 as tested negative).

    Fig 2. Weka Explorer With Dataset Results

    Figure 2 shows the Weka Explorer results for the dataset: the relation name, the number of instances, the attributes, and the visualization. In this result the Class attribute is selected, and its information is shown in the Selected Attribute panel. The Distinct field shows that the Class attribute has 2 distinct values, with 500 instances tested negative and 268 tested positive; the Type field gives the type of the selected attribute.
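    The Explorer's "Classes to clusters" evaluation mode used in the experiments below can be reproduced from Java code roughly as follows. This is a sketch assuming the standard WEKA 3 API: the class attribute is removed before training (clusterers do not use labels), and the resulting clusters are then compared against the known class values. The file name is illustrative.

```java
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClassesToClusters {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);     // class attribute is last

        // Train on a copy of the data with the class attribute removed.
        Remove rm = new Remove();
        rm.setAttributeIndices("" + (data.classIndex() + 1));
        rm.setInputFormat(data);
        Instances train = Filter.useFilter(data, rm);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);
        km.buildClusterer(train);

        // Compare the clusters against the known class labels.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}
```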

      1. K-MEANS CLUSTERING

        K-Means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. It is one of the most popular classical clustering methods and is easy to implement. Its aim is to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

        The method is called K-Means since each of the K clusters is represented by the mean of the objects (called the centroid) within it. It is also called the centroid method, since at each step the centroid point of each cluster is assumed to be known and each of the remaining points is allocated to the cluster whose centroid is closest to it.

        Once allocation is completed, the centroids of the clusters are recomputed as simple means, and the process of allocating points to each cluster is repeated until there is no change in the clusters.

        The K-Means clustering method commonly uses one of two distance functions:

        1. Euclidean Distance Function: In mathematics, the Euclidean distance is the ordinary distance between two points that one would measure with a ruler, and is given by the Pythagorean formula.

        2. Manhattan Distance Function: The Manhattan distance function computes the distance that would be travelled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between two items is the sum of the differences of their corresponding components (see the formulas below).
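        For two n-dimensional points x = (x1, ..., xn) and y = (y1, ..., yn), the two distances can be written as:

```latex
d_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
\qquad
d_{\mathrm{Manhattan}}(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert
```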

          K-Means Algorithm

          1. Select the number of clusters. Let the number be k.

          2. Pick k seeds as centroid of the k clusters. The seeds may be picked randomly unless the user has some insight into the data.

          3. Compute the Euclidean distance of each object in the data set from each of the centroids.

          4. Allocate each object to the cluster it is nearest to based on the distances computed in the previous step.

          5. Compute the centroid of the clusters by computing the means of the attribute values of the objects in each cluster.

          6. Check whether the stopping criterion has been met (e.g., the cluster memberships no longer change). If yes, go to step 7; if no, go to step 3.

          7. [Optional] One may decide to stop at this stage, or to split a cluster or combine two clusters heuristically until a stopping criterion is met. (A minimal Java sketch of these steps follows.)
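          The following is a minimal, self-contained Java sketch of the steps above (not WEKA's implementation). The random choice of seeds and the toy data in main are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {

    // Squared Euclidean distance between two points (step 3; squaring
    // does not change which centroid is nearest).
    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the cluster index assigned to each point.
    static int[] kMeans(double[][] data, int k, long seed) {
        int n = data.length, dim = data[0].length;
        Random rnd = new Random(seed);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {              // step 2: pick k random seeds
            centroids[c] = data[rnd.nextInt(n)].clone();
        }
        int[] assign = new int[n];
        boolean changed = true;
        while (changed) {                          // step 6: repeat until stable
            changed = false;
            for (int i = 0; i < n; i++) {          // steps 3-4: nearest centroid
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist(data[i], centroids[c]) < dist(data[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            for (int c = 0; c < k; c++) {          // step 5: recompute means
                double[] mean = new double[dim];
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (assign[i] == c) {
                        count++;
                        for (int j = 0; j < dim; j++) mean[j] += data[i][j];
                    }
                }
                if (count > 0) {
                    for (int j = 0; j < dim; j++) mean[j] /= count;
                    centroids[c] = mean;
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] toy = { {1, 1}, {1.2, 0.8}, {8, 8}, {7.9, 8.2} };
        System.out.println(Arrays.toString(kMeans(toy, 2, 42)));  // e.g. [0, 0, 1, 1]
    }
}
```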

          Advantages of K-Means

          • It is the simplest method and is easily understandable.

          • If clusters are globular, K-Means may produce tighter clusters.

          • It may consume less time to build the model.

          Disadvantages of K-Means

          • It is not well suited to clusters of different sizes.

          • It does not work properly with noisy data and outlier points.

        Results of K-Means

        Fig. 3 K-Means Algorithm

        Fig. 4 Result of K-Means Algorithm

        The Visualize Cluster Assignments tab shows a 2-D plot of the attribute values on the x-axis and y-axis and gives detailed information about every instance (figure 4). The instances in blue are classified as tested negative and the instances in red as tested positive. By fixing the attribute on one axis and varying the other, different plots can be obtained.

        The Jitter function in the plot adds artificial random noise to the coordinates of the plotted points in order to spread the data out a bit, so that points that might otherwise be obscured by others become visible. Instances shown in blue are correctly classified and instances shown in red are incorrectly classified.

        Table 1: Values of K-Means

        No. of iterations: 4
        Time taken to build model: 0.02 seconds
        Within-cluster sum of squared errors: 149.52

        Table 2: Cluster Values

        Cluster 0: 500 instances (65%)
        Cluster 1: 268 instances (35%)

      2. EXPECTATION MAXIMIZATION ALGORITHM (EM)

        The EM algorithm is another important data mining algorithm. Expectation Maximization (EM) is an iterative method for finding maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current parameter estimates, and a maximization (M) step, which computes the parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

        EM assigns to each instance a probability distribution indicating its probability of belonging to each of the clusters. EM can decide how many clusters to create by cross-validation, or the user may specify a priori how many clusters to generate. The cross-validation procedure works as follows (a Java sketch follows the steps):

        1. The number of clusters is set to 1.

        2. The training set is split randomly into 10 folds.

        3. EM is performed 10 times using the 10 folds in the usual cross-validation way.

        4. The log likelihood is averaged over all 10 results.

        5. If the log likelihood has increased, the number of clusters is increased by 1 and the program continues at step 2.
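        In WEKA this behaviour is obtained by leaving the number of clusters unspecified. A minimal sketch, assuming the standard weka.clusterers.EM API; the file name is illustrative:

```java
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EMDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        EM em = new EM();
        em.setNumClusters(-1);   // -1: choose the number of clusters by cross-validation
        em.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());  // clusters and log likelihood
    }
}
```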

        Advantages of the EM Technique

        • Extremely useful for real-world datasets.

        Disadvantages

        • Highly complex in nature.

        Results of EM

          Fig. 5 EM Technique

          Fig. 6 Results of EM Technique

          The Visualize Cluster Assignments tab shows a 2-D plot of the attribute values on the x-axis and y-axis and gives detailed information about every instance (figure 6).

          Table 3: Statistical Values

          No. of clusters selected by cross-validation: 3
          Time taken to build model (full data model): 10.02 seconds
          Log likelihood: 24.97

          Table 4: Cluster Instances

          Cluster 0: 228 instances (30%)
          Cluster 1: 203 instances (26%)
          Cluster 2: 337 instances (44%)

      3. DBSCAN

        DBSCAN is the most popular density-based clustering method. In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set, and objects in the sparse areas that separate clusters are usually considered noise or border points.

        In contrast to many newer methods, it features a well-defined cluster model called density reachability. Another interesting property of DBSCAN is that its complexity is fairly low: it requires only a linear number of range queries on the database, and it discovers essentially the same results in each run, so there is no need to run it multiple times.

        DBSCAN Algorithm

        1. Arbitrarily select a point p.

        2. Retrieve all points density-reachable from p with respect to Eps and MinPts.

        3. If p is a core point, a cluster is formed.

        4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.

        5. Continue the process until all of the points have been processed. (A sketch of running DBSCAN through WEKA follows.)
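        A sketch of running DBSCAN through WEKA with the parameter values used in the experiments below (Epsilon 0.09, MinPts 6). Note that the class name is an assumption: WEKA 3.6 ships it as weka.clusterers.DBScan, while newer releases provide it through an optional add-on package.

```java
import weka.clusterers.DBScan;  // class name as in WEKA 3.6; newer versions use an add-on package
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DBScanDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        DBScan db = new DBScan();
        db.setEpsilon(0.09);   // neighbourhood radius Eps
        db.setMinPoints(6);    // MinPts: minimum points to form a dense region
        db.buildClusterer(data);
        System.out.println(db);  // clusters found and unclustered (noise) instances
    }
}
```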

        Advantages

        • DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to K-Means.

        • DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by a different cluster.

        • DBSCAN has a notion of noise and is robust to outliers.

          Disadvantage

        • DBSCAN does not deal well with high-dimensional data.

    Results of DBSCAN

    Fig. 7 DBSCAN Algorithm

    Fig. 8 Result of DBSCAN Algorithm

    Table 5: Calculations of DBSCAN

    Epsilon: 0.09
    Min points: 6
    No. of generated clusters: 2
    Time taken to build model: 0.28 seconds

    Table 6: Cluster Values

    Cluster 0: 268 instances (35%)
    Cluster 1: 500 instances (65%)

  6. COMPARISON RESULTS OF DIFFERENT ALGORITHMS

    Parameter          | K-Means   | EM Technique       | DBSCAN
    Time               | 0.02 sec. | 10.02 sec.         | 0.28 sec.
    Ratio              | 65:35     | 30:26:44           | 35:65
    Scans / iterations | 4         | No scan is defined | One scan
    Cluster model      | Centroid  | Statistical        | Density
    Number of clusters | 2         | 3                  | 2
    Noise handling     | No        | No                 | Yes

  7. CONCLUSION

    In the past few years, data mining techniques for analyzing data have come to cover every area of our lives; they are used in medicine, banking, insurance, education, etc. The main aim of this paper is to provide a detailed introduction to Weka's clustering algorithms. Weka is a data mining tool mainly used for data analysis, and the working of its clustering algorithms has been shown with the help of figures. Every clustering algorithm has its own importance, and various parameters have been used for the comparison of the clustering algorithms. Summing up the results: the K-Means clustering algorithm takes less time to build the model than EM and DBSCAN; the class ratio is classified more correctly by the K-Means and DBSCAN algorithms than by the EM technique; and the total number of scans in the DBSCAN algorithm is the minimum, i.e., 1. The number of clusters can be selected by the algorithm itself, but the user can also select it. For the sample dataset, pima_diabetes, the results of the K-Means algorithm are more accurate than those of EM and DBSCAN, because K-Means takes less time and classifies the ratio correctly; moreover, there are no missing values, which would create difficulty for the clustering techniques. Weka is a very suitable tool for data mining applications because no deep knowledge of the algorithms is needed to use it. This paper has shown only the clustering operations of Weka and a comparison of its clustering algorithms.

