 Open Access
 Authors : Swayanshu Shanti Pragnya
 Paper ID : IJERTV6IS060314
 Volume & Issue : Volume 06, Issue 06 (June 2017)
 DOI : http://dx.doi.org/10.17577/IJERTV6IS060314
 Published (First Online): 27-07-2017
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Clustering: Review on Partitioned Clustering Algorithms
Swayanshu Shanti Pragnya
School of Computer Science and Engineering Centurion University
Jatni, Khurdha Road, 752050
Abstract: – At present, big data is one of the most challenging, active and sought-after research subjects, and many of its analytical questions remain open. Clustering is one of the basic analysis techniques supporting big data analytics. Big data is a term that identifies datasets which are large in size, and it introduces its own statistical and computational challenges. This paper studies the problems related to big data and clustering, and compares the FCM and K-means algorithms.
Keywords: Big data, clusters, partitioned clustering, clustering algorithms, FCM, K-means.
INTRODUCTION:
In the software field, data analysis is a vital and significant tool for processing voluminous data. To store, extract, insert and modify data precisely and easily, we need suitable techniques. Big data is big not only in size; from variety to veracity it is vastly enlarged, so special techniques are required to analyse it. Clustering is a technique that segregates data according to a chosen criterion. Data is a collection of small items, each carrying some valid meaning distinct from the others. Over the years, the use of data stored in databases has increased; data are collected and analysed in order to make suitable decisions. A database is an organized collection of data, so that data can be easily accessed, managed, updated and inserted.
Big Data:
Data consists of facts, raw or organized, gathered for use. Any block of information that is meaningful and can be used further can be defined as data. Big data is a collection of very large data sets. Nowadays big data is becoming the main focus in almost every domain.

CHARACTERISTICS OF BIG DATA:
Previously big data was characterized by 3 Vs, but at present 2 more characteristics are included, i.e. value and veracity.

Variety: Data comes in structured, unstructured, semi-structured, probabilistic, multi-factor or statistical form; basically a collection of different types of data.

Value: Value plays a vital role in the case of big data: data must have some statistical or hypothetical importance with a proper meaning.

Velocity: Big data are most often generated at high speed, e.g. by sensor arrays or streams of events, and need to be processed in near real time or in batches.

Veracity: This dimension of big data includes 2 aspects, i.e. consistency (certainty) and trustworthiness (authenticity).

Volume: In big data, volume refers to the size, scale, dimension and amount of data or data sets, e.g. large numbers of records, tables and file transactions.
LIMITATIONS IN BIG DATA:

Heterogeneity and Incompleteness: Many types of data sets are combined together, which leads to heterogeneity, and working with heterogeneous data is more challenging.

Security and Privacy Challenges: This is a major problem in big data. When personal information is combined at such a large scale, preserving the privacy of the data becomes complex, and quality decreases through improper utilization.

Fault Tolerance: Two methods that increase fault tolerance in big data are: (i) divide the whole computation into tasks and assign these tasks to different nodes for computation; (ii) assign one node the work of observing that these nodes are working properly, and restart a task if something goes wrong. [1] Sometimes, however, the whole computation cannot be divided into independent tasks, which limits fault tolerance.

Endogeneity: This refers to a feedback loop in a system; we can model and predict ants or algae far better than we can model and predict markets. Clustering is one of the approaches used to address big data issues.

CLUSTERING:
Fig 2: Clustering example [2]
Cluster: A cluster can be defined as a group of similar or very similar objects. In a computer system, a group of servers and other resources that act like a single system is also known as a cluster.

Clustering is an unsupervised data-mining technique: it groups similar objects together and separates dissimilar ones.

In Fig 2 we find 4 clusters into which the data can easily be divided. Here clusters are formed according to the nearest distance between objects, known as distance-based clustering; objects are grouped according to how well they fit a descriptive concept, not by simple similarity measures alone. [2]

TYPES OF CLUSTERING:

Partitioned Clustering:
Partitioned algorithms divide or classify the data set into mutually disjoint partitions.
Advantages:

Simple and scalable

Suitable for almost all data sets

Produces well-shaped spherical clusters
Disadvantages:

Cluster descriptors are very poor

The initialization phase is very sensitive

Difficulty in dealing with non-convex clusters of varying density and size

Degradation in high-dimensional spaces

Entrapment in local optima is very frequent

Hierarchical Clustering:
Hierarchical clustering algorithms divide a dataset into a hierarchy of clusters. This can be done by 2 approaches: top-down (divisive), which is more complex but more efficient and accurate, and bottom-up (agglomerative), which is simpler but less efficient.
Advantages:

Well suited to problems involving point linkages, e.g. taxonomy trees

Flexibility regarding the level of granularity
Disadvantages:

Corrections cannot be made after a split or merge

Interpretability of the cluster descriptors is poor

Expensive for massive datasets

Effectiveness degrades in high-dimensional spaces due to the dimensionality phenomenon

Density-Based Clustering:
Density-based clustering discovers clusters of arbitrary shape, together with the noise, in a spatial database. It basically uses 2 parameters: epsilon (a neighbourhood radius) and the minimum number of points per cluster.
Advantages:

No need to specify the number of clusters in advance

Easily handles clusters of arbitrary shape
Disadvantages:

Does not handle data points with varying densities well

Results depend on the distance measure [2]

The major benefits of clusters are high performance, scalability and system availability; the major issues are scalable performance (scaling of resources), availability support, and fault tolerance and recovery (cluster machines are designed to eliminate all single points of failure). Basically, clustering is a process of segmenting the similar from the dissimilar.
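The epsilon/minimum-points idea described above can be sketched as a minimal density-based (DBSCAN-style) routine. This is an illustrative pure-Python sketch, not code from [2]; the function name `density_cluster` and the noise label -1 are my own conventions.

```python
def density_cluster(points, eps, min_pts):
    """Minimal DBSCAN-style sketch: clusters grow from points that have at
    least min_pts neighbours within radius eps; sparse points become noise (-1)."""
    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # not dense enough: mark as noise for now
            continue
        labels[i] = cluster         # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster # noise reachable from a core point becomes a border point
                continue
            if labels[j] is not None:
                continue            # already assigned to a cluster
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels
```

Note how the number of clusters is never passed in: it emerges from the density structure, which is exactly the advantage claimed above.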
Clustering can be done by following certain algorithms. These algorithms depend on the following properties:

Type of attributes handled by the algorithm

Complexity

Size of the data and database

Detection of outliers, i.e. finding objects very different from the rest

Dependency on the ordering of tuples in the database

Choice of the initial clusters

Calculation of centers
Clustering Benefits:

In the case of a Linux clustering server:

It is a completely scalable solution; addition of resources is allowed.

If a cluster server needs maintenance, it can be stopped while its load is handed over to the other servers.

It is reliable and easy to configure.

In the case of Oracle:

The value of a clustered key is stored only once, which reduces memory.

Query retrieval is very fast.

Server clustering:

If one server has a problem that cannot be solved, the cluster takes over its work without affecting the other servers.

Resource addition is easy.

SQL Server:

Only one instance runs over the server.

Hardware usage is full.

Limitations:
The potential problems with clustering are:

Distance measure identification: For numerical attributes, choosing a suitable distance measure, such as a member of the Minkowski distance family, is difficult.

Number of clusters: If the class labels are unknown, finding the number of clusters is not easy.

Lack of class labels.

Database structure: Real-life databases do not always have well-structured tuples, which leads to missing data. If the data cannot be structured properly, then clarity and completeness cannot be achieved.

Different attributes in the database: A database may not contain exclusively numerical or exclusively categorical attributes.

High cost: A cluster needs proper hardware and a design, which is costly compared to non-clustered server management. Clustering needs more hardware and servers; monitoring and maintenance are hard, which increases the infrastructure cost.
In the case of Oracle:

It takes longer to update records when the fields in the clustering index are changed.

Without careful construction of the clustering index, there is a risk that many insertions will happen on almost the same clustering index value.

Cluster software may fail in a subsystem in a way that would not occur in standalone operation.

Recovering from an unanticipated software or hardware state is difficult.

Management tasks leave more room for operator error.

SQL Server:

Only one instance runs, which leaves the other servers idle, so the investment is not fully utilized.

Any additional configuration needs more licenses.

If failover occurs and only one server is running, performance is adversely affected.



K-MEANS ALGORITHM:
By referring to [2] we know that in the K-means algorithm we first choose the desired number of clusters and an initial set of points, then iterate over classification, centroid calculation and a convergence test. This algorithm is typically used when the number of variables is large.
Algorithm:

Step 1: Set K. Choose the number of desired clusters, K.

Step 2: Initialization. Choose K starting points, used as the initial estimates of the cluster centroids: μi = some value, i = 1, …, K.

Step 3: Classification. Examine each point in the dataset and assign it to the cluster whose centroid is nearest to it: ci = { j : d(xj, μi) ≤ d(xj, μl), l ≠ i, j = 1, …, n }.

Step 4: Centroid calculation. When each point in the data set has been assigned to a cluster, recalculate the new K centroids: μi = (1/|ci|) Σ_{j∈ci} xj, for each i.

Step 5: Convergence criteria. Repeat steps 3 and 4 until no point changes its cluster assignment, or until the centroids no longer move. [7]

K-Means strengths:

Computationally faster execution when the number of variables is large.

K-means produces tighter clusters, especially when the clusters are globular.

Time complexity is low, so computation is faster.

Weaknesses:

Predicting the value of K is difficult.

It does not work well for non-globular clusters.

Different initial partitions can result in different final clusters.
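Steps 1-5 above can be sketched as a minimal K-means for 2-D points. This is an illustrative implementation, not code from [2] or [7]; the random initialization in step 2 is one common choice among several.

```python
import random

def kmeans(points, k, max_iter=100):
    """Minimal K-means sketch of steps 1-5 above for 2-D points."""
    # Step 2: choose k starting points as the initial centroid estimates.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Step 4: recalculate each centroid as the mean of its cluster.
        new_centroids = []
        for i, cl in enumerate(clusters):
            if cl:
                new_centroids.append((sum(p[0] for p in cl) / len(cl),
                                      sum(p[1] for p in cl) / len(cl)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Step 5: converged when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

The weakness noted above (different initial partitions giving different final clusters) is visible here: changing the random seed changes the `random.sample` initialization and possibly the result.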

FCM (FUZZY C-MEANS) ALGORITHM:
FCM is a clustering method that allows a single piece of data to belong to more than one cluster, and is often used in pattern recognition. It is based on minimizing the following objective function:

Jm = Σ_{i=1}^{n} Σ_{j=1}^{c} uij^m ||xi − cj||²,  1 ≤ m < ∞,

where m is any real number greater than 1, uij is the degree of membership of xi in cluster j, xi is the i-th d-dimensional measured data point, cj is the d-dimensional center of cluster j, and ||·|| is any norm expressing the similarity between a measured data point and a center. Fuzzy partitioning is carried out by iterative optimization of the objective function above, updating the memberships uij and the cluster centers cj by:

uij = 1 / Σ_{k=1}^{c} ( ||xi − cj|| / ||xi − ck|| )^{2/(m−1)},   cj = Σ_{i=1}^{n} uij^m xi / Σ_{i=1}^{n} uij^m.

The memberships are subject to the constraints uij ∈ [0, 1] and Σ_{j=1}^{c} uij = 1 for i = 1, …, n; the update equations follow from solving this constrained optimization with Lagrange multipliers. The iteration stops when

max_{ij} | uij^{(k+1)} − uij^{(k)} | < ε,

where ε is a termination criterion between 0 and 1 and k is the iteration step. The procedure converges to a local minimum of Jm.

Algorithm:

Step 1: Initialize the membership matrix U = [uij], U(0).

Step 2: At step k, calculate the center vectors C(k) = [cj] using U(k).

Step 3: Update U(k) to U(k+1) using the membership equation above.

Step 4: If ||U(k+1) − U(k)|| < ε then STOP; otherwise return to step 2. [8]

FCM strengths:

Gives the best results for overlapped data sets, and is comparatively better than the K-means algorithm in this case.

Unlike K-means, where each data point belongs exclusively to one cluster center, here each data point is assigned a membership to every cluster center, so a data point may belong to more than one cluster.

Weaknesses:

The number of clusters must be specified a priori.

A lower value of ε gives a better result, but at the expense of more iterations.

Euclidean distance measures can unequally weight underlying factors.

It does not work well when clusters differ in size and density.
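The FCM update equations can be sketched in Python as follows. This is an illustrative pure-Python implementation of the standard update rules, not code from [3] or [8]; the random initialization of U and the small floor of 1e-12 on distances (to avoid division by zero when a point coincides with a center) are my own choices.

```python
import random

def fcm(points, c, m=2.0, eps=1e-4, max_iter=100):
    """Minimal Fuzzy C-Means sketch of the objective and update equations above."""
    n, d = len(points), len(points[0])
    # Step 1: initialise U = [u_ij] randomly, each row summing to 1.
    U = []
    for _ in range(n):
        row = [random.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])

    def dist(p, q):
        return sum((pa - qa) ** 2 for pa, qa in zip(p, q)) ** 0.5

    for _ in range(max_iter):
        # Step 2: centers c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        centers = []
        for j in range(c):
            w = [U[i][j] ** m for i in range(n)]
            tw = sum(w)
            centers.append([sum(w[i] * points[i][a] for i in range(n)) / tw
                            for a in range(d)])
        # Step 3: u_ij = 1 / sum_k (||x_i - c_j|| / ||x_i - c_k||)^(2/(m-1))
        newU = []
        for i in range(n):
            ds = [max(dist(points[i], centers[j]), 1e-12) for j in range(c)]
            newU.append([1.0 / sum((ds[j] / ds[k]) ** (2.0 / (m - 1))
                                   for k in range(c)) for j in range(c)])
        # Step 4: stop when the largest membership change is below eps.
        delta = max(abs(newU[i][j] - U[i][j]) for i in range(n) for j in range(c))
        U = newU
        if delta < eps:
            break
    return centers, U
```

Each row of U sums to 1 but is spread over all c clusters, which is exactly the "soft" membership that distinguishes FCM from K-means.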

COMPARISON BETWEEN K-MEANS AND FCM ALGORITHMS:

The time complexity of K-means is O(ncdi) and the time complexity of FCM [3] is O(ndc²i), where n is the number of data points, c the number of clusters, d the number of dimensions and i the number of iterations.
TIME COMPLEXITY OF K-MEANS AND FCM WHEN THE NUMBER OF CLUSTERS VARIES

Number of Clusters | K-Means Time Complexity | FCM Time Complexity
1                  | 6000                    | 6000
2                  | 12000                   | 24000
3                  | 18000                   | 54000
4                  | 24000                   | 96000

Fig 5: TC of K-Means and FCM by varying the number of clusters [4]
Fig 6: TC of K-Means and FCM by varying the number of iterations [4]
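The gap in the table above follows directly from the two complexity expressions: K-means is linear in the cluster count c while FCM is quadratic. As a rough illustration (the constant factor n·d·i = 6000 is an assumption inferred from the table's first row, not a figure stated in [3] or [4]):

```python
# Cost models from the stated complexities, with the combined
# constant n*d*i fixed at 6000 (assumed, see lead-in).
N_D_I = 6000

def kmeans_cost(c):
    return N_D_I * c        # O(n*c*d*i): grows linearly with the cluster count

def fcm_cost(c):
    return N_D_I * c * c    # O(n*d*c^2*i): grows quadratically with the cluster count

for c in range(1, 5):
    print(c, kmeans_cost(c), fcm_cost(c))
```

At c = 4 clusters the quadratic term already makes FCM four times as expensive as K-means, matching the last row of the table.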
After studying paper [4], we came to know that of the two algorithms, K-means is comparatively better across different numbers of clusters. The time complexity of the K-means algorithm is less than that of FCM, so K-means is more preferable.
Analysis after referring to a number of papers:

K-Means Algorithm

It was extremely fast, faster than FCM, on all datasets, whether the clusters were scattered in regular or irregular patterns. [5]

In the case of image clustering it produces fairly high accuracy and requires less computation. [6]

Considering computation time, it is the best. [7]

It produces fairly high accuracy and requires less computation. [8]

After applying normalization, this clustering algorithm forms clusters in less time and with more accuracy than other algorithms on bank documents using the WEKA tool. [9]

Fuzzy C-Means Algorithm

It is an algorithm based on iterative fuzzy calculations, so its execution time was found to be comparatively higher, as expected. [5]

It produces nearly similar results but requires more computation. [6]

It should be improved in terms of its computation time. [7]

More computation time is required than for K-means because of the fuzzy-measure calculations involved in the algorithm [8], and it is less accurate. [9]
CONCLUSION
In this paper we have given an overview of big data concepts and limitations, clustering, types of clustering and clustering algorithms. Finally, we have studied partitioned clustering and its 2 main algorithms, and compared the FCM and K-means algorithms on the basis of several points drawn from the literature. From the referenced papers we conclude that both algorithms have merits and demerits; between the two, FCM has more demerits than K-means, owing to its computational cost and the complexity of solving it.
REFERENCES:

[1] A Survey on Clustering Techniques in Medical Diagnosis.
[2] Juilee Anil Katkar and Trupti Baraskar, "A Review: Clustering Techniques for Medical Image Segmentation," Department of Information Technology, Maharashtra Institute of Technology, Pune University, Pune 411048.
[3] A. Rui and J. M. C. Sousa, "Comparison of fuzzy clustering algorithms for classification," International Symposium on Evolving Fuzzy Systems, 2006, pp. 112-117.
[4] Soumi Ghosh, "Comparative Analysis of K-Means and Fuzzy C-Means Algorithms," Department of Computer Science and Engineering.
[5] Zeynel Cebeci and Figen Yildiz, "Comparison of K-Means and Fuzzy C-Means Algorithms on Different Cluster Structures."
[6] Poonam Fauzdar and Sujata Kindri, "Comparitive Analysis Of K Means And Fuzzy C Means Algorithm," International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 2, Issue 6, June 2013.
[7] Tejwant Singh and Manish Mahajan, "Performance Comparison of Fuzzy C Means with Respect to Other Clustering Algorithms," Dept. of IT, Chandigarh Engineering College, Mohali, India.
[8] Bharati R. Jipkate and V. V. Gohokar, "A Comparative Analysis of Fuzzy C-Means Clustering and K-Means Clustering Algorithms."
[9] Raj Bala and Sunil Sikka, "A Comparative Analysis of Clustering Algorithms," Amity University, Haryana, India.
Author Profile
Swayanshu Shanti Pragnya is currently in the 4th year of computer science engineering at Centurion University of Technology and Management (2014-18). During 2016-17 she researched big data, big data issues, clustering algorithms, partitioned clustering algorithms and FCM. She published a paper on digital image processing in the IJSRD journal in 2016.