 Open Access
 Total Downloads : 135
 Authors : D Veeraiah, Dr. D Vasumathi
 Paper ID : IJERTV3IS20653
 Volume & Issue : Volume 03, Issue 02 (February 2014)
Published (First Online): 24-02-2014
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Link Concerning Various Clusters Using Hierarchical Clustering Techniques
1 D Veeraiah, 2 Dr. D Vasumathi
1 Research Scholar, Dept of CSE, JNTUK, Kakinada, Andhra Pradesh, India
2 Professor, Department of CSE, JNTUCEH, Hyderabad, Andhra Pradesh, India
Abstract — The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Hierarchical clustering analysis is a procedure for finding relationships among various clusters; it works by grouping data objects into a tree of clusters. These methods can be classified as agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion. The main aim of this paper is to examine the connection among different clusters using the single link and complete link techniques with Euclidean distance and Manhattan distance, together with Ward's method and the centroid method. The clustering produced by Ward's method is different from those produced by single link, complete link, and group average. In Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged, and the centroid method calculates the proximity between two clusters as the distance between the centroids of the clusters.
Fig. 1.1: Nested Diagram

Keywords — Euclidean, single link, complete link, group average, centroid, proximity

INTRODUCTION
Hierarchical clustering techniques form the second important category of clustering methods. There are two basic approaches for generating a hierarchical clustering:

Agglomerative

Divisive
Agglomerative methods start with the points as individual clusters and, at each step, merge the closest pair of clusters; divisive methods start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain.
A hierarchical clustering is often displayed graphically using a tree-like diagram called a dendrogram, which shows both the cluster–subcluster relationships and the order in which the clusters were merged. A hierarchical clustering can also be represented graphically using a nested cluster diagram.
The nested diagram of Fig. 1.1 gives an example of these two types of figures for a set of four two-dimensional points, clustered using the single link technique.

Basic Agglomerative Hierarchical Clustering Algorithm:

1. Compute the proximity matrix, if necessary.
2. Repeat:
3. Merge the closest two clusters.
4. Update the proximity matrix to reflect the proximity between the new cluster and the remaining clusters.
5. Until only one cluster remains.
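The loop above can be sketched in Python. This is a deliberately naive O(n³) illustration of ours (the paper gives only pseudocode); a real implementation would update the proximity matrix incrementally instead of rescanning all point pairs at every step.

```python
# A minimal sketch of the basic agglomerative algorithm, operating on a
# precomputed proximity (distance) matrix. `linkage` selects the cluster
# proximity: min gives single link (MIN), max gives complete link (MAX).

def agglomerative(dist, linkage=min):
    clusters = [[i] for i in range(len(dist))]   # every point starts as a singleton
    merges = []                                  # (cluster, cluster, distance) per step
    while len(clusters) > 1:
        best = None
        # scan all pairs of current clusters for the closest one
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]                          # drop the absorbed cluster
    return merges
```

Passing min reproduces single link and passing max reproduces complete link; the returned merge list is exactly the information a dendrogram displays.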




Defining Proximity Between Clusters
The key operation of the agglomerative hierarchical clustering algorithm is the computation of the proximity between two clusters, and it is the definition of cluster proximity that differentiates the various agglomerative techniques. Cluster proximity is typically defined with a particular type of cluster in mind. For example, agglomerative hierarchical clustering techniques such as MIN, MAX, and group average come from a graph-based view of clusters. MIN defines cluster proximity as the proximity between the closest two points that are in different clusters, or, using graph terms, the shortest edge between two nodes in different subsets of nodes. This yields contiguity-based clusters, in which each point is closer to at least one point in its cluster than to any point in another cluster.
Alternatively, MAX takes the proximity between the farthest two points in different clusters to be the cluster proximity, or, using graph terms, the longest edge between two nodes in different subsets of nodes. Another graph-based approach, the group average technique, defines cluster proximity as the average pairwise proximity of all pairs of points from different clusters.
Fig. 1.2.1: MIN (Single Link)
Fig. 1.2.2: MAX (Complete Link)
Fig. 1.2.3: Group Average
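The three graph-based proximities follow directly from these definitions. A short sketch of ours (the function names are our own), taking a distance matrix indexed by point and two clusters given as index lists:

```python
from itertools import product

def single_link(dist, A, B):
    # MIN: distance between the closest pair of points in different clusters
    return min(dist[a][b] for a, b in product(A, B))

def complete_link(dist, A, B):
    # MAX: distance between the farthest pair of points in different clusters
    return max(dist[a][b] for a, b in product(A, B))

def group_average(dist, A, B):
    # average of all pairwise distances between the two clusters
    return sum(dist[a][b] for a, b in product(A, B)) / (len(A) * len(B))
```

All three scan the same set of cross-cluster pairs; only the aggregation (min, max, or mean) changes, which is why they behave so differently on elongated versus globular clusters.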

LITERATURE SURVEY
1. Agglomerative Mean-Shift Clustering, by Xiao-Tong Yuan and Bao-Gang Hu
In this paper, for the purpose of algorithmic speedup, we develop an agglomerative mean-shift (MS) clustering method along with its performance analysis. Our method, namely Agglo-MS, is built upon an iterative query-set compression mechanism which is motivated by the quadratic bounding optimization nature of the MS algorithm. The whole framework can be efficiently implemented in linear running-time complexity. We then extend Agglo-MS into an incremental version which performs comparably to its batch counterpart. The efficiency and accuracy of Agglo-MS are demonstrated by extensive comparative experiments on synthetic and real data sets.

2. A Variational Bayesian Framework for Clustering with Multiple Graphs, by Motoki Shiga and Hiroshi Mamitsuka
Mining patterns in graphs has become an important issue in real applications, such as bioinformatics and web mining. We address a graph clustering problem where a cluster is a set of densely connected nodes, under a practical setting in which 1) the input is multiple graphs which share a set of nodes but have different edges, and 2) a true cluster cannot be found in all given graphs. For this problem, we propose a probabilistic generative model and a robust learning scheme based on variational Bayesian estimation. A key feature of our probabilistic framework is that not only nodes but also the given graphs can be clustered at the same time, allowing our model to capture clusters found in only part of all given graphs. We empirically evaluated the effectiveness of the proposed framework on a variety of synthetic graphs as well as real gene networks, demonstrating that our proposed approach can improve the clustering performance of competing methods on both synthetic and real data.

3. A Link-Based Cluster Ensemble Approach for Categorical Data Clustering, by Natthakan Iam-On, Tossapon Boongoen, Simon Garrett, and Chris Price
Although attempts have been made to solve the problem of clustering categorical data via cluster ensembles, with results competitive with conventional algorithms, it is observed that these techniques unfortunately generate a final data partition based on incomplete information. The underlying ensemble-information matrix presents only cluster–data point relations, with many entries left unknown. The paper presents an analysis suggesting that this problem degrades the quality of the clustering result, and it presents a new link-based approach which improves the conventional matrix by discovering unknown entries through similarity between clusters in an ensemble. In particular, an efficient link-based algorithm is proposed for the underlying similarity assessment. Afterward, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph that is formulated from the refined matrix. Experimental results on multiple real data sets suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster ensemble techniques.


METHODOLOGY
We follow this methodology for finding the connection among the different clusters:

Single Link (MIN)

Complete Link or CLIQUE (MAX)

Group Average Method

Ward's and Centroid Methods
We use sample data consisting of six two-dimensional points:

Point   x coordinate   y coordinate
p1      0.40           0.53
p2      0.22           0.38
p3      0.35           0.32
p4      0.26           0.19
p5      0.08           0.41
p6      0.45           0.30

Fig. 3.1: x and y coordinates of the six points
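From these coordinates, the Euclidean distance matrix of Fig. 3.3 can be recomputed; a sketch of ours follows. Note that the published matrix was evidently computed from higher-precision coordinates, so a few entries (e.g. p3–p6) differ by about 0.01 from what the rounded table values give. Since the abstract also mentions Manhattan distance, both metrics are shown.

```python
from math import hypot

# coordinates of the six sample points from Fig. 3.1
points = {'p1': (0.40, 0.53), 'p2': (0.22, 0.38), 'p3': (0.35, 0.32),
          'p4': (0.26, 0.19), 'p5': (0.08, 0.41), 'p6': (0.45, 0.30)}

def euclidean(p, q):
    # straight-line (L2) distance between two 2-D points
    return hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    # city-block (L1) distance between two 2-D points
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# full Euclidean distance matrix, rounded to two decimals as in Fig. 3.3
dist = {(a, b): round(euclidean(points[a], points[b]), 2)
        for a in points for b in points}
```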
Fig. 3.2: Scatter plot of the six two-dimensional points

      p1     p2     p3     p4     p5     p6
p1   0.00   0.24   0.22   0.37   0.34   0.23
p2   0.24   0.00   0.15   0.20   0.14   0.25
p3   0.22   0.15   0.00   0.15   0.28   0.11
p4   0.37   0.20   0.15   0.00   0.29   0.22
p5   0.34   0.14   0.28   0.29   0.00   0.39
p6   0.23   0.25   0.11   0.22   0.39   0.00
Fig. 3.3: Euclidean distance matrix for the six points

A. Single Link (MIN)
The proximity of two clusters is defined as the minimum of the distances between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest link first, then these single links combine the points into clusters. The single link technique is good at handling non-elliptical shapes, but it is sensitive to noise and outliers.

Fig. 3.1.1: Single Link Clustering

B. Complete Link or CLIQUE (MAX)
The proximity of two clusters is defined as the maximum of the distances between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest link first, then a group of points is not a cluster until all the points in it are completely linked, i.e., until they form a clique. Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors globular shapes.

Fig. 3.2.1: Complete Link Clustering

C. Group Average Method
The proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters. This is an intermediate approach between the single and complete link approaches. Thus, for group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of sizes mi and mj respectively, is expressed by the following equation:

proximity(Ci, Cj) = ( Σ x∈Ci, y∈Cj proximity(x, y) ) / (mi × mj)

Fig. 3.3.1: Group Average
D. Ward's and Centroid Methods
For Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged. Thus, this method uses the same objective function as K-means clustering. While it may seem that this feature makes Ward's method somewhat distinct from other hierarchical techniques, it can be shown mathematically that Ward's method is very similar to the group average method when the proximity between two points is taken to be the square of the distance between them.
Centroid methods calculate the proximity between two clusters as the distance between the centroids of the clusters. This technique may seem similar to K-means, but, as we have remarked, Ward's method is the correct hierarchical analog. These methods also have a characteristic, often considered bad, that is not possessed by the other hierarchical clustering techniques we have discussed: the possibility of inversions. Specifically, two clusters that are merged may be more similar than the pair of clusters that were merged in a previous step. For the other methods, the distance between merged clusters monotonically increases as we proceed from singleton clusters to one all-inclusive cluster.
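Ward's "increase in squared error" has a convenient closed form: merging clusters Ci and Cj of sizes mi and mj increases the total SSE by (mi·mj / (mi+mj)) · ‖ci − cj‖², where ci and cj are the centroids. The sketch below (ours; the function names are our own) checks that shortcut against a direct before-and-after SSE computation on two clusters drawn from the example data.

```python
def centroid(pts):
    # component-wise mean of a list of 2-D points
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def sse(pts):
    # squared error of a cluster about its own centroid
    c = centroid(pts)
    return sum((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for p in pts)

def ward_increase(ci_pts, cj_pts):
    # increase in total SSE caused by merging the two clusters (direct definition)
    return sse(ci_pts + cj_pts) - (sse(ci_pts) + sse(cj_pts))

def ward_closed_form(ci_pts, cj_pts):
    # equivalent shortcut: (mi*mj/(mi+mj)) times the squared centroid distance
    mi, mj = len(ci_pts), len(cj_pts)
    ci, cj = centroid(ci_pts), centroid(cj_pts)
    d2 = (ci[0] - cj[0]) ** 2 + (ci[1] - cj[1]) ** 2
    return mi * mj / (mi + mj) * d2
```

The closed form also makes the connection to the centroid method explicit: Ward's proximity is just the squared centroid distance scaled by a size-dependent factor, which is what prevents the inversions that plain centroid linkage can exhibit.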

RESULT ANALYSIS
We analyze the results based on the sample data, represented in graphical form.
A. Single Link
The distance between the two clusters {3, 6} and {2, 5}, using Euclidean distance, is
dist({3,6},{2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
= min(0.15, 0.25, 0.28, 0.39) = 0.15
Fig. 4.1.1: Graph for Single Link

B. Complete Link
As with single link, points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of {2, 5} or {1}, because

dist({3,6},{4}) = max(dist(3,4), dist(6,4))
= max(0.15, 0.22) = 0.22

dist({3,6},{2,5}) = max(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
= max(0.15, 0.25, 0.28, 0.39) = 0.39

dist({3,6},{1}) = max(dist(3,1), dist(6,1))
= max(0.22, 0.23) = 0.23

Fig. 4.2.1: Graph for Complete Link

C. Group Average
We calculate the distance between some clusters:
dist({3,6,4},{1}) = (0.22 + 0.37 + 0.23)/(3×1) = 0.28
dist({2,5},{1}) = (0.2357 + 0.3421)/(2×1) = 0.2889
dist({3,6,4},{2,5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29)/(3×2) = 0.26

Fig. 4.3.1: Graph for Group Average
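These comparisons can be checked mechanically against the matrix of Fig. 3.3. In the sketch below (ours), indices 0-5 stand for points p1-p6; the group-average value for dist({3,6,4},{1}) comes out 0.27 rather than the text's 0.28 only because the matrix entries are rounded to two decimals.

```python
# Distance matrix from Fig. 3.3 (entry D[i][j] = distance between p(i+1) and p(j+1))
D = [[0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
     [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
     [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
     [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
     [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
     [0.23, 0.25, 0.11, 0.22, 0.39, 0.00]]

def link(cluster_a, cluster_b, agg):
    # cluster proximity under a chosen aggregator (min, max, or mean)
    return agg([D[i][j] for i in cluster_a for j in cluster_b])

mean = lambda xs: sum(xs) / len(xs)

# {3,6} -> indices [2,5]; {2,5} -> [1,4]; {4} -> [3]; {1} -> [0]
complete_36_4  = link([2, 5], [3], max)      # 0.22, so {3,6} merges with {4}
complete_36_25 = link([2, 5], [1, 4], max)   # 0.39
complete_36_1  = link([2, 5], [0], max)      # 0.23
single_36_25   = link([2, 5], [1, 4], min)   # 0.15, the single link value above
avg_364_1      = link([2, 5, 3], [0], mean)  # group average of {3,6,4} vs {1}
```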
D. Ward's Method
The clustering that is produced is different from those produced by single link, complete link, and group average.
Fig. 4.4.1: Graph for Ward's Method

CONCLUSION
Hierarchical clustering analysis is a procedure for finding relationships among various clusters. It works by grouping data objects into a tree of clusters. These methods can be classified as agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion.
REFERENCES

S.E. Schaeffer, "Graph Clustering," Computer Science Rev., vol. 1, pp. 27-64, 2007.

S. Arora, S. Rao, and U. Vazirani, "Geometry, Flows, and Graph-Partitioning Algorithms," Comm. ACM, vol. 51, no. 10, pp. 96-105, 2008.

D. Beeferman and A. Berger, "Agglomerative Clustering of a Search Engine Query Log," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 407-416, 2000.

M. Bilenko, S. Basu, and R. Mooney, "Integrating Constraints and Metric Learning in Semi-Supervised Clustering," Proc. Int'l Conf. Machine Learning, pp. 81-88, 2004.

I. Davidson and S.S. Ravi, "Using Instance-Level Constraints in Agglomerative Hierarchical Clustering: Theoretical and Empirical Results," Data Mining and Knowledge Discovery, vol. 18, no. 2, pp. 257-282, Apr. 2009.

L. Dragomirescu and T. Postelnicu, "A Natural Agglomerative Clustering Method for Biology," Biometrical J., vol. 33, no. 7, pp. 841-849, Jan. 2007.

M. Fashing and C. Tomasi, "Mean Shift Is a Bound Optimization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 471-474, Mar. 2005.

D. Freedman and P. Kisilev, "Fast Mean Shift by Compact Density Representation," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2009.

K. Fukunaga and L. Hostetler, "The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition," IEEE Trans. Information Theory, vol. 21, no. 1, pp. 32-40, Jan. 1975.

M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.

B. Georgescu, I. Shimshoni, and P. Meer, "Mean Shift Based Clustering in High Dimensions: A Texture Classification Example," Proc. IEEE Int'l Conf. Computer Vision, vol. 1, pp. 456-463, 2003.

S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 73-84, 1998.

A.K. Jain, M.N. Murty, and P. Flynn, "Data Clustering: A Review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

K. Lang, "Newsweeder: Learning to Filter Netnews," Proc. Int'l Conf. Machine Learning, pp. 331-339, 1995.

S. Paris and F. Durand, "A Topological Approach to Hierarchical Segmentation Using Mean Shift," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2007.

L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.

A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice-Hall, 1988.

P. Zhang, X. Wang, and P.X. Song, "Clustering Categorical Data Based on Distance Vectors," J. Am. Statistical Assoc., vol. 101, no. 473, pp. 355-367, 2006.

J. Grambeier and A. Rudolph, "Techniques of Cluster Algorithms in Data Mining," Data Mining and Knowledge Discovery, vol. 6, pp. 303-360, 2002.

K.C. Gowda and E. Diday, "Symbolic Clustering Using a New Dissimilarity Measure," Pattern Recognition, vol. 24, no. 6, pp. 567-578, 1991.

Z. Huang, "Extensions to the K-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, vol. 2, pp. 283-304, 1998.

Z. He, X. Xu, and S. Deng, "Squeezer: An Efficient Algorithm for Clustering Categorical Data," J. Computer Science and Technology, vol. 17, no. 5, pp. 611-624, 2002.

D.H. Fisher, "Knowledge Acquisition via Incremental Conceptual Clustering," Machine Learning, vol. 2, pp. 139-172, 1987.

D. Gibson, J. Kleinberg, and P. Raghavan, "Clustering Categorical Data: An Approach Based on Dynamical Systems," VLDB J., vol. 8, nos. 3-4, pp. 222-236, 2000.

V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS: Clustering Categorical Data Using Summaries," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 73-83, 1999.

A. Gionis, H. Mannila, and P. Tsaparas, "Clustering Aggregation," Proc. Int'l Conf. Data Eng. (ICDE), pp. 341-352, 2005.

N. Nguyen and R. Caruana, "Consensus Clusterings," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 607-612, 2007.

C. Domeniconi and M. Al-Razgan, "Weighted Cluster Ensembles: Methods and Analysis," ACM Trans. Knowledge Discovery from Data, vol. 2, no. 4, pp. 1-40, 2009.

X.Z. Fern and C.E. Brodley, "Solving Cluster Ensemble Problems by Bipartite Graph Partitioning," Proc. Int'l Conf. Machine Learning (ICML), pp. 36-43, 2004.

A. Strehl and J. Ghosh, "Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions," J. Machine Learning Research, vol. 3, pp. 583-617, 2002.
BIOGRAPHIES
D. Veeraiah is a Research Scholar at JNTUK, Kakinada. He holds an M.Tech in CSE from JNTUH and a B.Tech in CSIT from JNTUH. His areas of interest are Data Mining and Network Security.

Dr. D. Vasumathi received her Ph.D. from JNTU Hyderabad. She is currently working as a Professor in the Department of CSE, JNTUCEH. Her areas of interest are Data Mining, Network Security, Ad hoc Networks, and Software Engineering.