 Open Access
 Total Downloads : 98
 Authors : V. Palanisamy, A. Kumarkombaiya
 Paper ID : IJERTV2IS120748
 Volume & Issue : Volume 02, Issue 12 (December 2013)
 Published (First Online): 24122013
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Analysing Pharmaceutical Compounds Based On Cluster Techniques
V. Palanisamy1, A. Kumarkombaiya2
1Professor and Head1, Department of Computer Science and Engineering, Alagappa University
2Assistant Professor, Department of Computer Science, Chikkanna Government Arts College
Abstract
Data mining is the practice of automatically searching large stores of data to discover relations that connect variables in a database. The major issues in data mining research partitioning them into several groups: mining methodology, user interaction, efficiency and scalability. In this paper focus on a performance of clustering techniques quite broadly to refer comparative between BIRCH, CURE, Chameleon algorithm method that allows us to gain insights into the biological actions of chemicals by analyzing large amounts of data.
Keyword: BIRCH, CURE and Chameleon algorithm.

Introduction
Cheminformatics is an area of application which was found the molecular structure of drugs contains groups of atoms like carbon, hydrogen, oxygen and nitrogen (pharmaceutical drug discovery, databases of available chemicals) are connected to gather to form different functional groups. Main goal of this work is grouping of chemical compound structures that have similar functionalities by using clustering technique for grouping between the atoms in the molecular structure and found the searching performance of BIRCH, CURE and Chameleon algorithm.

Clustering Techniques
Clustering is a task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups [8]. This is done by classifying the objects. Owing to huge amount of data collected in databases, cluster analysis becomes a highly active topic in data mining research. The overall process of clustering involves the following steps:

Generate appropriate descriptors for each object in the data set.

Select an appropriate similarity measure.

Use an appropriate clustering method to cluster the data set.

Analyze the results.
In this paper, refer Fig: 1 Structure of Saxagliptin is a proposed compound drugs data set for analysis grouping of data and searching the performance using BIRCH, CURE and chameleon algorithm.
Fig 1 : Chemical Structure of Saxagliptin


Comparing Cluster Method
Agglomerative hierarchical clustering algorithms such as CURE, ROCK, Chameleon to analyse the cluster to merge the grouping of data [1]. CURE measures the similarity of two clusters based on the similarity of the closest pair of the representative points belonging to different clusters, without considering the homogeneity of the two clusters. CURE ignores the information about the aggregate interconnectivity of objects in two clusters [1][6]. ROCK observed the similarity function tends to merge clusters which have disjoint set of attributes. After drawing a random sample from the database, the hierarchical clustering algorithm that employs links is applied to the sample objects and it starts with singleton object as an individual class and progressively merges the clusters based on the goodness criteria determined by the link structure [1][6].
Chameleon uses a graph partitioning algorithm to partition graph knearest graph into a
large number of relatively small sub clusters. The cluster C is partition into sub clusters Ci and Cj. Then it uses an agglomerative hierarchical clustering algorithm that iteratively merges sub clusters based on their similarity. Chameleon determine similarity between each pain of clusters Ci and Cj according to their relative connectivity, RI (Ci , Cj ) and their relative closeness RC(Ci, Cj). It was not applied to high dimensions [6][7].

Birch Algorithm
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [2] an algorithm is an agglomerative hierarchical clustering method which builds a dendrogram called clustering feature tree (CF tree) while scanning the data set to condense information about subcluster of points. It contains two key phases such as i) Scans the database to build an inmemory tree and ii) Applies clustering algorithm to cluster the leaf nodes. Birch handles the task in a very novel manner. It maintains a set of Cluster Features (CF) of the subcluster. The criteria for merging two subclusters are taken from the information provided solely by the set of CFs of the respective subcluster. Each entry in the CF tree represents a cluster of objects and is characterized by a triple feature as (N, LS, SS), where N is the number of data objects in the cluster and LS is the linear sum of the data object and SS is the square sum of the data object in the cluster.
C = n;
Xi= LS
4.1 CF Tree
CF tree based on some threshold value and distant measurement (branching factor) to make a cluster from the tree. Since the parameters of this algorithm include centroid, average distance of all entries, each leaf node can have two pointers previous and next, which can be used to chain all leaf node together for an efficient scan. A leaf node represents a cluster made up of all subclusters represents by its entries. All entries in a leaf node must satisfy the threshold requirements with respect to threshold value, the diameter of the subcluster must be less than threshold value. A CF tree is built dynamically and incrementally as new data objects are inserted. The main task of this algorithm is performance analysis right time and memory space.
Phase 1: Loading the input data, Phase 2: Condensation of CF tree by redefining the threshold, tree size become manageable. Phase 3: Global clustering and Phase 4: Cluster refining this is optional, and requires more passes over the data to refine the results.
It make cluster from this tree based on some threshold value and distant measurement. The next step of the algorithm is to take two random points and find the portion of the clusters cover by the area. The main tasks of algorithm are analysis the performance and cluster the given data set into merged one.
The output file contains the parameters along with Centroid Euclidean intercluster distance measure for the two clusters C1 and C2 with cluster Centroid[3] as
iC
X2 = SS
and
D (C ,C )= (Xi
– Xi
1
)2
iC i
0 i j
i
centroid,i centroid,j
Since the objects are multi dimensional, the summations and exponentiation in the above expressions are dealt with componentwise. The dimensions of LS and SS are the same as that of the object in Cluster.
BIRCH assumes that the distance functions between clusters are so defined that they can be expressed in terms of the parameters encoded as (N, LS, SS) respectively. If CF1 = (N1, LS1, SS1), and CF2 = (N2 , LS2, SS2) are the CF entries of two disjoint subclusters.
The CF entries of the subcluster formed by merging the two disjoin subclusters is:
CF1 + CF2 = (N1 + N2 , LS1 + LS2, SS1 + SS2)
The tree will be stored in the format root of the tree which is a special node with two pointers left and right. It also contains the intermediate node information along with leaf nodes which mainly contains the original data points from the database. BIRCH is a scalable clustering algorithm with respect to the number of objects and good quality of clustering of the data.
Pharmaceutic al compound data
Set from PubChem
Tabulation Based On Atom Type and Its Connectivity
Atom 1
Atom 2
Connected Bond
1
7
1
1
46
1
2
21
1
3
17
1
3
21
1
3
22
1
4
16
1
4
47
1
4
48
1
5
23
1
6
10
1
6
11
1
:
:
:
:
:
:
Atom 1
Atom 2
Connected Bond
1
7
1
1
46
1
2
21
1
3
17
1
3
21
1
3
22
1
4
16
1
4
47
1
4
48
1
5
23
1
6
10
1
6
11
1
:
:
:
:
:
:
Grouping on Connectivities Between Atoms Get Code In Table
Final Clusters
Analyse the data by BIRCH, CURE
and Chameleon algorithm
Table 2: Data set contains Atoms with Connected Bonds
Fig 2: Finding cluster from given data
Saxagliptin chemical compound structure is taken as input into those algorithms [9]. The input to the system is matrix which contains atom number, atom type, and connected atom with bonding.From Fig: 2 represent the methodology of clustering techniques to form the group of atoms in respective way.
From Table 1, the data set contains atom number and atom type then to denote the connected atoms as in single or double bond respectively refer from Table 2. In paper it takes atoms from dataset and pairs are connected atom into the bond as given below.

Grouping of Cluster
From Fig: 3, after grouping the similar codes founded by cluster from the given data set which is based on atom number by connected atoms. This process is continuing until the completion of end of data set.
Atom No
Atom Type
1
O
2
O
3
N
4
N
5
N
6
C
7
C
8
C
9
C
10
C
11
C
12
C
:
:
Atom No
Atom Type
1
O
2
O
3
N
4
N
5
N
6
C
7
C
8
C
9
C
10
C
11
C
12
C
:
:
Cluster Id : 1 OC OH
Cluster ID
: 6
CC CC
Cluster Id:2
Cluster ID : 5
Cluster ID : 3 NC NC NC
Table 1: Data Set contain Atom type
Fig: 3 Grouping of Objects

Performance Factor
The performance of hierarchical BIRCH clustering algorithms is presented in this section. Pharmaceutical data set has been taken into consideration and grouping the data. From the fig: 4 represent an analysis of cluster performance and time taken for required BIRCH algorithm with other CURE and Chameleon algorithm respectively.
and Computer Science Vol .5 No.3 (2012) 229240.
[7] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Publication: ISBN10: 0123814790  ISBN13: 9780123814791, Edition: 3 [8]https://sites.google.com/site/dataclusteringalgorithms/hierar chicalclusteringalgorithm
[9]http://www.ncbi.nlm.nih.gov/pccompound/?term=saxaglipt in.
Time
Performance
Fig 4: Analysis the performance and Time taken for cluster techniques

Conclusion
Clustering is one of the techniques in Data mining for grouping similar objects. In this paper we compared the Hierarchical clustering algorithm to merging the data set from PubChem. The input data is Saxagliptin chemical compound dataset for grouping the data and merge in efficient manner. BIRCH incrementally and dynamically clusters incoming multidimensional metric data points for producing the best quality clustering with a single scan of the data and to find better accuracy results for searching the data rather than the other ones for given datasets.

Reference

Arun K Pujari, Data mining techniques, University Press (India) Private Limited.

T. Zhang, R. Ramakrishnan and M. Livny: BIRCH : An Efficient Data Clustering Method for Very Large Databases. SIGMOD 96 6/96 Montreal, CanadaIQ1996ACM0897917944/96/0006
.[3] Daniel T. Larose , Data Mining Methods and Models, Copyright Â© 2006 John Wiley and Sons, Inc.

David Hand, Heikki Mannila & adhraic Smyth, Principles of Data Mining, MIT Press, 2001

Arun K Pujari, Data mining techniques, University Press (India) Private Limited

Marjan Kuchaki Rafsanjani, Zahra Asghari Varzaneh, Nasibeh Emami Chukanlo, A survey of hierarchical clustering algorithms, The Journal of Mathematics