Comparison of Enhanced DBSCAN Algorithms: A Review

DOI: 10.17577/IJERTCONV5IS01027

Harsh Shinde

Student of Computer Engineering, Atharva College of Engineering, Mumbai University, Mumbai, MH, India

Amruta Sankhe

Prof. of Computer Engineering, Atharva College of Engineering, Mumbai University, Mumbai, MH, India

Abstract: As data containing every type of information becomes ubiquitous, proper processing of such data is essential. To classify spatial data, various clustering algorithms have been invented. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the most effective clustering algorithms for discovering arbitrarily shaped clusters in spatial datasets. Owing to its flexibility and tremendous research potential, DBSCAN is one of the most cited algorithms in the scholarly literature. Because many researchers have experimented with and evaluated the algorithm, several modifications of DBSCAN have been proposed in order to obtain more efficient outcomes or to reduce time complexity. This paper discusses such modified algorithms and how they surpass the original DBSCAN in certain respects. Each algorithm is thoroughly analyzed and critically evaluated, and the algorithms are then compared with respect to various parameters.

Index Terms: Clustering algorithms, DBSCAN, DDCAR, RDD-DBSCAN, FastDBSCAN, DDBSCAN, DSets-DBSCAN

1. INTRODUCTION

Spatial data management requires proper evaluation and processing of information to generate useful and desired data. For such data processing, the technique of Knowledge Discovery from Data (KDD), also known as Data Mining, is widely used. Clustering is a popular data mining process which classifies spatially represented datasets [1]. This classification results in groups of objects possessing similar properties. To classify such data, a number of clustering algorithms have been invented and implemented, partitioning the data into arbitrarily shaped groups of similar objects called clusters. These clustering algorithms help extract useful information, generally in the form of patterns. Density-based clustering techniques are used to mine information from large datasets.

With the emerging trends of big data, density-based clustering algorithms have been widely cited in the scientific literature owing to their tremendous research potential and the many possible enhancements of the original algorithms. Although other algorithms may work well with particular datasets, density-based clustering algorithms are more efficient across widely varying datasets. Among the many density-based clustering algorithms, such as DBSCAN, DENCLUE, DBCLASD, OPTICS, etc. [8][9], DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the most universally acclaimed. DBSCAN and other density-based algorithms are important because they are unaffected by noise points and can handle clusters of various shapes and sizes; DBSCAN can find many clusters that other algorithms would not be able to find. DBSCAN searches for core objects, i.e., objects having dense neighborhoods. It has a wide range of applications, from machine learning libraries to weather analysis and atmospheric science on a national scale [3]. Due to its vast experimental potential, many researchers have cited the algorithm and modified it: some modifications reduce the computational effort, while others reduce the time complexity on substantially varied datasets.

Researchers have been interested in DBSCAN since its proposal, as the algorithm holds significant potential for data-driven applications. With the emerging trends of big data and data mining tools, DBSCAN also delivers effective results for a variety of datasets. Nevertheless, DBSCAN possesses limitations with respect to the variation and the size of the datasets. To address these limitations, several advanced algorithms were developed to overcome the different shortcomings of the original DBSCAN. Some of these modifications increase the effectiveness of the algorithm, while others produce results similar to the original algorithm but with lower time complexity [5].

This paper gives an insight into some recent changes made to the original DBSCAN. These changes contribute to the overall development of the original algorithm and overcome various drawbacks associated with DBSCAN. The algorithms discussed in this paper are: DBSCAN [2], DDCAR [3], RDD-DBSCAN [4], FastDBSCAN [5], DDBSCAN [6] and DSets-DBSCAN [7]. Each algorithm is thoroughly analyzed and critically evaluated. Moreover, a separate comparison is made with regard to certain parameters, to obtain a holistic view of the changes introduced by each algorithm.

The rest of the paper is organized as follows. We discuss the original DBSCAN algorithm in Section 2. In Section 3, we present a summary and elaboration of different algorithms that provide modifications and improvements to the original DBSCAN. Section 4 concludes the paper.

      2. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 [2]. The main purpose of the algorithm is to connect core objects (objects having dense neighborhoods) and their neighbors to form dense regions as clusters. For a set of points in space, the algorithm groups together closely packed points, while points in low-density regions are treated as outliers. Outliers can be used to detect irrelevant information, for example in fraud detection. The algorithm is flexible and dynamic with respect to the data.

The steps of DBSCAN are given in the following points:

• For a point, construct an n-dimensional sphere of radius Eps (for an n-dimensional dataset).

• Count the number of data points within the sphere; denote this count as p.

• If p >= min_pts, where min_pts is the minimum number of points that must lie within the radius Eps, mark the center point as part of the cluster, and also mark the points inside the sphere as part of the cluster.

• Repeat these steps for the other points in the sphere (except the center) to expand the cluster.

• If p < min_pts, ignore the point and proceed to the next point in the dataset.

Following is a pseudo-code algorithm of DBSCAN [10]:

DBSCAN(D, eps, MinPts) {
    C = 0
    for each point P in dataset D {
        if P is visited
            continue to next point
        mark P as visited
        NeighborPts = regionQuery(P, eps)
        if sizeof(NeighborPts) < MinPts
            mark P as NOISE
        else {
            C = next cluster
            expandCluster(P, NeighborPts, C, eps, MinPts)
        }
    }
}

expandCluster(P, NeighborPts, C, eps, MinPts) {
    add P to cluster C
    for each point P' in NeighborPts {
        if P' is not visited {
            mark P' as visited
            NeighborPts' = regionQuery(P', eps)
            if sizeof(NeighborPts') >= MinPts
                NeighborPts = NeighborPts joined with NeighborPts'
        }
        if P' is not yet a member of any cluster
            add P' to cluster C
    }
}

regionQuery(P, eps) {
    return all points within P's eps-neighborhood (including P)
}

The overall average runtime complexity achieved by this algorithm is O(n log n). The worst-case runtime complexity is O(n²).
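To make the algorithm above concrete, the short Python sketch below clusters a toy two-moons dataset with scikit-learn's DBSCAN implementation, assuming scikit-learn and NumPy are installed. The eps and min_samples values are illustrative choices for this toy data, not tuned recommendations.

# Minimal DBSCAN usage sketch (assumes scikit-learn and NumPy are installed).
# eps and min_samples are illustrative values, not tuned recommendations.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a classic example of arbitrarily shaped clusters.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise points
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")

Points labeled -1 by scikit-learn correspond to the outliers (noise points) described above.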

      3. LITERATURE REVIEW

In this section, we discuss and evaluate different modifications of the original DBSCAN algorithm. Each of the following algorithms pinpoints certain flaws in the implementation of DBSCAN and addresses them with its own approach. A complete evaluation of each modified algorithm, along with relevant future work where available, is also given. The following algorithms are improved versions of DBSCAN:

DDCAR [3] (Data Density Clustering using Automated Radii) is a fully autonomous data-density-based clustering technique proposed by Richard Hyde and Plamen Angelov in 2014. The algorithm automatically determines the number of clusters and derives suitable initial radii from the data itself. It is non-iterative, i.e., it assigns each sample to the appropriate cluster only once (an illustrative sketch is given after the list below). The main advantages of this study over DBSCAN [2] are:

• No prior knowledge of the number of clusters is required.

• No initial radii or other user input is required.

• No knowledge of the data density is required.

• Data are assigned to their original clusters accurately.

• Time complexity is reduced and depends on the clustering data.
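The sketch below is not the published DDCAR method (which relies on a recursive density equation); it is only a simplified illustration, under stated assumptions, of the two ideas highlighted above: deriving a clustering radius from the data itself and assigning each sample to a cluster in a single pass.

# Illustrative sketch only: NOT the published DDCAR recursion. It demonstrates
# deriving a radius from the data (no user input) and one-pass assignment.
import numpy as np

def auto_radius_clustering(X):
    n = len(X)
    # Data-driven radius: a fraction of the mean pairwise distance (an assumed
    # stand-in for DDCAR's recursive density equation).
    diffs = X[:, None, :] - X[None, :, :]
    mean_dist = np.sqrt((diffs ** 2).sum(-1)).sum() / (n * (n - 1))
    radius = 0.3 * mean_dist

    centers, labels = [], np.empty(n, dtype=int)
    for i, x in enumerate(X):            # single pass: each sample is assigned once
        if centers:
            d = np.sqrt(((np.array(centers) - x) ** 2).sum(1))
            j = int(d.argmin())
            if d[j] <= radius:
                labels[i] = j
                continue
        centers.append(x)                # otherwise start a new cluster
        labels[i] = len(centers) - 1
    return labels, radius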

RDD-DBSCAN [4] is an algorithm proposed by Irving Cordova and Teng-Sheng Moh in 2015. It addresses the limited utility of DBSCAN on large datasets, since the original algorithm is not efficient when working with Resilient Distributed Datasets (RDDs), a fast data-processing abstraction created for in-memory computation over large datasets. Experimentally, RDD-DBSCAN scales as the number of nodes in a given cluster scales, while generating the same results as the sequential version of DBSCAN (a simplified sketch of the partitioning idea is given after the lists below). The main advantages of this study over DBSCAN [2] are:

        • Overcoming the scalability limitations by operating in a fully distributed fashion.

        • The algorithm scales as the number of nodes in a given cluster scales, while generating similar results as the sequential version of DBSCAN.

• Improved in-memory computation and greater efficiency.

          Future work can be done in the following areas:

• Removing the need to load the complete dataset for a given partition into memory.

• Selecting a better partitioning scheme for the data space.
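The sketch below is not the authors' Scala/Spark implementation; it only illustrates, in plain Python with assumed eps and min_samples values, the idea of partitioning the data space and running DBSCAN locally on each partition in parallel. The reconciliation of clusters across partition borders, which RDD-DBSCAN performs, is deliberately omitted.

# Plain-Python illustration of "partition the space, cluster each partition
# locally in parallel". Not the authors' Spark implementation; cross-partition
# cluster merging is omitted. Run under an `if __name__ == "__main__":` guard
# on platforms that spawn worker processes.
import numpy as np
from multiprocessing import Pool
from sklearn.cluster import DBSCAN

def cluster_partition(part):
    # Each worker clusters its own partition independently (as a Spark executor would).
    return DBSCAN(eps=0.3, min_samples=5).fit_predict(part)

def partitioned_dbscan(X, n_partitions=4):
    # Naive partitioning by the first coordinate; RDD-DBSCAN uses a smarter
    # spatial partitioning scheme and then reconciles clusters across borders.
    order = np.argsort(X[:, 0])
    parts = np.array_split(X[order], n_partitions)
    with Pool(n_partitions) as pool:
        local_labels = pool.map(cluster_partition, parts)
    return parts, local_labels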

FastDBSCAN [5] was proposed by Vu Viet Thang, D. V. Pantiukhin and A. I. Galushkin in 2015. This algorithm divides the data into k partitions (using k-means) and then uses a min-max method to select points for DBSCAN clustering. It is a hybrid clustering algorithm, combining k-means, the min-max method and DBSCAN to recover the final clusters and outliers (a simplified sketch is given after the lists below). The main advantages of this study over DBSCAN [2] are:

• Overcomes the quadratic time complexity (O(n²)) of DBSCAN.

• Improves computational time and accuracy.

Directions for further research are given as follows:

• Developing graph-based clustering based on k-means.

• Applying the algorithm to intrusion detection system datasets.
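The following sketch illustrates the hybrid idea behind FastDBSCAN under simplifying assumptions: k-means partitions the data, a subset of representative points is selected from each partition, and DBSCAN is run on the reduced set. The representative selection used here (closest and farthest points from each centroid) is a crude stand-in for the paper's min-max method, and all parameter values are illustrative only.

# Hedged sketch of the k-means + DBSCAN hybrid idea. The representative
# selection below is a simplification, not the paper's exact min-max method.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def fast_dbscan_sketch(X, k=10, reps_per_part=20, eps=0.3, min_samples=5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    half = reps_per_part // 2
    reps = []
    for c in range(k):
        part = X[km.labels_ == c]
        d = np.linalg.norm(part - km.cluster_centers_[c], axis=1)
        idx = np.argsort(d)
        # crude stand-in for min-max selection: closest and farthest points
        keep = np.unique(np.concatenate([idx[:half], idx[-half:]]))
        reps.append(part[keep])
    reps = np.vstack(reps)
    # DBSCAN now runs on a much smaller set of representative points
    return reps, DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reps)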

DDBSCAN [6] (Different Densities-Based Spatial Clustering of Applications with Noise) is a modified version of DBSCAN [2] that uses new concepts to deal with spatial datasets. This algorithm was proposed by M. F. Hassanin, M. Hassan and Abdalla Shoeb in 2015. The idea behind the algorithm is to define a density factor for the cluster and for the object, and then to define a threshold parameter as the decision criterion for whether the object joins the cluster (a brief illustrative sketch follows the list below). With this technique, any cluster will contain nodes of indistinguishable density. The main advantages of this study over DBSCAN [2] are:

          • Multi-density cluster handling is enhanced.

          • Effective clustering of adjacent clusters.

• Proper handling of noise points amongst adjacent clusters.
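The snippet below is not the authors' exact formulation; it merely illustrates the density-factor idea under assumed definitions: a candidate point joins a cluster only when its local density is comparable to the cluster's current average density, as controlled by a threshold ratio.

# Hedged illustration of the DDBSCAN idea (not the authors' exact formulation).
import numpy as np

def local_density(X, i, eps):
    # number of points within eps of point i (including itself)
    return int((np.linalg.norm(X - X[i], axis=1) <= eps).sum())

def may_join(X, cluster_members, candidate, eps, density_threshold=0.5):
    cand = local_density(X, candidate, eps)
    avg = np.mean([local_density(X, j, eps) for j in cluster_members])
    # join only if the densities are comparable, so adjacent clusters of
    # different densities stay separate instead of being merged
    return min(cand, avg) / max(cand, avg) >= density_threshold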

DSets-DBSCAN [7] is a parameter-free hybrid clustering algorithm proposed by Jian Hou, Huijun Gao and Xuelong Li in 2016. It is mainly used in data clustering and image segmentation experiments, and it works in two main steps. First, histogram equalization is applied to the pairwise similarity matrix before it is used in clustering, which eliminates the need for a user-specified regularization parameter and makes the algorithm parameter-free (a sketch of this step is given after the list below). Then, DSets clustering is run, followed by DBSCAN-based cluster extension, and the clusters are extracted sequentially. The main advantages of this study over DBSCAN [2] are:

• Compared with a carefully tuned conventional DBSCAN, this algorithm performs better.

• Efficiency in data clustering is increased.
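A minimal sketch of the histogram-equalization step follows, under the assumption that a rank-based transform is an acceptable stand-in for the paper's exact procedure: each pairwise similarity is replaced by its rank, mapped uniformly onto [0, 1], so the result no longer depends on any user-chosen scaling of the similarities.

# Hedged sketch: rank-based histogram equalization of a pairwise similarity
# matrix (a common equalization scheme, not necessarily the paper's exact one).
import numpy as np

def equalize_similarity(S):
    # S is an n x n pairwise similarity matrix
    flat = S.flatten()
    ranks = flat.argsort().argsort().astype(float)   # rank of each entry
    equalized = ranks / (len(flat) - 1)              # map ranks uniformly to [0, 1]
    return equalized.reshape(S.shape)

# example: similarities derived from pairwise distances of random points
X = np.random.default_rng(0).normal(size=(50, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
S_eq = equalize_similarity(np.exp(-D))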

TABLE I. COMPARISON OF ENHANCED DBSCAN ALGORITHMS

DBSCAN
• Features: Groups together closely packed points, called clusters.
• Advantages: Finds arbitrarily shaped clusters; robust to outliers.
• Disadvantages: Complex computational costs; prior knowledge of the number of clusters is required; knowledge of the data density is required; scalability limitations; difficulty with multi-density cluster discovery, adjacent cluster discovery, and noise points amongst adjacent clusters; relies on heavy user-specified parameters.
• Time complexity: Average runtime complexity O(n log n); worst-case runtime complexity O(n²); non-matrix-based implementation complexity O(n).
• Parameters needed: 2 (Eps and Min_pts).
• Application: Fraud detection tasks.

DDCAR
• Features: Automatically determines the number of clusters and derives suitable data-driven radii through a recursive density equation.
• Advantages: No prior knowledge of the number of clusters is required; no initial radii or other user input required; no knowledge of the data density is required; fewer calculations through the recursive density equation; efficiency gains are significant on larger datasets and big data.
• Disadvantages: On smaller datasets the time advantage may be reduced, since the density calculation is carried out twice per cluster when adjusting the radii; the algorithm is non-iterative and assigns each sample to the appropriate cluster only once; time complexity depends on the clustering data.
• Time complexity: Varies; depends on the size of the dataset.
• Parameters needed: None.
• Application: Atmospheric sciences, using Hyper Pole-to-Pole Observations (HIPPO) datasets and large climate-based datasets.

RDD-DBSCAN
• Features: A modified DBSCAN algorithm that takes full advantage of Apache Spark's parallel capabilities, implemented in the Scala programming language and run on top of Apache Spark.
• Advantages: Addresses the limited large-dataset utility of DBSCAN, which is not efficient when working with RDDs; overcomes the scalability issues of the traditional DBSCAN algorithm by operating in a fully distributed fashion; efficient performance on the Apache Spark platform; improved memory management; improved efficiency.
• Disadvantages: The complete dataset for a given partition must be loaded into memory; the selection of a better partitioning scheme for the data space is unknown; improvement on n-dimensional datasets (n > 1) is unknown.
• Time complexity: O(n log n).
• Parameters needed: 3 (Eps, Min_pts, Max_pts).
• Application: Uses Apache Spark; experiments on various datasets.

FastDBSCAN
• Features: Divides the data into k partitions (using k-means), then uses a min-max method to select points for DBSCAN clustering.
• Advantages: Overcomes the quadratic time complexity of DBSCAN; improves computational time; improves clustering accuracy.
• Disadvantages: Needs more research to interpret the advantages of combining partitioning-based clustering and density-based clustering into a hybrid method.
• Time complexity: Linear time complexity.
• Parameters needed: 3 (input dataset D, number of clusters k for k-means, proportion of data t).
• Application: Intrusion detection systems; data analysis.

DDBSCAN
• Features: Computes the density of a cluster with respect to the radius value Eps and Min_pts, then provides a density threshold that decides whether a point joins a certain cluster or not.
• Advantages: Overcomes the problems of multi-density cluster discovery, adjacent cluster discovery, and noise points amongst clusters.
• Disadvantages: Careful parameter tuning is required for optimum results.
• Time complexity: Not stated.
• Parameters needed: 3 (Eps, Min_pts, density threshold).
• Application: Simulation (testing the system with real and artificial datasets).

DSets-DBSCAN
• Features: Applies histogram equalization to the pairwise similarity of the input data, which makes the DSets results independent of user-specified parameters; clusters from DSets are then extended with DBSCAN.
• Advantages: Parameter-free clustering algorithm; prevents over-segmentation of arbitrary shapes; effective in data clustering.
• Disadvantages: Not stated.
• Time complexity: Not stated.
• Parameters needed: 3 parameters for the initial histogram equalization technique (dataset D, Eps, Min_pts); none for DSets+DBSCAN.
• Application: Mainly used in data clustering and image segmentation.

      4. CONCLUSION

In this paper we have discussed and compared different modified DBSCAN algorithms: DBSCAN, DDCAR, RDD-DBSCAN, FastDBSCAN, DDBSCAN and DSets-DBSCAN. This evaluation was motivated by the many disadvantages of the original DBSCAN. DDCAR and DSets-DBSCAN are parameter-free clustering algorithms; they do not require user input parameters. RDD-DBSCAN is effective when working with Resilient Distributed Datasets. FastDBSCAN improves computational time and accuracy by overcoming the quadratic time complexity of the traditional DBSCAN. DDBSCAN combats the problems of multi-density clustering and of noise points amongst adjacent clusters. Thus we obtain a holistic view of the changes made by each algorithm.

REFERENCES

  1. Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, 2012.

  2. M. Ester, H.-P. Kriegel, J. Sander and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. Int. Conf. Knowledge Discovery and Data Mining, 1996, pages 226-231.

  3. Richard Hyde and Plamen Angelov, "A Fully Autonomous Data Density Based Clustering Technique," in Evolving and Autonomous Learning Systems (EALS), 2014 IEEE Symposium, Orlando, FL, 9-12 Dec. 2014, pages 116-123.

  4. Irving Cordova and Teng-Sheng Moh, "DBSCAN on Resilient Distributed Datasets," in High Performance Computing & Simulation (HPCS), 2015 International Conference, Amsterdam, 20-24 July 2015, pages 531-540.

  5. Vu Viet Thang, D. V. Pantiukhin and A. I. Galushkin, "A Hybrid Clustering Algorithm: The FastDBSCAN," in 2015 International Conference on Engineering and Telecommunication (EnT), Moscow, 18-19 Nov. 2015, pages 69-74.

  6. Mohammad F. Hassanin, Mohamed Hassan and Abdalla Shoeb, "DDBSCAN: Different Densities-Based Spatial Clustering of Applications with Noise," in 2015 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), Kumaracoil, 18-19 Dec. 2015, pages 401-404.

  7. Jian Hou, Huijun Gao and Xuelong Li, "DSets-DBSCAN: A Parameter-Free Clustering Algorithm," IEEE Transactions on Image Processing, Volume 25, Issue 7, 2016, pages 3182-3193.

  8. Garima, Hina Gulati and P. K. Singh, "Clustering Techniques in Data Mining: A Comparison," in 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), 2015, pages 410-415.

  9. Saif ur Rehman, Sohail Asghar, Simon Fong and S. Sarasvady, "DBSCAN: Past, Present and Future," in Applications of Digital Information and Web Technologies (ICADIWT), 2014 Fifth International Conference, Bangalore, 17-19 Feb. 2014, pages 232-238.

  10. Wikipedia (2016, May 27). DBSCAN, from Wikipedia, the free encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/DBSCAN#Algorithm
