Comparison of Enhanced DBSCAN Algorithms: A Review

DOI : 10.17577/IJERTCONV5IS01027


Harsh Shinde

      Student of Computer Engineering
      Atharva College of Engineering, Mumbai University
      Mumbai, MH, India

      Amruta Sankhe

      Prof. of Computer Engineering
      Atharva College of Engineering, Mumbai University
      Mumbai, MH, India

      Abstract: While data containing all kinds of information is becoming ubiquitous, proper processing of such data is essential. To classify spatial data, various clustering algorithms have been invented. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the most accomplished clustering algorithms with respect to discovering arbitrarily shaped clusters in spatial datasets. Due to its flexibility and tremendous research potential, DBSCAN is one of the most cited algorithms in the scholarly literature. Because many researchers have experimented with and evaluated the algorithm, several modifications of DBSCAN have been made in order to produce more efficient outcomes or to reduce time complexity. This paper discusses such modified algorithms and how they surpass the original DBSCAN in certain ways. A thorough analysis of each algorithm is presented and a critical evaluation is made accordingly. These algorithms are then comparatively evaluated with regard to various parameters.

      Index Terms: Clustering algorithms, DBSCAN, DDCAR, RDD-DBSCAN, FastDBSCAN, DDBSCAN, DSets-DBSCAN

      1. Introduction

        Spatial data management requires proper evaluation and processing of information to generate useful and desired data. For such data processing, the technique of Knowledge Discovery in Databases (KDD), also known as Data Mining, is widely used. Clustering is a popular data mining process which classifies spatially represented datasets [1]. This classification results in groups of objects possessing similar properties. To classify such data, a number of clustering algorithms have been invented and implemented, partitioning the data into arbitrarily shaped groups of similar objects called clusters. These clustering algorithms help to extract useful information, generally in the form of patterns. Density-based clustering techniques are used to mine information from large datasets.

        Due to the emerging trends of big data, density-based clustering algorithms have been widely cited in the scientific literature thanks to their tremendous research potential and the vast choice of enhancements to the original algorithms. Although other algorithms may work well on particular datasets, density-based clustering algorithms are more robust across varied datasets. Of the many density-based clustering algorithms, such as DBSCAN, DENCLUE, DBCLASD, OPTICS, etc. [8][9], DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the most universally acclaimed. DBSCAN and other density-based algorithms are important because they are largely unaffected by noise points and can handle clusters of various shapes and sizes; there are many clusters that DBSCAN can find that other algorithms cannot. DBSCAN searches for core objects, i.e., objects having dense neighborhoods. DBSCAN has a number of applications, ranging from machine learning libraries to weather analysis and atmospheric science on a national scale [3]. Due to its vast experimental potential, many researchers have cited the original algorithm and made changes to it. Some algorithms reduce the computational burden, while others reduce the time complexity over substantially varied datasets.

        Researchers have been interested in DBSCAN since its proposal, as the algorithm showed massive potential for various data-driven applications. With the emerging trends of big data and data mining tools, applications of DBSCAN continue to bring effective results for diverse datasets. Nevertheless, DBSCAN has various limitations with respect to the variation and size of datasets. Following such limitations, various advanced algorithms were invented to overcome the different shortcomings of the original DBSCAN. Some of these changes increase the effectiveness of the algorithm, while others produce results similar to the original algorithm but decrease its time complexity [5].

        This paper gives an insight into some recent changes made to the original DBSCAN. These changes aim at the overall development of the original algorithm and at overcoming various drawbacks associated with DBSCAN. The algorithms discussed in this paper are: DBSCAN [2], DDCAR [3], RDD-DBSCAN [4], FastDBSCAN [5], DDBSCAN [6] and DSets-DBSCAN [7]. Each of these algorithms is thoroughly analyzed and critically evaluated. Moreover, a separate comparison is made with regard to certain parameters, to obtain a holistic view of the changes made by each of the algorithms.

        The rest of the paper is organized as follows. We discuss the original DBSCAN algorithm in Section 2. In Section 3, we present summaries and elaborations of different algorithms which provide certain modifications and improvements to the original DBSCAN algorithm. Section 4 provides the conclusion of the paper.

      2. DBSCAN

        DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm which was proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 [2]. The main purpose of the algorithm is to connect core objects (objects having dense neighborhoods) and their neighbors to form dense regions as clusters. For a set of points in space, the algorithm groups together closely packed points, while points lying in low-density regions are called outliers. Outliers can be used to detect irrelevant or anomalous information, for example during fraud detection. The algorithm is flexible and dynamic with respect to data.

        The steps of DBSCAN are as follows:

        • For a point, construct an n-dimensional sphere of radius Eps (for n-dimensional datasets).

        • Count the number of data points within the sphere. Denote this value p.

        • If p >= min_pts, min_pts being the minimum number of points that should be present within the radius Eps, mark the center point as part of a cluster; also mark the points inside the sphere as part of that cluster.

        • Repeat these steps for the other points in the sphere, except the center, to expand the cluster.

        • If p < min_pts, mark the point as noise for now and proceed to the next point in the dataset.

        The following is a pseudo-code description of DBSCAN:

          DBSCAN(D, eps, MinPts) {
             C = 0
             for each point P in dataset D {
                if P is visited
                   continue next point
                mark P as visited
                NeighborPts = regionQuery(P, eps)
                if sizeof(NeighborPts) < MinPts
                   mark P as NOISE
                else {
                   C = next cluster
                   expandCluster(P, NeighborPts, C, eps, MinPts)
                }
             }
          }

          expandCluster(P, NeighborPts, C, eps, MinPts) {
             add P to cluster C
             for each point P' in NeighborPts {
                if P' is not visited {
                   mark P' as visited
                   NeighborPts' = regionQuery(P', eps)
                   if sizeof(NeighborPts') >= MinPts
                      NeighborPts = NeighborPts joined with NeighborPts'
                }
                if P' is not yet a member of any cluster
                   add P' to cluster C
             }
          }

          regionQuery(P, eps) {
             return all points within P's eps-neighborhood (including P)
          }

        The overall average runtime complexity achieved by this algorithm is O(n log n). The worst-case runtime complexity is O(n²).
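The pseudocode above can be turned into a short runnable sketch. The 2-D points, the Euclidean distance, and the brute-force regionQuery are illustrative choices (DBSCAN works with any notion of neighborhood); because regionQuery here scans every point, this version runs in O(n²) rather than the O(n log n) achievable with a spatial index:

```python
# A minimal, brute-force sketch of DBSCAN following the pseudocode above.
import math

NOISE = -1          # label for points left in low-density regions
UNVISITED = None    # label for points not yet processed

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], including itself."""
    px, py = points[idx]
    return [i for i, (x, y) in enumerate(points)
            if math.hypot(x - px, y - py) <= eps]

def dbscan(points, eps, min_pts):
    """Return a cluster label (0, 1, ...) or NOISE (-1) for every point."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for p in range(len(points)):
        if labels[p] is not UNVISITED:
            continue                          # already visited
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE                 # may become a border point later
            continue
        cluster += 1                          # P is a core point: new cluster
        seeds, i = list(neighbors), 0
        while i < len(seeds):                 # expandCluster
            q = seeds[i]
            i += 1
            if labels[q] == NOISE:
                labels[q] = cluster           # noise reached from a core point
            if labels[q] is not UNVISITED:    #   becomes a border point
                continue
            labels[q] = cluster
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:   # q is also a core point:
                seeds.extend(q_neighbors)     #   grow the search frontier
    return labels
```

With eps = 1.0 and min_pts = 3, two tight groups of four points form two clusters, and an isolated point between them is labeled noise.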


      3. Modified DBSCAN Algorithms

        In this section, we discuss and evaluate different modifications of the original DBSCAN algorithm. Each of the following algorithms pinpoints various flaws in the implementation of DBSCAN. These flaws were then addressed through a particular approach in each of the following algorithms. A complete evaluation of the modified algorithms and relevant future work, if any, are also discussed. The following algorithms are improved versions of DBSCAN:

        DDCAR [3] (Data Density Clustering using Automated Radii) is a fully autonomous data-density-based clustering technique, proposed by Richard Hyde and Plamen Angelov in 2014. The algorithm automatically determines the number of clusters and derives suitable initial radii. The algorithm is non-iterative, i.e., it assigns each sample to the appropriate cluster only once. The main advantages of this study over DBSCAN [2] are:

        • No prior knowledge of the number of clusters is required.

        • No initial radii or other user input required.

        • No knowledge of data density required.

        • Assignment of data to their original cluster is done accurately.

        • Time complexity is reduced and it is dependent on the clustering data.

          RDD-DBSCAN [4] is an algorithm proposed by Irving Cordova and Teng-Sheng Moh in 2015. It addresses DBSCAN's inefficiency on large datasets by implementing the algorithm on Resilient Distributed Datasets (RDDs), a fast data processing abstraction created for in-memory computation over large datasets. Experimentally, RDD-DBSCAN scales as the number of nodes in a given cluster scales, while generating the same results as the sequential version of DBSCAN. The main advantages of this study over DBSCAN [2] are:

        • Overcoming the scalability limitations by operating in a fully distributed fashion.

        • The algorithm scales as the number of nodes in a given cluster scales, while generating similar results as the sequential version of DBSCAN.

        • Improved in-memory computation and greater efficiency.

          Future work can be done in the following areas:

        • Avoiding the need to load the complete dataset for a given partition into memory.

        • Selection of better partitioning scheme for the data space.

        FastDBSCAN [5] was proposed by Vu Viet Thang, D.V. Pantiukhin and A.I. Galushkin in 2015. This algorithm divides the data into k partitions (using k-means), then uses a min-max method to select points for DBSCAN clustering. It is a hybrid clustering algorithm, combining k-means, the min-max method and DBSCAN to recover the final clusters and outliers. The main advantages of this study over DBSCAN [2] are:

            • Overcoming the quadratic time complexity (O(n²)) of DBSCAN.

            • Improving computational time and accuracy.

              Directions for further research are given as follows:

            • Developing graph-based clustering based on k-means.

            • Applying the algorithm in intrusion detection system datasets.
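The exact min-max selection rule of the FastDBSCAN paper is not reproduced here. As a rough, hypothetical illustration of the partition-then-reduce idea, the sketch below partitions 1-D data with a few Lloyd (k-means) iterations and then keeps only each partition's extreme points as representatives for a subsequent DBSCAN pass; all names and the selection rule itself are illustrative assumptions, not the authors' implementation:

```python
import random

def kmeans_1d(xs, k, iters=20, seed=0):
    """A few Lloyd iterations on 1-D data; returns (centers, assignment)."""
    centers = random.Random(seed).sample(xs, k)
    assign = [0] * len(xs)
    for _ in range(iters):
        for i, x in enumerate(xs):                       # assignment step
            assign[i] = min(range(k), key=lambda c: abs(x - centers[c]))
        for c in range(k):                               # update step
            members = [x for i, x in enumerate(xs) if assign[i] == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, assign

def select_representatives(xs, assign, k):
    """Keep each partition's minimum and maximum point -- an illustrative
    stand-in for the paper's min-max selection step."""
    reps = set()
    for c in range(k):
        members = [x for i, x in enumerate(xs) if assign[i] == c]
        if members:
            reps.update((min(members), max(members)))
    return sorted(reps)
```

The reduced representative set (here 4 points instead of 8) is what a final DBSCAN-style pass would then cluster, which is where the speed-up over running DBSCAN on the full dataset comes from.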

        DDBSCAN [6] (Different Densities-Based Spatial Clustering of Applications with Noise) is a modified version of DBSCAN [2] that uses new concepts to deal with spatial datasets. The algorithm was proposed by M.F. Hassanin, M. Hassan and Abdalla Shoeb in 2015. The idea behind the algorithm is to assign a density factor to both the cluster and the object, and then define a threshold parameter as the decision criterion for whether the object joins the cluster. With this technique, any cluster will contain nodes of indistinguishable density. The main advantages of this study over DBSCAN [2] are:

          • Multi-density cluster handling is enhanced.

          • Effective clustering of adjacent clusters.

          • Clustering of noise points amongst adjacent clusters.
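The precise density factor and threshold rule of DDBSCAN are not reproduced here; the following is a hypothetical sketch of how such a decision criterion could look. A point's local density is taken as its Eps-neighborhood size, and a candidate is admitted to a cluster only when its density lies within a factor tau of the cluster's mean density, keeping each cluster's density roughly homogeneous. The names and the rule itself are illustrative assumptions:

```python
import math

def local_density(points, i, eps):
    """Eps-neighborhood size of points[i] (including the point itself)."""
    xi, yi = points[i]
    return sum(1 for x, y in points if math.hypot(x - xi, y - yi) <= eps)

def admits(cluster_densities, candidate_density, tau):
    """Decision criterion: the candidate's density must lie within a factor
    tau of the cluster's mean density."""
    mean = sum(cluster_densities) / len(cluster_densities)
    return mean / tau <= candidate_density <= mean * tau
```

Under such a rule, a sparse point adjacent to a dense cluster is rejected even if it falls within Eps of a cluster member, which is how adjacent clusters of different densities stay separated.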

            DSets-DBSCAN [7] is a parameter-free hybrid clustering algorithm proposed by Jian Hou, Huijun Gao and Xuelong Li. The algorithm is mainly used in data clustering and image segmentation experiments. It functions in two main steps. First, histogram equalization is applied to the similarity matrix before it is used in clustering, eliminating the regulation parameter; this makes the algorithm parameter-free. Second, DSets clustering is run, followed by DBSCAN-based clustering, and clusters are extracted sequentially. The main advantages of this study over DBSCAN [2] are:

          • The algorithm performs better than conventional DBSCAN, which requires careful parameter tuning.

          • Efficiency is increased with regards to data clustering.
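The authors' exact equalization procedure is not reproduced here; as a generic illustration, histogram-equalizing a pairwise similarity matrix can be sketched as a rank transform that spreads the off-diagonal similarities evenly over [0, 1], flattening their histogram so that no user-chosen scaling parameter is needed:

```python
def equalize_similarity(sim):
    """Rank-transform the off-diagonal entries of a symmetric similarity
    matrix onto [0, 1]; equal similarities share a rank."""
    n = len(sim)
    values = sorted({sim[i][j] for i in range(n) for j in range(n) if i != j})
    spread = {v: k / max(len(values) - 1, 1) for k, v in enumerate(values)}
    # Diagonal entries (self-similarity) are pinned to 1.0 by convention.
    return [[1.0 if i == j else spread[sim[i][j]] for j in range(n)]
            for i in range(n)]
```

After the transform, only the ordering of the similarities matters, not their absolute scale, which is what removes the dependence on a regulation parameter.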

        TABLE I. COMPARISON OF THE DISCUSSED ALGORITHMS

| Algorithm | Technique | Advantages | Limitations / future work | Runtime complexity | Input parameters | Applications |
|---|---|---|---|---|---|---|
| DBSCAN [2] | Groups together closely packed points into clusters; points in low-density regions are outliers. | Finds arbitrarily shaped clusters; robust to noise. | Heavy user parameter input; careful tuning of Eps and Min_pts and prior knowledge of the data density are required. | Average O(n log n); worst case O(n²). | 3 (Eps, Min_pts, dataset D) | Fraud detection; general data clustering. |
| DDCAR [3] | Automatically determines the number of clusters and derives suitable data-driven radii by the use of a recursive density equation. | No prior knowledge of the number of clusters, initial radii, other user input or data density required; non-iterative, assigning each sample to the appropriate cluster only once; fewer calculations through the recursive density equation. | In smaller datasets the time advantage may be reduced, due to carrying out the density calculation twice per cluster when adjusting the radii. | Depends on the clustering data. | None | Experiments on various datasets, including large climate datasets (Hyper Pole-to-Pole Observations, HIPPO). |
| RDD-DBSCAN [4] | A modified DBSCAN which takes full advantage of Apache Spark's parallel capabilities; implemented in the Scala programming language and run on top of Apache Spark. | Overcomes the scalability issues of traditional DBSCAN by operating in a fully distributed fashion; improved memory performance on larger datasets and big data. | Needs to load the complete dataset for a given partition into memory; selection of a better partitioning scheme for the data space is future work. | Varies; depends on the size of the dataset. | – | Atmospheric sciences, using Apache Spark. |
| FastDBSCAN [5] | Divides the data into k partitions (using k-means), then uses a min-max method to select points for DBSCAN clustering. | Overcomes the quadratic time complexity of DBSCAN; improves computational time and clustering accuracy. | Needs more research on graph-based clustering based on k-means and on constructing hybrid clustering; improvement on n-dimensional datasets (n > 1) is unknown. | – | 3 (input dataset D, number of clusters k for k-means, proportion of data) | Intrusion detection systems. |
| DDBSCAN [6] | Computes the density of a cluster with respect to the radius Eps and Min_pts, then provides a density threshold which determines whether a point joins a cluster. | Handles multi-density clusters; adjacent cluster discovery; clustering of noise points amongst adjacent clusters. | – | – | 3 (Eps, Min_pts, density threshold) | Data clustering. |
| DSets-DBSCAN [7] | Applies histogram equalization to the pairwise similarity of the input data, making the DSets result independent of user-specified parameters; then extends clusters from DSets with DBSCAN. | Parameter-free clustering; prevents over-segmentation of arbitrarily shaped clusters; efficient in data clustering. | – | – | 3 for the initial histogram equalization technique (dataset D, Eps, Min_pts); none for clustering | Data clustering and image segmentation. |


4. Conclusion

In this paper we have discussed and compared different modified DBSCAN algorithms: DBSCAN, DDCAR, RDD-DBSCAN, FastDBSCAN, DDBSCAN and DSets-DBSCAN. This evaluation was motivated by the many disadvantages DBSCAN has regarding a number of its features. DDCAR and DSets-DBSCAN are parameter-free clustering algorithms; they do not require user input parameters. RDD-DBSCAN is effective while working with Resilient Distributed Datasets. FastDBSCAN improves computational time and accuracy by overcoming the quadratic time complexity of the traditional DBSCAN. DDBSCAN combats the problems of multi-density clustering and of noise points amongst adjacent clusters. We have thus obtained a holistic view of the changes made by each of the algorithms.


References

  1. Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, 2012.

  2. M. Ester, H.-P. Kriegel, J. Sander and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. Int. Conf. Knowledge Discovery and Data Mining, 1996, pages 226–231.

  3. Richard Hyde and Plamen Angelov, "A fully autonomous data density based clustering technique," in Evolving and Autonomous Learning Systems (EALS), 2014 IEEE Symposium, Orlando, FL, 9-12 Dec. 2014, pages 116–123.

  4. Irving Cordova and Teng-Sheng Moh, "DBSCAN on Resilient Distributed Datasets," in High Performance Computing & Simulation (HPCS), 2015 International Conference, Amsterdam, 20-24 July 2015, pages 531–540.

  5. Vu Viet Thang, D. V. Pantiukhin and A. I. Galushkin, "A hybrid clustering algorithm: the FastDBSCAN," in 2015 International Conference on Engineering and Telecommunication (EnT), Moscow, 18-19 Nov. 2015, pages 69–74.

  6. Mohammad F. Hassanin, Mohamed Hassan and Abdalla Shoeb, "DDBSCAN: Different Densities-Based Spatial Clustering of Applications with Noise," in 2015 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), Kumaracoil, 18-19 Dec. 2015, pages 401–404.

  7. Jian Hou, Huijun Gao and Xuelong Li, "DSets-DBSCAN: a parameter-free clustering algorithm," in IEEE Transactions on Image Processing (Volume 25, Issue 7), 2016, pages 3182–3193.

  8. Garima, Hina Gulati and P. K. Singh, "Clustering techniques in data mining: a comparison," in 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), 2015, pages 410–415.

  9. Saif ur Rehman, Sohail Asghar, Simon Fong and S. Sarasvady, "DBSCAN: past, present and future," in Applications of Digital Information and Web Technologies (ICADIWT), 2014 Fifth International Conference, Bangalore, 17-19 Feb. 2014, pages 232–238.

  10. Wikipedia. (2016, May 27) DBSCAN from Wikipedia, the free encyclopedia. [Online] Available at:
