DBWGIE-MR: A density-based clustering algorithm by using the weighted grid and information entropy based on MapReduce

Abstract

The main target of this paper is to design a density-based clustering algorithm using the weighted grid and information entropy based on MapReduce, denoted DBWGIE-MR, to address three problems of density-based big data clustering algorithms: unreasonable division of the data grid, low accuracy of clustering results, and low parallelization efficiency. The algorithm is implemented in three stages: data partitioning, local clustering, and global clustering, and we propose several strategies for each stage. In the first stage, we propose an adaptive division strategy (ADG) that divides the grid according to the spatial distribution of the data points. In the second stage, we design a weighted grid construction strategy (NE) that strengthens the relevance between grids to improve clustering accuracy. Based on the weighted grid and information entropy, we also design a density calculation strategy (WGIE) to calculate the density of each grid. To improve parallel efficiency, a core cluster computing algorithm based on MapReduce (COMCORE-MR) is proposed to compute the core clusters in parallel. In the third stage, we propose a core cluster merging algorithm based on the disjoint-set (MECORE) to speed up the convergence of merging local clusters. Furthermore, a parallel core cluster merging algorithm based on MapReduce (MECORE-MR) is proposed to obtain the clustering results faster, which improves the efficiency of core cluster merging. We conduct experiments on four real datasets. Compared with H-DBSCAN, DBSCAN-MR and MR-VDBSCAN, the experimental results show that DBWGIE-MR has higher stability and accuracy and takes less time in parallel clustering.

Keywords

Big data; density-based clustering algorithm; weighted grid; information entropy

1 Introduction

In the current era, the need, growth, and expansion of data from different sources create challenges in their collection, processing, management, and analysis. Big data is the next generation of computation, which tackles these challenges and obtains useful information from the data [14]. It has brought a new perspective and has been used by many researchers in recent years.

Data mining is a set of techniques for extracting potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data. Clustering is one of the most popular tools in data mining. It groups a collection of physical or abstract objects into classes composed of similar objects, and is widely used in statistical learning, artificial intelligence, and pattern recognition [8]. A clustering algorithm groups distinct points into clusters such that members of the same cluster are highly similar to each other, whilst points in different clusters are significantly different. Among clustering algorithms, density-based algorithms such as DBSCAN [15] and OPTICS [7] can find clusters of arbitrary shape and are not sensitive to noise, and have therefore attracted a lot of research interest. However, traditional density-based clustering algorithms cannot be directly applied to big data owing to their high time complexity. Improving the existing density-based clustering algorithms and combining them with distributed computing architectures to lower the computational complexity has become the leading research direction for density clustering algorithms [3, 17–22].

Many solutions have been proposed in this direction, and the advent of the MapReduce [22], Hadoop, and Spark architectures makes them possible [5]. Li et al. [6] first proposed a parallel DBSCAN algorithm based on MapReduce. After dividing the data, this algorithm uses the MapReduce framework to execute DBSCAN in parallel to get local clusters; the global clusters are then obtained by merging the local clusters incrementally. However, this algorithm does not provide an effective method for dividing the data, which leads to high computational complexity. Silva et al. [15] proposed an efficient distributed DBSCAN strategy that uses MapReduce to detect dense areas according to the input parameters and merges clusters incrementally. It also has high computational complexity and low overall parallelization efficiency.

Dividing data effectively and merging local clusters have always been important research topics in the parallelization of density clustering algorithms [4]. Data gridding divides spatial data into a finite number of units, and the points falling into the same grid can be treated as one object, which handles the data division problem well [12]. Hence, Mahran et al. proposed the GriDBSCAN (Using Grid for Accelerating Density-Based Clustering) algorithm [13], which grids the data uniformly. It uses grids as objects to execute DBSCAN in parallel and merges these grid objects to get global clusters. But the algorithm has two obvious problems. First, when dividing the grid evenly, it is difficult to determine the size of the grids. Second, regular grids may split data in high-density areas and create a large number of duplicate boundary points, which seriously affects the clustering results. Besides, local clusters are merged incrementally, resulting in low computational efficiency.

An H-DBSCAN algorithm based on Hadoop and an S-DBSCAN algorithm based on Spark were proposed in [5, 17]. On top of dividing the data equally, they extend the grid boundaries to improve the accuracy of the clustering results and the efficiency of local clustering, but this is not enough. For better efficiency and accuracy, Wang et al. [19] proposed IP-DBSCAN (an incremental parallelization fast clustering algorithm). Based on the number of data points, the algorithm divides the space grid by dichotomy and uses a greedy algorithm to restructure the partitions so that the data is divided reasonably. It then performs local clustering to obtain the candidate cluster sets to be merged, uses R*-tree indexes to judge and process the merging of candidate clusters, establishes an undirected acyclic graph model of the merged clusters, and re-labels the data globally. However, IP-DBSCAN has two obvious shortcomings. On the one hand, the threshold of the grid edge length must be supplied as input when the algorithm divides the data by dichotomy, and different thresholds affect the accuracy of the clustering results. On the other hand, the computational complexity of local clustering is high, and no parallelization strategy is adopted when combining local clusters, so the overall parallelization efficiency of the algorithm needs to be improved.

To address unreasonable data partitioning and low clustering accuracy, Dai and Li [2] proposed partition with reduce boundary points (PRBP), a method that selects partition boundaries based on the distribution of data points to balance the load across nodes, and built the DBSCAN-MR algorithm on PRBP. Bhardwaj and Dash [9] introduced density level partitioning (DLP) into DBSCAN-MR and proposed the VDMR-DBSCAN algorithm, whose merging strategy can identify clusters with different densities. Heidari et al. [11] proposed the MR-VDBSCAN algorithm, which improves the accuracy and speed of local clustering compared with VDMR-DBSCAN. However, none of these algorithms solves the problem of low efficiency in merging local clusters.

To overcome the above limitations, we take the DBSCAN algorithm as the prototype and propose a density-based clustering algorithm using a weighted grid and information entropy based on MapReduce, denoted DBWGIE-MR. The main contributions of our work include:

We propose an ADG (adaptive divide grid) strategy to divide the grid adaptively according to the spatial distribution of data points.

For each data partition, we propose the NE (Neighboring Expand) strategy to construct its weighted grid, which strengthens the relevance between grids and improves the clustering effect. Meanwhile, we propose the WGIE (weighted grid and information entropy) strategy to calculate the grid density, the ɛ-neighborhood, and the core objects of the density clustering algorithm, so that the algorithm is more suitable for the weighted grid. Then, combined with MapReduce, we propose the COMCORE-MR (Computing Core Cluster by using MapReduce) strategy to solve the problem of low efficiency in computing local clusters in parallel density clustering algorithms.

After the local clusters are formed, we propose the merging algorithm MECORE (Merge Core Cluster), based on the disjoint-set, to accelerate the convergence of merging local clusters. Then, combined with MapReduce, we propose the parallel local cluster merging algorithm MECORE-MR (Merge Core Cluster by using MapReduce) to merge local clusters in parallel and improve the overall parallelization efficiency. Merging local clusters in parallel yields the global clusters more quickly, which improves the efficiency of the density-based clustering algorithm.

The rest of the paper is organized as follows. Section 2 introduces some basic concepts and the background of MapReduce and DBSCAN. Section 3 introduces the detailed design and implementation of DBWGIE-MR. Section 4 presents the experiment settings and results. Section 5 concludes the paper.

2 Preliminary

2.1 MapReduce

MapReduce [22] is a programming model for the parallel processing of large data sets (larger than 1 TB), built around two main ideas: “map” and “reduce”. MapReduce automatically divides the data to be processed into many data blocks and schedules the computing nodes to process the corresponding blocks. This lets programmers run their own programs on distributed systems without writing parallel code themselves.

In MapReduce, data is represented as (key, value) pairs. A MapReduce job consists of three stages: map, shuffle, and reduce. In the map phase, each input pair (k1, v1) generates a list of output pairs (k2, v2). In the shuffle phase, these pairs are partitioned and transferred to the reducers. In the reduce phase, pairs with the same key are grouped as (k2, list(v2)), and the reduce function generates the final list of output pairs (k3, v3) for each group.
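The three stages can be imitated on a single machine to make the data flow concrete. The following is only a minimal sketch, not Hadoop code: `run_mapreduce`, `count_map`, and `count_reduce` are hypothetical names introduced for illustration, and the example job (counting points per grid id) is an assumed toy workload.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map, shuffle, and reduce sequentially on one machine."""
    # Map: each input pair (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle: route pairs by key so that the values of each key
    # end up together as (k2, list(v2)).
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce: each group yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def count_map(point_id, grid_id):
    # Hypothetical mapper: emit one count for the grid that holds the point.
    yield grid_id, 1


def count_reduce(grid_id, ones):
    # Hypothetical reducer: total number of points per grid.
    yield grid_id, sum(ones)


records = [(0, "g1"), (1, "g2"), (2, "g1")]
result = run_mapreduce(records, count_map, count_reduce)
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the sketch only preserves the key-value contract between the stages.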

2.2 DBSCAN

DBSCAN [15] is a typical density-based clustering algorithm. It defines a cluster as the largest set of density-connected points, so it can find all the dense areas of data points and divide them into clusters. It does not need the number of clusters as input, and the process is not affected by noise. DBSCAN has two important parameters: Eps and MinPts. Eps is the neighborhood radius of each object, and MinPts is the minimum number of points in the neighborhood of each core object. In the DBSCAN algorithm, there are three categories of data points:

Definition 1. (Core point) A core point is a point whose Eps-neighborhood contains at least MinPts points.

Definition 2. (Boundary point) A boundary point is a point that lies in the neighborhood of a core point but whose own Eps-neighborhood contains fewer than MinPts points.

Definition 3. (Noise point) A point that is neither a core point nor a boundary point is a noise point.

Generally, the core point corresponds to the point inside the dense area, the boundary point corresponds to the point at the edge of the dense area, and the noise point corresponds to the point in the sparse area.
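The three categories can be illustrated with a brute-force classifier. This is a sketch rather than a DBSCAN implementation: it assumes Euclidean distance, counts a point as a member of its own Eps-neighborhood, and treats a point as core when that neighborhood holds at least MinPts points. The names `euclidean` and `classify_points` are introduced here for illustration.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'boundary', or 'noise' in O(n^2) time."""
    n = len(points)
    # Eps-neighborhood of every point (the point itself is included).
    neighbors = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("boundary")   # not core, but next to a core point
        else:
            labels.append("noise")
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
labels = classify_points(points, eps=1.0, min_pts=3)
```

With these assumed inputs, the isolated point at (10, 10) is the only noise point, and the two points with only two neighbors each become boundary points of the adjacent core points.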

2.3 Uniform data gridding

Traditional grid-based clustering algorithms, such as STING, WaveCluster, and CLIQUE, all use the uniform data gridding method [12] to divide the data space. It can be described as follows: in a d-dimensional space, each dimension is divided into n disjoint intervals of equal size, so the whole data space is divided into n^d equal grids. As shown in Fig. 1, when d = 2 and n = 3, the two-dimensional data space is divided into 9 equal grids.

Fig. 1

Uniform data gridding.
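Uniform gridding amounts to computing, for each point, one interval index per dimension. The sketch below assumes the bounds of the data space are known in advance; `grid_index`, `lower`, and `upper` are names introduced here, not taken from the paper.

```python
def grid_index(point, lower, upper, n):
    """Return the 0-based interval index of `point` in every dimension."""
    index = []
    for x, lo, hi in zip(point, lower, upper):
        width = (hi - lo) / n            # all n intervals have the same size
        i = int((x - lo) / width)
        # Clamp so that a point on the upper bound falls in the last interval.
        index.append(min(max(i, 0), n - 1))
    return tuple(index)


# d = 2, n = 3: the space [0, 3] x [0, 3] is split into 9 equal grids.
cell = grid_index((0.5, 2.9), lower=(0.0, 0.0), upper=(3.0, 3.0), n=3)
edge = grid_index((3.0, 3.0), lower=(0.0, 0.0), upper=(3.0, 3.0), n=3)
```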

2.4 Weighted grid

The weighted grid [10] construction strategy (NE) strengthens the relevance between grids by adding weights between them. The definition is as follows:

Definition 4. (Weighted grid) Given a grid G, let N (G) denote the set of the other grids associated with G, and let W = w (G, N (G)) denote the weights between G and those grids. The weighted grid is defined as WG = (G, N (G), W).

2.5 Information entropy

Information entropy [16], proposed by Shannon in 1948, describes the average uncertainty of a random variable or its probability distribution. For a discrete variable X, it is defined as follows:

Definition 5. (Information entropy) Given a discrete variable X, let x be an element of X and P (x) the probability that x appears in the system event. The information entropy of X is defined as: $H(X) = -\sum_{x \in X} P(x) \times \log_{2} P(x)$ (1)
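Equation (1) can be evaluated directly from observed samples by using relative frequencies as the probabilities P (x). This is a small illustrative sketch; the function name `information_entropy` is introduced here.

```python
import math
from collections import Counter


def information_entropy(samples):
    """Shannon entropy in bits of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


h_uniform = information_entropy(["a", "b", "c", "d"])    # four equally likely values
h_constant = information_entropy(["a", "a", "a", "a"])   # a single certain value
```

The uniform case gives the maximum uncertainty for four outcomes (2 bits), while a variable that always takes the same value has zero entropy, which matches the reading of entropy as a measure of disorder used later for grid density.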

2.6 Disjoint-set

Disjoint-set [1] is a data structure that dynamically maintains several non-overlapping sets and supports merge and query operations. It represents each set as a separate tree: the root node of the tree represents the set, and each leaf node of the tree represents an element in the set. Combining the disjoint dynamic sets X = {x₁, x₂, . . . , x_n} and Y = {y₁, y₂, . . . , y_n} with a disjoint-set involves three operations, summarized as follows:

makeset (X, Y): Creates a new disjoint-set for X and for Y, each of which contains n single-element sets.

find (x): Returns the representative of the set in which element x resides.

unionset (x, y): If the sets in which elements x and y reside are disjoint, merges them.
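The three operations can be sketched as a small class. This is a generic union-find illustration, not the paper's code; the path compression in `find` is a standard optimization that the text does not describe, and the class and method names are chosen here.

```python
class DisjointSet:
    def __init__(self, elements):
        # makeset: every element starts as the root of its own one-node tree.
        self.parent = {x: x for x in elements}

    def find(self, x):
        # Walk up to the root, which represents the whole set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # unionset: attach the root of y's tree under the root of x's tree.
        root_x, root_y = self.find(x), self.find(y)
        if root_x != root_y:
            self.parent[root_y] = root_x
        return root_x


ds = DisjointSet("abcde")
ds.union("a", "b")
ds.union("c", "d")
ds.union("b", "d")      # merges {a, b} with {c, d}
```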

3 Algorithm DBWGIE-MR

DBWGIE-MR consists of three stages: data partitioning, local clustering, and global merging. In stage 1, the grid and the ADG strategy are adopted to adaptively divide the whole dataset into smaller partitions according to spatial proximity. After data partitioning, the data is divided into local and border areas, which provides the conditions for the later cluster-merging procedure. In stage 2, each partition is clustered independently: the NE strategy, the WGIE strategy, and the COMCORE-MR algorithm are adopted to get local clusters. This stage dominates the computation time of the whole process, and its performance is decided by the slowest local clustering task. In the last stage, the MECORE and MECORE-MR algorithms are adopted to perform global merging. An overview of the DBWGIE-MR algorithm is shown in Fig. 2.

Fig. 2

An overview of DBWGIE-MR algorithm.

3.1 Data partitioning

The purpose of data partitioning is to divide a complete large data set into small pieces that can be processed independently. Most existing partition methods, such as grid-based, KD-tree-based, and binary-tree-based methods, focus on dividing the data evenly so that the number of points in each partition is as uniform as possible. However, they do not work well because the initial side length of the grid is difficult to determine and the density of the grid data is uneven. To address these issues, we use the ADG strategy to divide the data into grids adaptively. The principle of the ADG strategy is as follows:

First, the d-dimensional data space is divided equally into 2^d initial grids. Then we use the minimum average distance between data points in the current grid and the number of data points in the current grid to calculate the division threshold φ for the edge length of the grid. We keep dividing each non-empty grid equally, and stop dividing a grid once its edge length is longer than φ. The threshold φ is defined as follows:

Definition 6. (threshold φ) Let p_i, p_j (1 ⩽ i, j ⩽ η) be any two points in the current grid, ∥p_i - p_j∥ the distance between p_i and p_j, and η the number of points in the current grid. The threshold φ is defined as: $\varphi = \eta \times \min_{1 \leqslant j \leqslant \eta} \left( \frac{1}{\eta} \sum_{i = 1, i \neq j}^{\eta} \| p_{i} - p_{j} \| \right)$ (2)

Proof. Let the edge length of the current non-empty grid be L. Let φ = L, i.e., $\eta \times \min_{1 \leqslant j \leqslant \eta} \left( \frac{1}{\eta} \sum_{i = 1, i \neq j}^{\eta} \| p_{i} - p_{j} \| \right) = L$, and denote the values at this point by η = η₀ and $\min_{1 \leqslant j \leqslant \eta} \left( \frac{1}{\eta} \sum_{i = 1, i \neq j}^{\eta} \| p_{i} - p_{j} \| \right) = m_{0}$.

When φ ⩾ L, i.e., $\eta \times \min_{1 \leqslant j \leqslant \eta} \left( \frac{1}{\eta} \sum_{i = 1, i \neq j}^{\eta} \| p_{i} - p_{j} \| \right) \geqslant L$, we get η ⩾ η₀ or $\min_{1 \leqslant j \leqslant \eta} \left( \frac{1}{\eta} \sum_{i = 1, i \neq j}^{\eta} \| p_{i} - p_{j} \| \right) \geqslant m_{0}$. This means that there are too many data points in the current grid or that the data distribution is too sparse, so the grid needs to be subdivided.

When φ < L, i.e., $\eta \times \min_{1 \leqslant j \leqslant \eta} \left( \frac{1}{\eta} \sum_{i = 1, i \neq j}^{\eta} \| p_{i} - p_{j} \| \right) < L$, we get η < η₀ or $\min_{1 \leqslant j \leqslant \eta} \left( \frac{1}{\eta} \sum_{i = 1, i \neq j}^{\eta} \| p_{i} - p_{j} \| \right) < m_{0}$. This means that there are not too many data points in the current grid or that the data distribution is dense, so we can stop dividing the grid.

As shown in Fig. 3, the density of the grids divided by the ADG strategy is very uniform, which helps balance the load across MapReduce nodes and improves the stability of clustering.

Fig. 3

Grid division by using ADG strategy.
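The threshold of Definition 6 and the stopping rule from the proof can be sketched as follows. This is an illustrative reading, assuming Euclidean distance and a brute-force O(η²) evaluation; `euclidean`, `division_threshold`, and `should_subdivide` are hypothetical helper names, and the handling of an empty grid is an added guard.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def division_threshold(points):
    """phi = eta * min_j( (1/eta) * sum_{i != j} ||p_i - p_j|| ), Eq. (2)."""
    eta = len(points)
    if eta == 0:
        return 0.0                       # guard for empty grids (assumption)
    mean_dists = [
        sum(euclidean(p_i, p_j) for i, p_i in enumerate(points) if i != j) / eta
        for j, p_j in enumerate(points)
    ]
    return eta * min(mean_dists)


def should_subdivide(points, edge_length):
    """Rule from the proof: subdivide when phi >= L, stop when phi < L."""
    return division_threshold(points) >= edge_length


grid_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
phi = division_threshold(grid_points)
```

For the three assumed points, the smallest summed distance belongs to the origin (1 + 1 = 2), so φ = 2; a grid of edge length 4 would therefore be left as is, while a grid of edge length 1.5 would be split again.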

3.2 Local clustering

The local clustering stage performs the clustering algorithm for each data partition separately and saves the local clusters as intermediate results. In this phase, any data slices split in the last step of the first phase are distributed to the executor through the task scheduler for local DBSCAN cluster calculations. The process is divided into three parts: building the weighted grid, calculating the grid density, and obtaining the local clusters.

3.2.1 Building weighted grid

After the data is gridded in stage 1, the NE strategy is adopted to build the weighted grid. Two parameters must be determined beforehand: first the scope of the weighted grid, and then the weight of the weighted grid.

Based on the neighbor grid of the Grid Object [20] and the principle of grid boundary expansion [5], respectively, these parameters are defined as follows:

Definition 7. (scope of the weighted grid) Given a grid G_{s₁,...,s_d}, the scope of the weighted grid of G_{s₁,...,s_d} is defined as: $N(G_{s_{1}, \ldots, s_{d}}) = \{ G_{s_{1}^{'}, \ldots, s_{d}^{'}} \mid \forall i \ \text{s.t.} \ 1 \leqslant i \leqslant d, \ | s_{i}^{'} - s_{i} | \leqslant 1 \}$ (3)

Where N (G_{s₁,...,s_d}) is the set of grids within the scope of the weighted grid, G_{s₁,...,s_d} is a grid, $G_{s_{1}^{'}, \ldots, s_{d}^{'}}$ is any grid in the weighted grid of G_{s₁,...,s_d}, and s_i and $s_{i}^{'}$ are the grid indices along the i-th dimension.

Definition 8. (weight of the weighted grid) Given a point p in G_{s₁,...,s_d}, if $p \in G_{s_{1}^{'}, \ldots, s_{d}^{'}}$ and $G_{s_{1}^{'}, \ldots, s_{d}^{'}} \in N(G_{s_{1}, \ldots, s_{d}})$, set $W_{s_{1}, \ldots, s_{d}}^{s_{1}^{'}, \ldots, s_{d}^{'}} = 1$; otherwise, set it to 0. Here $G_{s_{1}^{'}, \ldots, s_{d}^{'}}$ is any grid in the weighted grid of G_{s₁,...,s_d}, and $W_{s_{1}, \ldots, s_{d}}^{s_{1}^{'}, \ldots, s_{d}^{'}}$ is the weight of $G_{s_{1}^{'}, \ldots, s_{d}^{'}}$ relative to G_{s₁,...,s_d}.

When $p \in G_{s_{1}^{'}, \ldots, s_{d}^{'}}$ and $G_{s_{1}^{'}, \ldots, s_{d}^{'}} \in N(G_{s_{1}, \ldots, s_{d}})$, p is on the boundary of two grid objects, so the data in these grid objects must be relevant and we set $W_{s_{1}, \ldots, s_{d}}^{s_{1}^{'}, \ldots, s_{d}^{'}} = 1$. Otherwise we set $W_{s_{1}, \ldots, s_{d}}^{s_{1}^{'}, \ldots, s_{d}^{'}} = 0$, i.e., there is no relevance between the grid objects.

The following example, based on Definitions 7 and 8, demonstrates the procedure of building the weighted grid. In Fig. 4, the data is two-dimensional and the data grid contains 16 grids. By Definition 7, the scope of the weighted grid object is formed within the range $| s_{i}^{'} - s_{i} | \leqslant 1$, i.e., there are 9 grids in the range $G_{s_{1} \pm 1, s_{2} \pm 1}$. Then $W_{s_{1}, s_{2}}^{s_{1} - 1, s_{2}} = W_{s_{1}, s_{2}}^{s_{1}, s_{2} + 1} = W_{s_{1}, s_{2}}^{s_{1} + 1, s_{2}} = 1$ is calculated by Definition 8. Accordingly, the weighted grid of $G_{s_{1}, s_{2}}$ can be denoted by $WG = (G_{s_{1}, s_{2}}, G_{s_{1} - 1, s_{2}}, G_{s_{1}, s_{2} + 1}, G_{s_{1} + 1, s_{2}}, W_{s_{1}, s_{2}}^{s_{1} - 1, s_{2}} = W_{s_{1}, s_{2}}^{s_{1}, s_{2} + 1} = W_{s_{1}, s_{2}}^{s_{1} + 1, s_{2}} = 1)$.

Fig. 4

Building the weighted grid.
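The scope in Definition 7 can be enumerated with a Cartesian product over the per-dimension index ranges. The sketch below assumes 0-based grid indices, clips the scope at the border of the grid, and includes the center grid itself, following the example above that counts 9 grids for an interior cell. `weighted_grid_scope` and `shape` are names introduced here.

```python
from itertools import product


def weighted_grid_scope(index, shape):
    """Grid indices s' with |s'_i - s_i| <= 1 in every dimension."""
    ranges = [
        range(max(s - 1, 0), min(s + 1, n - 1) + 1)   # clip at the grid border
        for s, n in zip(index, shape)
    ]
    return list(product(*ranges))


# A 4 x 4 data grid (16 grids), as in the example of Fig. 4.
inner = weighted_grid_scope((1, 1), shape=(4, 4))
corner = weighted_grid_scope((0, 0), shape=(4, 4))
```

An interior grid has 9 grids in its scope, while a corner grid has only 4 because the neighbors outside the data space do not exist.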

3.2.2 Calculate grid density

Because the weighted grid introduces a certain correlation between grid objects, it is unreasonable to use only the number of data points in a grid to calculate its density. Therefore, in this section we propose the WGIE (weighted grid and information entropy) strategy to calculate the density of grids, and we also redefine the ɛ-neighborhood and core object of the density clustering algorithm. The grid density calculated by the WGIE strategy is defined as follows:

Definition 9. (grid density calculated by WGIE) Let t be the density of a non-empty grid after data gridding, i.e., the number of all data points in the weighted grid centered on that grid. The grid density calculated by the WGIE strategy is defined as follows: $H^{'}(X) = -\sum_{i = 1}^{x} P(t) \times \log_{2} P(t)$ (4) $P(t) = P(\mathrm{density}(g) = t) = \frac{\mathrm{count}(t)}{\mathrm{count}(n)}$ (5)

Where x is the number of grids with this density; P (t) is the probability that the grid density is t; count (t) is the number of grids whose density is t; count (n) is the total number of non-empty grids after partition.

Proof. There are three properties to check. (1) (Monotonicity) For ∀t₁, t₂ with t₁ - t₂ > 0, when P (t₁) - P (t₂) > 0, H′ (P (t₁)) - H′ (P (t₂)) < 0. (2) (Nonnegativity) Because 0 < P (t) < 1, $\log_{2} P(t) < 0$, so $0 < -\sum_{i = 1}^{x} P(t) \times \log_{2} P(t)$, that is, H′ (X) > 0. (3) (Additivity) For ∀t₁, t₂ ∈ t, H′ (P (t)) = H′ (P (t₁, t₂)) = H′ (P (t₁) · P (t₂)) = H′ (P (t₁)) + H′ (P (t₂)).

It can be seen from the above that the formula meets the basic conditions of the definition of information entropy, which can be used to evaluate the stability of the system.

According to the scope of the weighted grid and the information entropy, we redefine the ɛ-neighborhood and the core object. Because the core object is closely related to the density of grids, the weighted grid and information entropy strategy can accurately compute the density value of grid objects. When the density H′ (X) of a grid is smaller than the given density threshold μ, the data in the weighted grid centered on that grid is ordered, so the grid is a good center and has a large probability of becoming a core object. The ɛ-neighborhood and core object are defined as follows:

Definition 10. (ɛ-neighborhood) For a grid object g_i, when we build a weighted grid centered on it, all grid objects within the scope of the weighted grid are neighbors of g_i.

Definition 11. (Core objects) For a grid object g_i, if its density satisfies H′ (X) ⩽ μ (the information entropy of the weighted grid is no larger than the given threshold), then g_i is viewed as a core grid object.
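One literal reading of Eqs. (4) and (5) and of Definition 11 can be sketched as follows. It is an interpretation, not the authors' implementation: it assumes that the sum in Eq. (4) runs over the x grids that share the density t, so that it consists of x identical terms, and it takes the weighted-grid point counts t as given. The names `wgie_density` and `core_grids` are introduced here.

```python
import math
from collections import Counter


def wgie_density(weighted_counts):
    """Map every non-empty grid to H' under a literal reading of Eqs. (4)-(5).

    weighted_counts[g] is t, the number of points in the weighted grid
    centered on grid g.
    """
    total = len(weighted_counts)                       # count(n)
    grids_per_t = Counter(weighted_counts.values())    # count(t) for each t
    density = {}
    for g, t in weighted_counts.items():
        x = grids_per_t[t]               # number of grids with this density
        p = x / total                    # P(t), Eq. (5)
        density[g] = -x * p * math.log2(p)   # x identical terms, Eq. (4)
    return density


def core_grids(density, mu):
    """Definition 11: a grid is a core grid object when H' <= mu."""
    return {g for g, h in density.items() if h <= mu}


counts = {"a": 3, "b": 3, "c": 5, "d": 7}   # assumed toy values of t
density = wgie_density(counts)
cores = core_grids(density, mu=0.5)
```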

3.2.3 Get local clustering

The efficiency of local clustering also determines the performance of the algorithm. We therefore design the COMCORE-MR algorithm to compute local clusters in parallel. It contains two steps: parallel computation of the grid density and parallel computation of the local clusters.

Step 1: (Parallel computing grid density) First, input the grid object g and the points p_i in the grid. Execute the map function to calculate C_i [g], i.e., the number of points in the weighted grid centered on the grid object g, and output the key-value pair <g, C_i [g]>. Then use the reduce function to merge the results of the map function, and use the WGIE strategy to calculate the grid density h_i of each grid object. Finally, output the key-value pair <(g, N (g_i)), h_i> to the next stage. The whole process is shown in Fig. 5(a).

Step 2: (Parallel computing local cluster) This step takes as input the points p_i in dataset D and the key-value pairs <(g, N (g_i)), h_i> from the previous step, and the map function handles the two kinds of input differently. If the input is p_i, the map function calculates the grid object g corresponding to each data point and outputs the key-value pair <g, p_i>. If the input is a key-value pair <(g, N (g_i)), h_i>, the map function determines whether the current grid object g is a core grid based on Definition 11: if h_i ⩽ μ, then g is a core grid and the key-value pair <g, N (g_i)> is output; otherwise, nothing is output. Finally, the reduce function merges the results and outputs the key-value pair <(g, N (g_i)), N (p_i)>. The final result is the sequence set of core clusters, i.e., the local clusters that we need. The whole process is shown in Fig. 5(b).

Fig. 5

The calculation process of COMCORE-MR algorithm.
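The counting part of Step 1 can be imitated sequentially. The decomposition below is an assumption made for illustration, not the paper's code: each point contributes one count to every grid whose weighted grid contains the point's cell, and the reducer sums the counts to obtain C[g]. The names `neighbors`, `map_point`, and `reduce_counts` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def neighbors(cell):
    """All cells s' with |s'_i - s_i| <= 1, including the cell itself."""
    return product(*[(s - 1, s, s + 1) for s in cell])


def map_point(cell_of_point):
    # A point in cell c lies in the weighted grid of every neighbor of c.
    for g in neighbors(cell_of_point):
        yield g, 1


def reduce_counts(pairs):
    # Sum the partial counts per grid; cells outside the data space or
    # without points of their own would be discarded afterwards.
    counts = defaultdict(int)
    for g, one in pairs:
        counts[g] += one
    return dict(counts)


point_cells = [(0, 0), (0, 0), (1, 1), (3, 3)]   # grid cell of each point
pairs = [kv for c in point_cells for kv in map_point(c)]
C = reduce_counts(pairs)
```

The resulting counts are the values t that the WGIE strategy then turns into grid densities h_i.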

3.3 Global merging

In this phase, the generated local clusters are renamed so that each cluster has only one global cluster number, which produces the final clustering results. This last step of clustering merges local clusters into global clusters and is divided into two parts: merging local clusters and merging local clusters in parallel. In the first part, we introduce the method of local cluster merging, and in the second part, we propose new strategies based on MapReduce for the parallel merging of local clusters. We explain them in detail below.

3.3.1 Merging local clusters

According to the merging method of two disjoint sets, MECORE includes three methods of merging different grid objects based on disjoint-set: Make-set, Find, Union-set.

Make-set: treat each grid object as a single leaf node.

Find: connect the grid object nodes in the same local cluster and return a tree represented by its root node. The core grid object of the cluster is the root node of the tree, while the other grid objects in the local cluster are leaf nodes connected to the root node.

Union-set: combine two different local clusters by looking for common leaf nodes, and convert the root node of one tree into a leaf node of the other tree.

Then use these three methods to merge the local clusters. The overall steps of the MECORE algorithm are as follows:

Step 1: Draw all local cluster objects into a table R, including their core grid objects g and the neighborhood N (g) of the core grid. Take table R as the input of the algorithm.

Step 2: Initialize each non-empty grid object in table R as a separate cluster, and each grid object’s state in table R is initialized as unvisited.

Step 3: While the algorithm is executing, the state of each grid object is one of unvisited, border, and core.

Step 4: Retrieve the key-value pair <g, N (g_i)> of each core grid object and change its state from unvisited to core.

Step 5: After that, set the state of each grid object g_i in N (g). This process is divided into the following cases:

Case 1 (border): the current grid object g_i has already been assigned to another cluster, so the state of g_i remains unchanged.

Case 2 (core): merge the local cluster with g_i as the core into the local cluster of g.

Case 3 (unvisited): add g_i to the local cluster with g as the core, and change the state of g_i to border.

After the algorithm has executed, we can get the global clustering according to the corresponding data points and grid IDs, and the data points in grid objects still marked as unvisited are outliers. The merging of local clusters is shown in Algorithm 1.

Algorithm 1 Merging of local clusters

Input: grid object set G after data partition, table R composed of local clusters.

Output: global clusters

Function Merge (G, R)
    For each g ∈ G
        g.state = unvisited
        Make-set (g)
    end for
    For each <g, N (g)> ∈ R
        g.state = core
        For each g_i ∈ N (g)
            If g_i.state = core
                Union-set (g, g_i)
            Else if g_i.state = unvisited
                g_i.state = border
                Union-set (g, g_i)
            end if
        end for
    end for
    For each g ∈ G
        If g.state = unvisited
            g.state = Outlier
        Else return (g)
        end if
    end for
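The merging procedure can be written as a short runnable sketch. This is one possible rendering under stated assumptions, not the authors' implementation: grid objects are plain hashable ids, every core grid in R also appears in G, neighbors that are not in G (empty grids) are skipped, and the function returns the clusters and outliers instead of relabeling data points. The name `mecore` and the path-halving step are choices made here.

```python
def mecore(grids, table_r):
    """Merge local clusters with a disjoint-set.

    grids   : list of non-empty grid ids (G)
    table_r : list of (core grid g, neighborhood N(g)) pairs (R)
    Returns (clusters, outliers).
    """
    parent = {g: g for g in grids}              # Make-set
    state = {g: "unvisited" for g in grids}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]       # path halving keeps trees shallow
            g = parent[g]
        return g

    def union(a, b):                            # Union-set
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for g, neighborhood in table_r:
        state[g] = "core"
        for g_i in neighborhood:
            if g_i not in state:                # empty grid: skipped (assumption)
                continue
            if state[g_i] == "core":            # Case 2: merge the two clusters
                union(g, g_i)
            elif state[g_i] == "unvisited":     # Case 3: absorb as border grid
                state[g_i] = "border"
                union(g, g_i)
            # Case 1 (border): already assigned elsewhere, left unchanged

    clusters, outliers = {}, []
    for g in grids:
        if state[g] == "unvisited":
            outliers.append(g)
        else:
            clusters.setdefault(find(g), []).append(g)
    return list(clusters.values()), outliers


grids = ["a", "b", "c", "d", "e", "f"]
table_r = [("a", ["a", "b"]), ("b", ["a", "b", "c"]), ("e", ["e"])]
clusters, outliers = mecore(grids, table_r)
```

In this toy input, the core grids a and b are neighbors and c is reachable from b, so they form one cluster; e is a cluster on its own, and d and f are never visited and end up as outliers.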

3.3.2 Parallel merging local clusters

The local cluster merging algorithm based on the disjoint-set can merge the local clusters to get global clusters. However, existing parallel density-based clustering algorithms do not merge local clusters in parallel. For that reason, MECORE-MR, based on MapReduce, is proposed. The steps are as follows:

Step 1: Take the grid object set G, data set D, and table R as the input of the algorithm. The data in table R is the local cluster data calculated by the COMCORE-MR algorithm.

Step 2: Randomly divide the grid object set G into k parts (G₁, G₂, . . . , G_k) of similar size, where k is the number of parallel nodes.

Step 3: Divide R into k parts (R₁, R₂, . . . , R_k) accordingly.

Step 4: Execute the map function. If the input is a data point p_i ∈ D, use the map function to calculate the grid object g corresponding to each data point and output the key-value pair <g, p_i>.

Step 5: Otherwise, if the input is local cluster data from table R, retrieve the key-value pair <g, N (g_i)> of the core grid object in the local cluster. Look up g in G₁, G₂, . . . , G_k to get the index of the part that contains it, assign the key-value pair to the corresponding part of R, and output the key-value pair <M_i, (g, N (g_i))> to the Reduce function.

Step 6: Execute the Reduce function: for each M_i, execute the MECORE algorithm in parallel to get k merging results.

Step 7: Merge the k results again using the same algorithm, then combine them with <g, p_i> to obtain the global clusters. The parallel merging of local clusters is shown in Algorithm 2.

Algorithm 2 Parallel merge of local clusters

Input: grid object set G after data partition, table R composed of local clusters, dataset D

Output: global cluster

Function MECORE-MR (G, R, D)
    k = Count (Machine)
    G₁, G₂, . . . , G_k = Partial (G, k)
    Result = RunMapReduce (G, R, D, G₁, G₂, . . . , G_k)
        For each p_i ∈ D
            MapReduce.Map (p_i, G)
                g = Grid (p_i, G)
                emit (<g, p_i>)
            end map
        end for
        For each <g, N (g_i)> ∈ R
            MapReduce.Map (g, G₁, G₂, . . . , G_k)
                i = Partial_Id (g, G₁, G₂, . . . , G_k)
                R₁, R₂, . . . , R_k = Partial (R, i)
                M_i = R_i
                emit (<M_i, (g, N (g_i))>)
            end map
        end for
        For each M_i ∈ <M_i, (g, N (g_i))>
            MapReduce.Reduce (G_i, (g, N (g_i)))
                M_i = MECORE (G_i, (g, N (g_i)))
            end Reduce
            M = MECORE (M_i, M_{i+1})
        end for
        Result = Point (M, <g, p_i>)
        Return (Result)
    end RunMapReduce

3.4 Procedures of DBWGIE-MR

The specific implementation steps of the DBWGIE-MR algorithm are shown in Algorithm 3.

Algorithm 3 DBWGIE-MR

Input: point set D of data space, the dimension d of data space

Output: global cluster

Initialization parameters:
    Density threshold μ of grid objects
    The data space is divided into 2^d initial grids
For each p_i ∈ D
    Execute ADG strategy to get the divided data grid-set G
end for
For each g_i ∈ G
    Execute NE strategy to build the weighted grid, and get the weighted grid of each grid object WG (g_i, N (g_i))
end for
For each g ∈ <(g_i, N (g_i)), h_i>
    Call COMCORE-MR algorithm and output key-value sequence <(g_i, N (g_i)), N (p_i)>
    Make the key-value sequence <(g_i, N (g_i)), N (p_i)> into a table R of the local cluster
    Return (R)
end for
Call parallel merging local clustering algorithm MECORE-MR (G, R, D) to get the global cluster Result
Return (Result)
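The initialization step of Algorithm 3 divides the data space into 2^d grids. One simple way to realize this, assumed here purely for illustration, is to bisect every dimension at its midpoint and record the side of each split as one bit of the grid index. The function names and the `lower`/`upper` bounds arguments are hypothetical, and the ADG refinement that follows this step is not shown.

```python
def initial_grid_index(point, lower, upper):
    """Return the index (0 .. 2**d - 1) of the initial grid containing point.

    Assumption: dimension j is split at the midpoint of
    [lower[j], upper[j]]; bit j of the index is 1 when the point lies
    in the upper half of dimension j.
    """
    index = 0
    for j, (x, lo, hi) in enumerate(zip(point, lower, upper)):
        mid = (lo + hi) / 2.0
        if x >= mid:
            index |= 1 << j
    return index


def assign_initial_grids(points, lower, upper):
    """Group points by their initial grid index."""
    grids = {}
    for p in points:
        grids.setdefault(initial_grid_index(p, lower, upper), []).append(p)
    return grids
```

For a d-dimensional point the index has d bits, so at most 2^d distinct grids can appear, and only non-empty grids are stored.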

3.5 Time complexity of the algorithm

Time complexity measures the performance of the algorithm. The time complexity of DBWGIE-MR depends on the gridding of data, the construction of the weighted grid, the parallel calculation of grid density, the parallel calculation of local clusters, and the parallel merging of local clusters. The time complexities of these steps are as follows: 1) assuming the number of data points in the space is n, the time complexity of gridding the data with the ADG strategy is O(n²); 2) assuming the number of non-empty data grids is k, the time complexity of building the weighted grids with the NE strategy is O(k²); 3) under MapReduce, assuming that m distributed machines execute the functions, the time complexity of computing the local clusters in parallel with the COMCORE-MR algorithm is O((n + k)/m); 4) the time complexity of merging the local clusters in parallel with the MECORE-MR algorithm is O((n + k³ + m²)/m).

In summary, the time complexity of the DBWGIE-MR algorithm is O((n + k³ + m²)/m + n²). Because k < n and m < n, this is approximately O(n³/m + n²).
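The simplification can be written out step by step. The derivation below is a sketch that only uses the four per-step costs listed above together with k < n and m < n (and k ≥ 1); it treats the final expression as an upper bound.

$$
\begin{aligned}
T &= O(n^{2}) + O(k^{2}) + O\!\left(\frac{n + k}{m}\right) + O\!\left(\frac{n + k^{3} + m^{2}}{m}\right) \\
  &= O\!\left(\frac{n + k^{3} + m^{2}}{m} + n^{2}\right) && \text{since } k^{2} < n^{2} \text{ and } n + k \le n + k^{3} + m^{2} \\
  &= O\!\left(\frac{n^{3}}{m} + n^{2}\right) && \text{since } k^{3} < n^{3} \text{ and } m^{2} < n^{2}
\end{aligned}
$$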

4 Evaluation

4.1 Experiment settings

We ran the experiments on one master machine and three slave machines. Each machine is equipped with a single quad-core Intel Core i5-9400H CPU @ 2.9 GHz, 16 GB of DRAM, and a 1 TB SATA3 7200 RPM hard disk. The operating system is Ubuntu Linux 16.04, the programming environment is Python 3.5.2, and the MapReduce platform is Apache Hadoop 3.2.

4.2 Data sources

The experimental data for the DBWGIE-MR algorithm are four real datasets from the UCI [23] public database: Iris, Uscd1990, Susy and Hepmass. Iris is one of the best-known datasets for pattern recognition, with 4 attributes and 150 data points; Uscd1990 is the 1990 United States census dataset, with 68 attributes and 2458285 records; Susy records detection data for supersymmetric particles, with 18 attributes and 5000000 records; Hepmass records the signatures of exotic particles, with 28 attributes and 10500000 data points. The details of the datasets are shown in Table 1.

Table 1
Details of the datasets

Datasets	Attributes	Records	Sizes (MB)
Iris	4	150	0.86
Uscd1990	68	2458285	345
Susy	18	5000000	880.5
Hepmass	28	10500000	2478.4

4.3 Evaluating indicator

4.3.1 F-measure

Parameter setting and an appropriate evaluating indicator are particularly important. To quantitatively appraise the approach’s outputs, we use the F-measure to evaluate the results of the clustering algorithm; it is the weighted harmonic mean of precision and recall.

The F-measure is defined in Equation (6): $F\text{-measure} = \frac{(\lambda^{2} + 1)\,\text{precision} \times \text{recall}}{\lambda^{2}\,\text{precision} + \text{recall}}$ (6)

Generally, λ is set to 1. The F-measure takes both the precision and the recall of the clustering results into account, so it evaluates the results of a clustering algorithm more comprehensively than either measure alone. The higher the F-measure, the more accurate and reasonable the results.
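Equation (6) translates directly into a few lines of code. The sketch below is only an illustration; the function name `f_measure`, the parameter name `lam` for λ, and the convention of returning 0 when the denominator is zero are assumptions, not part of the original method.

```python
def f_measure(precision, recall, lam=1.0):
    """F-measure of Equation (6); lam is the weight λ (λ = 1 gives F1)."""
    denominator = lam ** 2 * precision + recall
    if denominator == 0:
        # Convention assumed here: define the score as 0 when both
        # precision and recall are 0.
        return 0.0
    return (lam ** 2 + 1) * precision * recall / denominator
```

With λ = 1 the expression reduces to 2 · precision · recall / (precision + recall), the usual F1 score.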

4.3.2 ANOVA (Analysis of Variance)

Analysis of variance can be used to determine whether the observed data or processing results of several groups differ significantly. We use the F-statistic (F) to evaluate the improvement of our algorithm. The F-statistic is the ratio of the mean square between groups (MSB) to the mean square error (MSE), with k - 1 and N - k degrees of freedom, respectively, where k is the number of groups and N is the total number of observations. The F-statistic is defined in Equation (7): $F = \frac{\mathrm{MSB}}{\mathrm{MSE}} \sim F(k - 1, N - k)$ (7)
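As an illustration of Equation (7), the following sketch computes the one-way ANOVA F-statistic from raw observations in plain Python. It assumes one list of observations per group (for example, one list of accuracies per algorithm); the function name `anova_f` is hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic, F = MSB / MSE, as in Equation (7).

    groups: list of lists of observations, one list per group.
    Returns (F, k - 1, N - k): the statistic and its degrees of freedom.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares, weighted by group size.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group (error) sum of squares.
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msb = ssb / (k - 1)
    mse = sse / (n_total - k)
    return msb / mse, k - 1, n_total - k
```

The resulting F value is then compared with the critical value of the F(k - 1, N - k) distribution at the chosen significance level.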

4.4 Parameter selection

As mentioned in the previous section, several parameters must be set to apply the DBWGIE-MR algorithm. Because these parameters depend on the features of the dataset, the accuracy of the resulting clustering depends directly on the user’s choice of parameters. For this reason, we carried out an empirical study to select the parameter value that gives the best clustering results. In this study, on the Iris dataset and with the other parameters unchanged, the density threshold μ was assigned values in the range [1, 12]. For each value, the algorithm was run independently 10 times and the mean F-measure of the 10 runs was used for analysis. Fig. 6 shows that a density threshold μ that is too large or too small reduces the accuracy of the clustering results. The experimental results show that, among the tested values, the most accurate clustering results are obtained when μ is set to 4.

Fig. 6

F-measure.
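The selection procedure for μ described above (sweep the candidate values, repeat each run 10 times, average the F-measure) can be sketched as follows. The callback `evaluate` stands in for one clustering run and is hypothetical; `toy_evaluate` is a made-up function constructed to peak at μ = 4 so that the example runs, and it does not reproduce the experimental results.

```python
def select_threshold(evaluate, candidates, runs=10):
    """Return (best_mu, mean_scores) for a sweep over candidate thresholds.

    evaluate(mu, run) is a hypothetical callback that performs one
    clustering run with density threshold mu and returns its F-measure.
    """
    mean_scores = {}
    for mu in candidates:
        scores = [evaluate(mu, run) for run in range(runs)]
        mean_scores[mu] = sum(scores) / len(scores)
    best_mu = max(mean_scores, key=mean_scores.get)
    return best_mu, mean_scores


def toy_evaluate(mu, run):
    # Toy stand-in for a real clustering run: peaks at mu = 4 by construction.
    return 1.0 - 0.01 * (mu - 4) ** 2


best, scores = select_threshold(toy_evaluate, range(1, 13))
```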

4.5 Performance analysis of DBWGIE-MR

The DBWGIE-MR algorithm should be evaluated from different aspects. We conducted 10 experiments on each of the Iris, Uscd1990, Susy and Hepmass datasets. We compared the accuracy and variance of the clustering results with those of the H-DBSCAN, DBSCAN-MR and MR-VDBSCAN algorithms, and also calculated F-statistics to evaluate whether DBWGIE-MR differs significantly from the other algorithms on the different datasets. The means of the 10 experimental results are shown in Table 2; the detailed analysis is given in the following sections.

Table 2
Comparative analysis of clustering results of each algorithm

Dataset	Algorithm	Accuracy/%	Variance	F-statistics
Iris	DBSCAN-MR	97.3	1.731E-7	2.631E+6
Iris	MR-VDBSCAN	97.4	1.657E-7	2.130E+6
Iris	H-DBSCAN	97.8	2.315E-7	4.368E+5
Iris	DBWGIE-MR	98.2	1.348E-7	9.697E+5
Uscd1990	DBSCAN-MR	91.2	1.769E-3	1.081E+3
Uscd1990	MR-VDBSCAN	93.8	1.716E-3	1.491E+2
Uscd1990	H-DBSCAN	92.4	1.992E-3	1.375E+2
Uscd1990	DBWGIE-MR	93.1	1.571E-3	6.976E+2
Susy	DBSCAN-MR	87.6	1.436E-3	6.199E+3
Susy	MR-VDBSCAN	89.2	1.512E-3	2.012E+3
Susy	H-DBSCAN	89.8	1.672E-3	9.979E+2
Susy	DBWGIE-MR	91.4	8.934E-4	1.790E+3
Hepmass	DBSCAN-MR	86.1	1.871E-3	7.094E+3
Hepmass	MR-VDBSCAN	87.8	1.914E-3	2.706E+3
Hepmass	H-DBSCAN	88.4	2.018E-3	1.613E+3
Hepmass	DBWGIE-MR	90.6	9.834E-4	2.037E+3

4.5.1 Analysis of accuracy

Accuracy directly reflects the quality of the clustering results; the higher, the better. We used the F-measure to calculate the accuracy and the mean of 10 experiments to compare the accuracy of DBWGIE-MR with that of the H-DBSCAN, DBSCAN-MR and MR-VDBSCAN algorithms. The experimental results are shown in Fig. 7.

Fig. 7

Accuracy of different algorithms in different datasets.

Fig. 7 shows that the DBWGIE-MR algorithm is more accurate than the other algorithms on most datasets. On the Iris dataset, the accuracy of DBWGIE-MR is 0.4%, 0.9% and 1.2% higher than that of H-DBSCAN, MR-VDBSCAN and DBSCAN-MR, respectively; on the Susy dataset, it is 1.7%, 2.5% and 4.3% higher, respectively; and on the Hepmass dataset, it is 2.4%, 3.2% and 6.3% higher, respectively. This is because DBWGIE-MR uses the ADG strategy to divide the grid reasonably and, in the local clustering stage, uses the NE and WGIE strategies to build a weighted grid for each grid object; clustering on the weighted grid greatly improves accuracy.

H-DBSCAN expands the boundary of the grid on the basis of uniform gridding, which improves the accuracy of its clustering results, but the uniform division makes it less accurate than DBWGIE-MR. DBSCAN-MR uses the PRBP method to divide the data but proposes no new strategy to improve the accuracy of local clustering, so it attains the lowest accuracy of all the algorithms. MR-VDBSCAN uses different density parameters in different partitions after dividing the data with the PRBP algorithm, which improves its ability to identify clusters of different densities and therefore its accuracy. As a result, MR-VDBSCAN is more accurate than DBSCAN-MR. On the Uscd1990 dataset in particular, this ability makes the accuracy of MR-VDBSCAN 1.1%, 1.9% and 2.8% higher than that of DBWGIE-MR, H-DBSCAN and DBSCAN-MR, respectively. Although MR-VDBSCAN can recognize clusters of different densities, DBWGIE-MR performs better on most datasets.

4.5.2 Analysis of variance

The variance of the accuracy shows the stability of the algorithm: the smaller the value, the more stable the algorithm. We recorded the variance of the accuracy of the four algorithms over 10 experiments and took the mean to compare the stability of DBWGIE-MR with that of the H-DBSCAN, DBSCAN-MR and MR-VDBSCAN algorithms. The results are shown in Fig. 8.

Fig. 8

Variance on different datasets.

Fig. 8 shows that the DBWGIE-MR algorithm has a smaller variance than the other algorithms on all four datasets, and that its advantage is largest on complex datasets. On a simple, small dataset such as Iris, the variance of DBWGIE-MR is similar to that of the others. On a large and complex dataset such as Susy, however, the variance of DBWGIE-MR is 47%, 31% and 30% lower than that of H-DBSCAN, MR-VDBSCAN and DBSCAN-MR, respectively. On Hepmass, the variance of DBWGIE-MR is 52%, 45% and 44% lower than that of H-DBSCAN, MR-VDBSCAN and DBSCAN-MR, respectively. This is because a simple, small dataset has little impact on the clustering algorithm, whereas a large dataset affects the performance of the algorithm through its complex structure. DBWGIE-MR uses the ADG strategy to divide the grid and the NE and WGIE strategies to strengthen the relationship between grids during clustering, which makes the algorithm more stable. The stability of DBSCAN-MR is similar to that of MR-VDBSCAN because both use the PRBP algorithm to divide the data, which generates fewer boundary points, but the effect is not as significant as that of DBWGIE-MR. H-DBSCAN expands the boundary of the grid on the basis of uniform gridding, which makes it more accurate than MR-VDBSCAN and DBSCAN-MR on most datasets, but the uniform grid division makes it less stable than DBSCAN-MR and MR-VDBSCAN. The experimental results therefore show that DBWGIE-MR has the best stability across the datasets.
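The stability measure used in this subsection, the variance of accuracy over repeated runs, can be computed with the Python standard library. The sketch below is illustrative only: the accuracies are made-up values for two hypothetical algorithms, and the use of the population variance is an assumption, since the text does not state whether the sum of squares is divided by the number of runs or by one less.

```python
from statistics import mean, pvariance


def stability(accuracies):
    """Return (mean, population variance) of accuracies from repeated runs.

    Assumption: the population variance is used; the sample variance
    (statistics.variance) would divide by one less than the run count.
    """
    return mean(accuracies), pvariance(accuracies)


# Made-up accuracies (as fractions) for two hypothetical algorithms.
steady = [0.90, 0.91, 0.90, 0.91]
erratic = [0.85, 0.95, 0.88, 0.94]

steady_mean, steady_var = stability(steady)
erratic_mean, erratic_var = stability(erratic)
```

Both lists have the same mean accuracy, but the second has a much larger variance, which is the sense in which a smaller variance indicates a more stable algorithm.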

4.5.3 Analysis of F-statistics

We set up the null hypothesis H₀: “the mean clustering results of the four algorithms are equal, namely μ_a = μ_b = μ_c = μ_d”, and then calculated the F-statistics between DBWGIE-MR and the other algorithms on Uscd1990, Susy and Hepmass to evaluate the improvement of the DBWGIE-MR algorithm on larger-scale datasets. The F-statistics are shown in Fig. 9.

Fig. 9

F-statistics.

The “All” entry in Fig. 9 represents the overall difference among all algorithms. Its F-statistic is very large, which indicates that at least one algorithm differs greatly from the others. We therefore compare DBWGIE-MR with H-DBSCAN, MR-VDBSCAN and DBSCAN-MR separately to test whether this difference is caused by DBWGIE-MR. As Fig. 9 shows, on the Susy dataset the F-statistic between H-DBSCAN and DBWGIE-MR reaches 9.979E+2, and that between DBSCAN-MR and DBWGIE-MR reaches 6.199E+3. The same holds on the Hepmass dataset: the F-statistics between DBWGIE-MR and the other algorithms are very large, which rejects the hypothesis H₀. This indicates that DBWGIE-MR brings a distinct improvement in clustering results over the other algorithms on the Susy and Hepmass datasets. On the Uscd1990 dataset, although the improvement of DBWGIE-MR is less obvious, the F-statistics in Table 2 are still much larger than the critical value, so H₀ is rejected as well. The experimental results therefore show that the improvement of the DBWGIE-MR algorithm is statistically significant.

4.6 Performance analysis in big data

4.6.1 Analysis of speed-up ratio

The speed-up ratio is the ratio of the time a task takes on a single-processor system to the time it takes on a parallel system. It is a common index of the performance of a parallel algorithm; the larger, the better. In this experiment, we randomly extracted four subsets from the Hepmass dataset, containing 100000 rows (0.1M), 3000000 rows (3M), 5000000 rows (5M) and 10000000 rows (10M), respectively, and calculated the speed-up ratio of the algorithm with different numbers of nodes on these subsets to measure its computing power on the Hadoop parallelization framework. The experimental results of the DBWGIE-MR algorithm are shown in Fig. 10.

Fig. 10

Speed-up ratio.

As shown in Fig. 10, DBWGIE-MR achieves a large speed-up ratio on large datasets. On the small dataset, however, the speed-up ratio decreases rather than increases as nodes are added. For the 0.1M subset in Fig. 10, the speed-up ratio is 1 with one node and drops to 0.6 with 4 nodes. This is because, when the dataset is far smaller than the amount of data the cluster can process, distributing the data to different computing nodes incurs additional time costs, including cluster running time, task scheduling time and node storage time, which slow the algorithm down, so parallelization brings little benefit in this case. On a larger dataset, the speed-up ratio increases with the number of nodes. For the 3M subset in Fig. 10, the speed-up ratio rises from 1 to 1.2 when the number of nodes increases to 2, and reaches 2.4 with 4 nodes, which shows that when the amount of data is large enough, more nodes make the algorithm more efficient. For the 5M and 10M subsets, the speed-up ratio is 1.4 and 1.5, respectively, with 2 nodes; 2.0 and 2.4 with 3 nodes; and 3.1 and 3.7 with 4 nodes. With the same number of nodes, the larger the dataset, the higher the speed-up ratio. This is because the algorithm gains more from computing and merging local clusters in parallel when processing a larger dataset, so the speed-up ratio grows approximately linearly with the number of computing nodes and the parallel effect of the algorithm improves greatly. It also indicates that the DBWGIE-MR algorithm is suitable for large-scale datasets and that the benefit of parallelization increases with the number of computing nodes.
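The speed-up ratio defined at the start of this subsection is simple to compute from measured running times. The sketch below also includes parallel efficiency (speed-up divided by the number of nodes), which is not used in the experiments above and is added here only as a common companion measure; both function names are hypothetical.

```python
def speedup(t_single, t_parallel):
    """Speed-up ratio: single-node running time divided by parallel running time."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, nodes):
    """Parallel efficiency: speed-up per node (1.0 is ideal linear scaling)."""
    return speedup(t_single, t_parallel) / nodes
```

A speed-up below 1, as reported for the 0.1M subset, means that the parallel run took longer than the single-node run.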

4.6.2 Analysis of running time

The running time is the time taken to obtain the clustering results; the shorter, the better. We recorded the running times of the four algorithms on the four subsets of Hepmass to compare the speed of DBWGIE-MR with that of the H-DBSCAN, DBSCAN-MR and MR-VDBSCAN algorithms. The experimental results are shown in Fig. 11.

Fig. 11

Running time.

As shown in Fig. 11, the DBWGIE-MR algorithm is more efficient in parallel on large datasets: it takes less time to cluster than the other algorithms, and the larger the dataset, the clearer this advantage. With 0.1M data points, the running times of the four algorithms are similar. This is because, on a small dataset, distributing data to different computing nodes incurs additional time costs, including cluster running time, task scheduling time and node storage time; these costs account for a large part of the total clustering time, so the parallel advantage of DBWGIE-MR is not apparent. When the size of the dataset reaches 3M, however, the running time of DBWGIE-MR is 15.3%, 20.8% and 29.4% less than that of DBSCAN-MR, MR-VDBSCAN and H-DBSCAN, respectively. At 5M, it is 19.1%, 25.9% and 35.2% less, respectively, and at 10M it is 23.4%, 29.1% and 37.4% less, respectively. This is because, on large-scale datasets, the number of local clusters generated during clustering increases significantly. H-DBSCAN divides grids uniformly, which reduces the time of data division but generates many boundary points, so merging clusters later takes more time; it also does not merge local clusters in parallel. DBSCAN-MR and MR-VDBSCAN use the PRBP algorithm to divide the data, which produces the fewest boundary points, so they spend less time merging clusters and run faster than H-DBSCAN.

MR-VDBSCAN, however, needs to calculate the density parameters of the different partitions in the local clustering stage, which makes it slower than DBSCAN-MR. Neither of them merges local clusters in parallel, which limits their efficiency. Although DBWGIE-MR takes some time to compute the weighted grid, it uses the COMCORE-MR algorithm to compute the local clusters in parallel and the MECORE-MR algorithm to merge them in parallel, which allows it to complete the clustering in the least time. On large datasets, therefore, the DBWGIE-MR algorithm has the highest parallel efficiency and the shortest clustering time among the compared algorithms.

5 Conclusion

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the well-known algorithms suffer from severe drawbacks when applied to large spatial databases. In this paper, we design a density-based clustering algorithm using the weighted grid and information entropy based on MapReduce, denoted DBWGIE-MR, to address the unreasonable division of data grids, the low accuracy of clustering results and the low parallel efficiency of density-based clustering algorithms for big data. The DBWGIE-MR algorithm consists of three parts: data partitioning, local clustering, and global merging, and we propose several new strategies for each part. For data partitioning, we propose the ADG strategy to divide the grid adaptively. For local clustering, we design the NE strategy, which strengthens the relevance between grids to improve the accuracy of clustering; design the WGIE strategy to calculate the density of the grid; and propose the COMCORE-MR algorithm to compute the core clusters in parallel. For global merging, we propose the MECORE algorithm to speed up the convergence of merging local clusters, and the MECORE-MR algorithm to obtain the clustering results faster. In the experiments, we compared DBWGIE-MR with the other algorithms on four real datasets; the results show that DBWGIE-MR improves significantly on them in clustering accuracy, running speed and stability. DBWGIE-MR also showed a better parallelization effect on large-scale datasets and is more efficient than H-DBSCAN, DBSCAN-MR and MR-VDBSCAN. Although our algorithm performs better than the other algorithms, this paper does not solve the problem of setting the density threshold of the grid, and our algorithm cannot identify clusters with different densities; both issues call for further study.

Acknowledgments

This study was supported by the National Key Research and Development Program of China (2018YFC1504705) and the National Natural Science Foundation of China (41562019).

References

1. Bollobas, Simon I.S., Probabilistic analysis of disjoint set union algorithms, SIAM Journal on Computing 22(5) (1993), 1053–1074.

2. Dai B.R., Lin I.C., Efficient Map/Reduce-Based DBSCAN Algorithm with Optimized Data Partition, In 2012 IEEE 5th International Conference on Cloud Computing (CLOUD) (2012), 59–66.

3. Aljumaily, Laefer D.F., Cuadra, Urban Point Cloud Mining Based on Density Clustering and MapReduce, Journal of Computing in Civil Engineering 31(5) (2017).

4. Behrooz, Kourosh, A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark, Symmetry 10(8) (2018), 342.

5. Fang, Qiang, Ji, et al., Research on the Parallelization of the DBSCAN Clustering Algorithm for Spatial Data Mining Based on the Spark Platform, Remote Sensing 9(12) (2017).

6. Xi, Research on Clustering Algorithm and Its Parallelization Strategy, In International Conference on Computational and Information Sciences (2011), 325–328.

7. Ankerst, Breunig M.M., Kriegel H.P., Sander, OPTICS: Ordering Points To Identify the Clustering Structure, Proceedings of the ACM SIGMOD International Conference on Management of Data 28(2) (1999).

8. Chen M.S., Han J.W., Yu P.S., Data mining: An overview from a database perspective, IEEE Transactions on Knowledge and Data Engineering 8(6) (1996), 866–883.

9. Bhardwaj, Dash S.K., VDMR-DBSCAN: Varied Density MapReduce DBSCAN, International Conference on Big Data Analytics 9498 (2015), 134–150.

10. Guha, Rastogi, Shim, Cure: An efficient clustering algorithm for large databases, Information Systems 26(1) (2001), 35–58.

11. Heidari, Alborzi, Radfar, et al., Big data clustering with varied density based on MapReduce, Journal of Big Data 6(1) (2019).

12. Liu S.F., Meng D.X., Wang X.Y., et al., DBSCAN algorithm based on grid cell, Journal of Jilin University (Engineering and Technology Edition) 44(4) (2014), 1135–1139.

13. Mahran, Mahar, Using Grid for Accelerating Density-Based Clustering, In 2008 IEEE International Conference on Computer and Information Technology (2008), 35–40.

14. Wang, Wang H.J., Qin X.P., et al., Architecting Big Data: Challenges, Studies, and Forecasts, Chinese Journal of Computers 34(10) (2011), 1741–1752.

15. Silva T.L.C.D., Neto A.C.A., Magalhães R.P., et al., Towards an Efficient and Distributed DBSCAN Algorithm Using MapReduce, Enterprise Information Systems 227 (2015).

16. Wang W.Q., Wang, Singh V.P., et al., Evaluation of information transfer and data transfer models of rain-gauge network design based on information entropy, Environment Research 178 (2019), 108686.

17. X.F., Wang Y.G., Ge Y.N., et al., Research and application of DBSCAN algorithm based on Hadoop platform, Pervasive Computing and the Networked World 8351 (2013), 73–87.

18. X.D., Zhu X.Q., Wu G.Q., Ding, Data Mining with Big Data, IEEE Transactions on Knowledge and Data Engineering 26(1) (2013), 97–107.

19. Wang, Wu, Jiang X.H., et al., Incremental Parallelization of Fast Clustering Based on DBSCAN Algorithm under Large-scale Data Set, Computer Applications and Software 4 (2018), 269–275.

20. Kim, Shim, Kim M.S., Lee J.S., DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce, Information Systems 42 (2014), 15–35.

21. Y.W., Zhao J.D., Wang X.D., Wang, Zhang Y.G., Cludoop: An Efficient Distributed Density-Based Clustering for Big Data Using Hadoop, International Journal of Distributed Sensor Networks 2015 (2015), 1–13.

22. Zhang Y.F., Chen S.M., Yu, Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce, IEEE Transactions on Knowledge and Data Engineering (ICDE) (2017), 67–68.

23. https://archive.ics.uci.edu/ml/index.php.
