A GPS location data clustering approach based on a niche genetic algorithm and hybrid K-means

Abstract

GPS location data is rapidly increasing and has become an important part of spatial information technology and its applications. In order to use K-means to discover hidden information behind GPS data, the drawbacks of K-means must be addressed, such as the difficulty of discovering the number of clusters, sensitivity to initial cluster center (seed) selection, and ease of falling into local optima. This paper presents a novel sharing-based niche genetic algorithm (NGA) with a novel initial population approach based on hybrid K-means to obtain the best chromosome which is then used to perform K-means clustering (termed NicheClust). SSE, DB-index, PBM-index, and COSEC are used as fitness functions for NGA. The experimental results demonstrate that NicheClust has high performance and efficiency for three GPS location datasets.

Keywords

GPS location data clustering, niche genetic algorithm, hybrid K-means, cluster evaluation

1. Introduction

Spatial location data applications have transformed the public availability of services, from virtual maps to consumer GPS devices (e.g., GPS-enabled smart phones, GPS-based taxicabs, etc.). Such applications have enriched our lives through location-based services, and have also started to impact the use of geospatial reasoning in sensing and inference across space and time [46, 47]. These changes produce a large quantity of geospatial data (or geo-location data [20]), which has been used in many fields (e.g., shared governance, navigation, smart city, site selection, rescue, and logistics) [47, 52, 53, 63]. Typically, geo-location data from GPS devices has the potential to facilitate the understanding of activity behavior, the discovery of interesting relationships, and the determination of characteristics of urban areas [16, 20, 26, 32, 38, 58, 62]. Furthermore, location data mining in particular has often been used in recent years to explore a city’s population density, land use, social activity, and telecommunications [32, 34, 44, 56]. In most existing GPS location data mining studies [18, 32], GPS location data is usually divided into a set of stops and moves [49], OD (origin-destination) [26], and MO (move object) patterns [35, 58]. Following this, clustering algorithms are used to find clusters [16, 21, 27, 37, 38] and mine POIs (points of interest) quickly [59]. For example, the work in [58] helped to reduce the cruising time and increase profitability for taxicabs using mobility patterns (MO). The work in [16, 27, 37] exhibited high performance in geospatial data clustering using clustering analysis methods. In [32], the proposed CLARANS clustering technique can identify spatial data structures and can handle point objects and polygon objects. In [57], a variance-entropy-based clustering approach was proposed using the location data of GPS-equipped taxicabs, which was developed to estimate the distribution of travel time between two landmarks (locations) in different time slots.

Data clustering analysis is an unsupervised learning task [2, 15, 54, 55], for which K-means is a commonly-used and partitioning-based clustering technique, perhaps due to its simplicity and speed. However, K-means clustering requires a user to determine cluster numbers as input to the algorithm [22, 41]. Moreover, K-means is prone to the generation of poor quality results if the initial cluster centers (seeds) are not properly chosen [10, 41], and it is difficult to attain the global optimal solution [7, 29, 41]. Therefore, in order to overcome the shortcomings of K-means clustering, NicheClust (niche genetic algorithm with hybrid K-means) is presented in this paper which combines an improved NGA (niche genetic algorithm [28, 51]) with hybrid K-means (canopy-based and K-means++) for GPS location data clustering. NicheClust includes an improved canopy (see Definitions 5–6) and K-means++ (see Definition 7) [5, 30] which are used to generate the initial population without requiring a user to input the number of clusters and randomly select initial seeds, and can be used to handle unequal gene numbers. Also, compared to GenClust [41], the advantage of NicheClust is that it is effective and fast without having the impact of many input parameters, and can maintain population diversity and avoid premature convergence through the change in the number of niches. Typically, the best chromosome with NGA optimization is used to obtain cluster numbers and initial seeds; the number of genes of the best chromosome is the number of clusters and the genes of the best chromosome are considered to be the initial seeds.

To verify the performance of NicheClust, this paper includes comparisons with GenClust [41] and GAK [23] on three GPS location datasets, which consist of taxicab trajectory data for Beijing, Shanghai [61] and Nanjing, China, using four fitness functions: SSE, DBI (Davies-Bouldin Index) [13], PBM [33], and COSEC [41]. Three cluster evaluation criteria (separation (SP) [17], stability (S) [17], and silhouette coefficient (SC) [3]) are used to verify the efficiency. The test results indicate that NicheClust achieves better quality clusters than the two existing GenClust and GAK techniques.

Therefore, the main contributions of NicheClust are summarized as follows:

1. The step-based canopy and K-means++ method is used to generate the initial population.
2. The canopy-based niche partition method in the sharing-based NGA is proposed to maintain population diversity.
3. The novel adaptive crossover and mutation methods are presented to avoid premature convergence.
4. The best chromosome is used to perform GPS location data clustering.

The remainder of the paper is organized as follows: Section 2 gives the related works including location data and clustering algorithms. Section 3 gives the necessary definitions for NicheClust. The NicheClust approach is presented in Section 4. Experimental tests are described in Section 5 and the results presented. Section 6 gives conclusions.

2. Related work

Nowadays, spatial location data has transformed how we access, store, visualize, and make use of geographic data (or geo-location data). Imagine our life without GPS-enabled smart phones, GPS-based cars, or emergency operations without interactive, dynamic maps [46]. A number of transformative spatial data technologies have become deeply integrated into computer science through ideas like spatial databases, spatial statistics, and spatial data mining, helping to answer many kinds of questions humans have always asked [47]. For example, as the proliferation of mobile digital information technologies continues to generate massive geospatial datasets through GPS devices (e.g., trajectory data with GPS location information) about our everyday life [8, 44, 53], these data hide many movement and behavior patterns that can be explored to use in our lives and social development.

Many researchers have a definite interest in analyzing spatio-temporal phenomena; they have considered spatial data patterns, including stops and moves [49], origin-destination (OD) [26], and move object (MO) [35], which can be used to study human activities and movements in space and time in urban areas [16, 20, 26, 32, 38, 58, 62], and to extract various location points behind the information/values (e.g., POI-based recommender applications [59]). For example, in [9] a network-based MO is proposed where the user can control the behavior of the generator by redefining the functionality of selected object classes. On the other hand, spatial data mining techniques have been used to discover interesting relationships and characteristics that may exist implicitly in datasets [32]. In order to obtain these results by considering different aspects of mobility relationships and characteristics, a great quantity of clustering algorithms (e.g., K-means and other algorithms incorporating K-means) have also been presented to mine location datasets and are often required to generate results suitable for users. For example, in [32], the proposed CLARANS clustering technique can identify spatial data structures and can handle point objects and polygon objects. In [57] a variance-entropy-based clustering approach was proposed that was used to handle the location data of GPS-equipped taxicabs, in order to estimate the travel time distribution between two landmarks (locations) in different time slots. However, none of these clustering algorithms are suitable for all types of data, clusters, and applications, nor are all algorithms appropriate for all problems of clusters [2].

In fact, clustering is an important data analysis approach and data mining technique, which is deemed one of the most difficult and challenging problems in unsupervised machine learning [55]. There are many clustering algorithms, which have been roughly categorized into six classes [18]: partition-based (e.g., K-means and variants), density-based (e.g., DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points to Identify the Clustering Structure)), model-based (e.g., Gaussian model and regression model), hierarchical-based (e.g., LINK, CURE, Chameleon), graph-based (e.g., spectral clustering), and grid-based (e.g., flexible grid-clustering [4], and grid-growing clustering [61]).

Of these, K-means is a commonly-used technique due to its simplicity and effectiveness. However, K-means requires a user to provide the number of clusters $k$ initially [22, 41], and it may converge to a partition that is significantly inferior compared to the global optimum [45]. Based on this user-defined value of the number of clusters, K-means randomly selects $k$ initial seeds (cluster centers) from a dataset [2]. Fortunately, several techniques have been proposed for finding higher quality initial seeds than those obtained through random selection [39, 40]. For example, the work in [42] employed a $k d$ -tree to choose $K$ seeds for K-means through the use of density information of datasets. CFSFDP [6] presented a fast density clustering method based on the K-means algorithm which finds clusters with arbitrary shapes and does not require the number of clusters as input. In [36], a parallel OPTICS approach was proposed to find the initial seeds. Meanwhile, several GAs (genetic algorithms) with K-means have been developed and used to obtain a global optimal solution, with better initial seeds and number of clusters [3, 11, 25]. For instance, in [31] a model-based method called VGA-SVM (fuzzy clustering algorithm-support vector machine) with GA-based clustering was proposed which is used to handle the insufficiency (e.g., global optimization problem) of unsupervised clustering. The AGCUK algorithm was presented in [25] which is a GA-based clustering algorithm that uses DBI to automatically identify the number of clusters. NGKA [48] combines a GA to prevent premature convergence during clustering with SSE (sum of squared error). Typically, GenClust [41] can automatically identify the appropriate number of clusters and identify the appropriate seeds through a density-based initial population.

3. Preliminaries

The definitions used to describe the NicheClust algorithm are given in this section. First, assuming a linear interpolation [50] between sample locations (consisting of longitudes and latitudes), we present Definition 1 of the location data following [16, 27, 38]. Second, the basic clustering of location datasets is defined in Definition 2, which is used in the description of the location dataset’s clustering method. The distance between location data points is calculated by the Euclidean distance and the similarity between chromosomes is calculated by the cosine similarity measure; these are given in Definitions 3 and 4, respectively. Then, the definitions of the improved canopy, K-means++, and niche operation are presented in Definitions 5–8.

Definition 1.(GPS location data) An arbitrarily given location data subset ${CR}_{1}=\left\{G_{1},G_{2},\dots,G_{i}\right\}$ consisting of $\bm{i}$ points with positions (longitude, latitude) where each location data point has a set of attributes $G_{1}=(\textit{log}_{1},\textit{lat}_{1})$ , ${CR}_{1}=\{\left(\textit{log}_{1},\textit{lat}_{1}\right),\left(\textit{log}_{2},\textit{lat}_{2}\right),\dots,(\textit{log}_{i},\textit{lat}_{i})\}$ , can be also known as a trajectory [16]. If there are $\bm{j}$ location data subsets, then $CR=({CR}_{1},{CR}_{2},\dots,{CR}_{j})$ , and location datasets $\bm{R}^{N}$ are defined by $\left({CR}_{1},{CR}_{2},\dots,{CR}_{N}\right)$ . For $(\textit{log},\textit{lat})$ , log denotes the longitude and lat denotes the latitude of the attribute.

An arbitrarily given ${CR}_{1}$ is also defined as a chromosome in this paper, and a given location data point is defined as a gene.

Definition 2.(GPS location data clustering) Let the $K$ clusters be represented by $C=\left\{C_{1},\dots,C_{K}\right\}$ in $\bm{R}^{N}$ , and seeds of a given cluster $C_{1}$ are represented by $\textit{Seed}_{1}=\left\{\textit{seed}_{1},\textit{seed}_{2},\dots\right\}$ , where $n$ stands for the number of GPS location points in $CR_{1}$ , and $K_{\text{max}}=\sqrt{n}$ [24, 25].

A clustering matrix $U$ , of size $N\times K$ , of a given $C R$ may be represented by $U=\left[u_{kl}\right](l=1\dots n)$ , where $u_{kl}$ is the membership of pattern $x_{l}$ in cluster $C_{k}$ . When ${u}_{kl}=$ 1, the l-th pattern (location data point) belongs to the k-th cluster; otherwise ${u}_{kl}=$ 0.

Definition 3.(Distance between location data points) Let ${CR}_{i}$ be a chromosome in $\bm{R}^{N}$ made of a number of genes ${CR}_{i}=(G_{i1},G_{i2},\cdots,G_{ik})$ put together in a sequence, where any one gene $G_{i1}$ is the same as a seed $\textit{seed}_{1}$ , which is the center of a cluster $C_{1}$ . A cluster $C_{1}$ is the set of records that have the smallest distance to the gene/seed of the cluster, where the genes $G_{ik}$ have the same attributes $G_{i}$ as any $CR_{i}$ . The distances of genes/seeds between ${CR}_{i}$ and $G_{i1}$ are calculated by employing the Euclidean distance:

$\displaystyle\textit{dist }C_{1}=\left\{{CR}_{i}:\textit{dist}\left({CR}_{i},G_{i1}\right)=\sqrt{\sum^{K}_{i=1}{{({CR}_{i}-G_{i1})}^{2}}}\leqslant\sqrt{\sum^{K}_{i=1}{{\left({CR}_{i}-G_{ip}\right)}^{2}}};(\forall p\neq 1)\right\}$
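The assignment rule in Definition 3 (each record belongs to the cluster whose gene/seed is nearest) can be illustrated with a minimal sketch. The function names `euclidean` and `assign_to_seeds` and the sample coordinates are illustrative assumptions, not part of NicheClust itself.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two (longitude, latitude) points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_seeds(points, seeds):
    """For each point, return the index of the closest seed (gene)."""
    return [min(range(len(seeds)), key=lambda k: euclidean(p, seeds[k]))
            for p in points]

# Hypothetical sample data: two points near the first seed, one at the second.
points = [(116.30, 39.90), (116.31, 39.91), (121.47, 31.23)]
seeds = [(116.30, 39.90), (121.47, 31.23)]
labels = assign_to_seeds(points, seeds)
```

Ties are resolved in favour of the lower seed index here, which is an arbitrary choice of this sketch.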

Definition 4.(Similarity between chromosomes) Let ${CR}_{i},{CR}_{j}$ be two chromosomes in $\bm{R}^{N}$ , ${CR}_{b}$ is the best chromosome with the best fitness (see Section 4), then the similarity of the chromosomes ${CR}_{i},{CR}_{j}$ is calculated with the cosine similarity as follows:

$\displaystyle\textit{Sim}\left({CR}_{i},{CR}_{j}\right)=\frac{\sum^{K}_{i,j=1}{\left({CR}_{i}-{CR}_{b}\right)\left({CR}_{j}-{CR}_{b}\right)}}{\sqrt{\sum^{K}_{i=1}{{\left({CR}_{i}-{CR}_{b}\right)}^{2}}}\sqrt{\sum^{K}_{j=1}{{\left({CR}_{j}-{CR}_{b}\right)}^{2}}}}$

where ${CR}_{ik},{CR}_{jk}$ consist of gene numbers, and shorter distances between GPS location data points indicate higher similarity.
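As a hedged illustration of Definition 4, the sketch below computes the cosine similarity of two chromosomes after subtracting the best chromosome. It assumes that the three chromosomes have already been brought to equal length and flattened into plain lists of coordinates; the name `centered_cosine` and the zero-norm convention (returning 0.0) are assumptions of this sketch.

```python
import math

def centered_cosine(cr_i, cr_j, cr_b):
    """Cosine similarity of (cr_i - cr_b) and (cr_j - cr_b).

    All three arguments are equal-length lists of floats
    (flattened gene coordinates).
    """
    u = [a - b for a, b in zip(cr_i, cr_b)]
    v = [a - b for a, b in zip(cr_j, cr_b)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention when a chromosome equals the best one
    return dot / (norm_u * norm_v)
```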

Definition 5.(Improved canopy threshold) Given the datasets $C R$ in $\bm{R}^{N}$ , for each location point in the dataset, it is assigned appropriately according to its positions, and the row $\left(ro\in 1,2,\cdots,n\right)$ and column $\left(\textit{colu}\in 1,2,\cdots,n\right)$ of the location data points are calculated as follows:

$\displaystyle ro=\left\lceil\frac{\textit{log}_{\max}-\textit{log}}{\textit{log}_{\max}-\textit{log}_{\min}}\cdot{ts}_{x}\right\rceil,\textit{colu}=\left\lceil\frac{\textit{lat}_{\max}-\textit{lat}}{\textit{lat}_{\max}-\textit{lat}_{\min}}\cdot{ts}_{y}\right\rceil,\dots$

where $\textit{log}_{\max}$ and $\textit{lat}_{\max}$ are the maximum values among log and lat coordinates, whereas $\textit{log}_{\min}$ and $\textit{lat}_{\min}$ are the minimum values, and ${ts}_{x}$ and ${ts}_{y}$ are defined as the thresholds among log and lat coordinates by the user.

If the location data includes $\bm{n}$ attributes, then the thresholds ( $\textit{tsd}_{n}$ ) must be calculated in terms of the number of attributes, defined as follows:

$\displaystyle\textit{tsd}_{n}=\left\lceil\frac{x_{\max}-y}{x_{\max}-y_{\min}}\cdot{ts}_{n}\right\rceil$
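The row and column computation of Definition 5 for the two-attribute (longitude, latitude) case can be sketched as follows. The function name `grid_cell` and the sample bounds are illustrative assumptions; the arithmetic follows the formula for $ro$ and $\textit{colu}$ as written.

```python
import math

def grid_cell(log, lat, log_min, log_max, lat_min, lat_max, ts_x, ts_y):
    """Return (ro, colu) for one location point, following Definition 5."""
    ro = math.ceil((log_max - log) / (log_max - log_min) * ts_x)
    colu = math.ceil((lat_max - lat) / (lat_max - lat_min) * ts_y)
    return ro, colu

# Hypothetical bounding box [0, 10] x [0, 10] with thresholds ts_x = ts_y = 10.
cell = grid_cell(5.0, 7.5, 0.0, 10.0, 0.0, 10.0, 10, 10)
```

Under the formula as written, a point lying exactly on the maximum coordinate maps to index 0, so an implementation may need to decide how to treat that boundary.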

Definition 6.(Improved canopy) Given the location dataset $X$ , $\forall G_{i}\in CR$ , if $\exists C_{m}\left(0<m\leqslant K\right)$ that meets set $\{c_{m}|(\exists||G_{ix}-c_{m}||\leqslant ro)\bigwedge(\exists||G_{iy}-c_{m}||\leqslant\textit{colu})\neq\emptyset,t_{1}>t_{2}=\sqrt{({ro}^{2}+\textit{colu}^{2})},c_{m}\subseteq CR,i\neq m\}$ then $\left(G_{ix},G_{iy}\right)$ are defined for the canopy set, and $c_{m}$ is the center of the canopy.

If the given location data includes $\bm{n}$ attributes, then, according to Definition 5, $\{c_{m}|(\exists||G_{i1}-c_{m}||\leqslant\textit{tsd}_{1})\bigwedge(\exists||G_{i2}-c_{m}||\leqslant\textit{tsd}_{2})\bigwedge\dots\bigwedge(\exists||G_{in}-c_{m}||\leqslant\textit{tsd}_{n})\neq\emptyset\}$ :

$\displaystyle{t}_{1}>t_{2}=\sqrt{({\textit{tsd}_{1}}^{2}+{\textit{tsd}_{2}}^{2}+\dots+{\textit{tsd}_{n}}^{2})},{c}_{m}\subseteq CR,i\neq m.$

Definition 7.(K-means++) [5] Given the location dataset $D$ , $\forall x_{i}\in X_{n}$ , $x_{i}$ is defined for a seed $\textit{seed}_{i}$ of cluster ${C}_{i}$ , and the $\textit{sum}\left(\textit{dist }C_{i}\right)$ is the calculated distance around ${x}_{i}$ to employ $\textit{dist }C_{i}$ . To take a new seed $\textit{seed}_{j}$ of cluster $C_{j}$ , let the distance between $\textit{seed}_{i}$ and $\textit{seed}_{j}$ be the farthest. In particular, let $\textit{dist }C_{1}$ denote the shortest distance from a location data point to the closest center that has already been chosen, then randomly choosing $\forall x_{j}\in X_{n}$ with probability:

$\displaystyle P_{\textit{seed}}=\frac{{\textit{dist }C_{j}}^{2}}{\sum_{j\in X_{n}}{{\textit{dist }C_{j}}^{2}}}$

within a range value $\textit{sum}\left(\textit{dist }C_{i}\right)$ , and run $\textit{random}=-\textit{sum}\left(\textit{dist }C_{i}\right)$ . We repeat until $\textit{random}<0$ , and we take $\{\textit{seed}_{1},\textit{seed}_{2},\dots,\textit{seed}_{K}\}$ altogether as the initial seeds set for the initial population.
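The seeding rule of Definition 7 can be sketched with the standard K-means++ procedure [5], in which each new seed is drawn with probability proportional to its squared distance to the nearest seed already chosen. This is a minimal sketch of the standard procedure, not necessarily the exact loop used in NicheClust; the function names and the optional `rng` parameter are assumptions.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, rng=None):
    """Choose k seeds from points by squared-distance-weighted sampling."""
    rng = rng or random.Random()
    seeds = [rng.choice(points)]              # first seed: uniform at random
    while len(seeds) < k:
        # Squared distance from every point to its closest chosen seed.
        d2 = [min(sq_dist(p, s) for s in seeds) for p in points]
        total = sum(d2)
        if total == 0:                        # every point coincides with a seed
            seeds.append(rng.choice(points))
            continue
        r = rng.random() * total              # roulette over the d2 weights
        for p, w in zip(points, d2):
            r -= w
            if r < 0:
                seeds.append(p)
                break
        else:
            # Floating-point round-off guard: fall back to the farthest point.
            seeds.append(points[d2.index(max(d2))])
    return seeds
```

Because an already chosen point has weight zero, it cannot be drawn again unless all remaining points are duplicates of existing seeds.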

Definition 8.(Niche) [19, 43] The sharing-based niche parameters are the similarity between chromosomes $\textit{simi}\left(\textit{POP}_{i},\textit{POP}_{j}\right)$ , the sharing radius, the population size, the number of clusters, the sharing degree function $\textit{Share}\left(\textit{simi}\right)$ , and the sum of sharing degree functions. In this paper, $\textit{simi}\left(\textit{POP}_{i},\textit{POP}_{j}\right)$ is also defined by the cosine similarity measure as follows:

$\displaystyle\textit{simi}\left(\textit{POP}_{i},\textit{POP}_{j}\right)=\frac{\sum^{M+N}_{i,j=1}{\textit{POP}_{im}\cdot\textit{POP}_{jm}}}{\sqrt{\sum^{M+N}_{i=1}{{\textit{POP}_{im}}^{2}}}\sqrt{\sum^{M+N}_{j=1}{{\textit{POP}_{jm}}^{2}}}}\left(\begin{array}{c}i=1,2,\dots,M+N-1\\ j=i+1,\dots,M+N\end{array}\right)$

where POP denotes chromosomes, $M$ denotes the population size, $N$ denotes the niche population size, and $M$ is equal to N/2 in this paper.

4. The NicheClust clustering algorithm

The basic concepts of NicheClust are presented in the following sub-section and then a formal presentation of the GPS location data clustering technique is given, as shown in Fig. 1.

4.1 The main contributions of NicheClust

The key idea of NicheClust is the combination of NGA with hybrid K-means.

4.1.1 NGA chromosome representation

The chromosome of a GA is made up of a sequence of genes that encode, for example, binary digits, integers, symbols and floating-point numbers [11]. Here, the genes describe GPS location data points and each chromosome in the population is a potential solution that is used to perform K-means clustering. The chromosomes are acquired via the improved canopy and K-means++. The number of genes of a chromosome is set to be in the range $\left[2,\sqrt{n}\,\right]$ [39], where $n$ is the number of location data points and each gene in a chromosome is a record ${X}_{i}$ or $X_{j}$ which is chosen from the GPS location dataset.

Figure 1. The NicheClust processing diagram.

4.1.2 An initial population and the number of clusters technique using an improved canopy

Canopies are usually used to generate the number of clusters, but this can only be a rough estimate [30]. If the density of GPS data points of canopies is high by the given thresholds (with two distance thresholds $T_{1},T_{2}$ , where ${T}_{1}>T_{{2}}$ ), then canopies are usually trapped in a local optimum. Meanwhile, when there is a large volume of GPS location data points, the computational cost of $T_{{2}}$ is also very high. The improved canopy approach presented here captures cluster numbers with a given step value and range in terms of a user’s needs, which can be used to generate the initial population. According to the attributes of the GPS location data, $T_{{2}}$ can be determined (see Definitions 5 and 6), and the distances between GPS location points are calculated by the Euclidean distance measure (see Definition 3).
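For orientation, the classic two-threshold canopy procedure [30] that the improved canopy builds on can be sketched as below. This is the conventional method with user-supplied $T_{1}>T_{2}$, not the improved canopy of Definitions 5 and 6; the function name `canopy_centers` and the sample data are assumptions.

```python
import math

def canopy_centers(points, t1, t2):
    """Classic canopy pass: return a list of (center, members) pairs.

    Points within t1 of a center join its canopy; points within t2
    are removed from the candidate list so they cannot become centers.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center] + [p for p in remaining
                              if math.dist(center, p) <= t1]
        canopies.append((center, members))
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return canopies

# Hypothetical data: two well-separated pairs of points.
demo = canopy_centers([(0, 0), (0.5, 0), (10, 10), (10.5, 10)], t1=2.0, t2=1.0)
```

The number of canopies returned gives the rough estimate of the number of clusters mentioned above.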

4.1.3 Initial cluster centers (Seeds) of NicheClust with K-means++

K-means++ is a simple and fast initial seed algorithm, which can be used to improve the quality of K-means clustering [2, 41, 54, 60]. Therefore, the improved canopy method with distance is first used to randomly choose cluster numbers (see Section 4.1.2) and then used as the initial chromosome’s representation. For example, 30 chromosomes are chosen which consist of the different gene numbers of each chromosome with the canopies set (see Section 4.1.2 and Definition 7). Then K-means++ is employed to capture the initial seeds from the GPS location dataset for each chromosome. Finally, the best chromosome will be selected and used for the operation of NGA with the Euclidean distance measure.

4.1.4 Cosine-based similarity for an improved gene rearrangement technique

Some existing gene rearrangement methods [11, 48] have been used to handle equal length chromosomes in previous years. The gene rearrangement technique in GenClust [41] can be used to handle unequal lengths of chromosomes, but this only considers gene rearrangement with distance comparison without explaining the structure between reference chromosomes and the target chromosome. Therefore, in this paper, the best chromosome is first selected in terms of its fitness value, and a pair of chromosomes is chosen with an existing roulette wheel method [25, 31] from the initial population, where the chromosomes ${CR}_{i},{CR}_{j}$ are picked with a probability $p\left({CR}_{i},{CR}_{j}\right)=f\left({CR}_{i},{CR}_{j}\right)/\sum^{\textit{NIND}}_{i=1,j=i+1}{f\left({CR}_{i},{CR}_{j}\right)}$ . Here, $f\left({CR}_{i},{CR}_{j}\right)$ is the fitness value of the pair of chromosomes ${CR}_{i},{CR}_{j}$ and NIND is the size of the population. Second, the cosine measure (see Definition 4) is employed to calculate the similarity between chromosomes, and the chromosome with the maximum similarity value is used as the reference chromosome for gene rearrangement. Finally, the gene rearrangement technique in GenClust [41] is used to implement the gene rearrangement operations and then to obtain equal length chromosomes.
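The roulette wheel step can be sketched as ordinary fitness-proportionate selection. For simplicity this sketch picks a single chromosome index per call rather than a pair, and it assumes non-negative fitness values; the name `roulette_pick` and the `rng` parameter are illustrative.

```python
import random

def roulette_pick(fitness, rng=None):
    """Return an index drawn with probability fitness[i] / sum(fitness)."""
    rng = rng or random.Random()
    total = sum(fitness)
    r = rng.random() * total
    acc = 0.0
    for idx, f in enumerate(fitness):
        acc += f
        if r < acc:
            return idx
    return len(fitness) - 1   # round-off guard

def roulette_pair(fitness, rng=None):
    """Draw two distinct indices by repeated roulette picks."""
    rng = rng or random.Random()
    i = roulette_pick(fitness, rng)
    j = roulette_pick(fitness, rng)
    while j == i:
        j = roulette_pick(fitness, rng)
    return i, j
```

`roulette_pair` requires at least two chromosomes with positive fitness, otherwise the retry loop cannot terminate.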

4.1.5 Niche partitioning based on the improved canopy method

In this paper, a sharing-based niche technique is employed to prevent premature convergence and maintain population diversity. Therefore, the improved canopy method (see Section 4.1.2 and Definitions 5, 6, and 8) is designed to divide the population into a number of niches, and then a sharing-based niche function is presented to readjust the sharing values.

4.2 Fitness functions of the NGA

The fitness function is used to control candidate solutions and verify objective measures for NGA [3]. Therefore, some evaluation criteria, including Dunn's index (DI), PBM (PBM-index), SSE, Xie-Beni index (XBI), DB-index and COSEC [41] can typically be used as fitness functions [33]. In this paper, SSE, DB-index, PBM, and COSEC are used as fitness functions for NGA [60], and are given in Table 1.

Table 1. The fitness functions used in NGA

Fitness function	Objective function	Optimization result
SSE	$f_{\textit{SSE}}={{\text{max}}_{K}\frac{1}{\textit{SSE}}}$ where $\textit{SSE}=\sum^{K}_{i=1}{\sum_{x\in C_{i}}{d^{2}\left(X,\textit{seed}_{i}\right)}}$	Maximum value
DBI	$f_{\textit{DBI}}=\frac{1}{K}\sum^{K}_{i=1}{{{\text{max}}_{i\neq j}\left(\frac{\sum_{x\in C_{i}}{d^{2}\left(X,\textit{seed}_{i}\right)+\sum_{x\in C_{j}}{d^{2}\left(X,\textit{seed}_{j}\right)}}}{d^{2}\left(\textit{seed}_{i},\textit{seed}_{j}\right)}\right)}}$	Maximum value
PBM	$f_{\textit{PBM}}={\left(\frac{1}{K}\times\frac{d^{2}\left(x_{1},\textit{seed}_{1}\right)}{\sum^{K}_{k=1}{\sum^{n}_{i=1}{d^{2}\left(x_{i},\textit{seed}_{k}\right)}}}\times{\text{max}}^{K}_{i,j=1}\{d^{2}\left(\textit{seed}_{i},\textit{seed}_{j}\right)\}\right)}^{2}$	Maximum value
COSEC	$f_{\textit{COSEC}}=\sum_{\forall j}{\left\{{{\text{min}}_{\forall k\neq j}d^{2}\left(\textit{seed}_{i},\textit{seed}_{j}\right)-{\text{Comp}}_{j}}\right\}}$ where ${\text{Comp}}_{j}=\frac{\sum_{x\in C_{j}}{\left(\sum^{K}_{j=1}{{\left(\frac{d^{2}\left(X,\textit{seed}_{i}\right)}{d^{2}\left(X,\textit{seed}_{j}\right)}\right)}^{-2/\left(m-1\right)}}\cdot d^{2}\left(X,\textit{seed}_{j}\right)\right)}}{\sum^{K}_{j=1}{{\left(\frac{d^{2}\left(X,\textit{seed}_{i}\right)}{d^{2}\left(X,\textit{seed}_{j}\right)}\right)}^{-2/\left(m-1\right)}}}$ , and $m$ is the fuzzy exponent according to [31]	Minimum value

In Table 1, $d^{2}\left(X,\textit{seed}_{i}\right)$ denotes the squared Euclidean distance between an observed vector $X$ (the gene values of each chromosome) and $\textit{seed}_{i}$ , the seed of cluster $i$ ; $K$ denotes the number of clusters; and $C_{i},C_{j}$ denote different clusters.
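Two of the objective functions in Table 1 can be illustrated with a short sketch. It assumes that every point already carries a cluster label and that no two seeds coincide; `f_dbi` follows the DBI expression exactly as written in Table 1 (summed squared distances per cluster), which may differ from other formulations of the Davies-Bouldin index. All function names are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance d^2 between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cluster_scatter(points, seeds, labels):
    """Per-cluster sum of squared distances from members to their seed."""
    scatter = [0.0] * len(seeds)
    for p, label in zip(points, labels):
        scatter[label] += sq_dist(p, seeds[label])
    return scatter

def f_sse(points, seeds, labels):
    """SSE-based fitness 1 / SSE (infinite when SSE is zero)."""
    sse = sum(cluster_scatter(points, seeds, labels))
    return float("inf") if sse == 0 else 1.0 / sse

def f_dbi(points, seeds, labels):
    """DBI expression of Table 1: mean over i of the worst pairwise ratio."""
    k = len(seeds)
    s = cluster_scatter(points, seeds, labels)
    worst = [max((s[i] + s[j]) / sq_dist(seeds[i], seeds[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k
```

For two clusters of two points each, placed one unit from their seeds with the seeds ten units apart, the SSE is 4 and each pairwise ratio is 4/100.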

4.3 The key steps of NicheClust

The key steps of the NicheClust algorithm are introduced according to Section 4.1, and are shown in Algorithm 1.

Step 1: Initial population (Procedure 2). The improved canopy method and K-means++ algorithm are used to produce an initial population including NIND chromosomes from a given GPS location dataset. Therefore, the user-defined canopy thresholds and step and range values of the improved canopy are used to generate unequal length chromosomes. The genes of each chromosome are constrained to number between 2 and $\sqrt{n}$ , with the aim of determining the number of clusters. Then, K-means++ is employed to obtain the initial cluster centers (seeds) for the GPS location dataset.
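Step 1 can be sketched under simplifying assumptions. The number of genes of each chromosome is drawn from the stepped range between 2 and $\sqrt{n}$ as a stand-in for the canopy-derived estimate, and the seeds are produced by a pluggable `seed_fn`, which defaults to plain random sampling here rather than K-means++. The function name and all parameters are hypothetical.

```python
import math
import random

def initial_population(points, nind, k_min=2, step=1, seed_fn=None, rng=None):
    """Build nind chromosomes of unequal length in [k_min, floor(sqrt(n))].

    seed_fn(points, k) returns k seeds; the default is uniform sampling
    without replacement, used only as a placeholder for K-means++.
    """
    rng = rng or random.Random()
    if seed_fn is None:
        seed_fn = lambda pts, k: rng.sample(pts, k)
    k_max = max(k_min, math.isqrt(len(points)))
    candidate_ks = list(range(k_min, k_max + 1, step))
    return [seed_fn(points, rng.choice(candidate_ks)) for _ in range(nind)]
```

A K-means++ routine with the same `(points, k)` signature can be passed as `seed_fn` to follow the text more closely.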

Step 2: Selection operation (Procedure 3). The fitness values of the NIND chromosomes (e.g. $\textit{NIND}=$ 60) are calculated according to Table 1 and sorted in descending order, and then the best $\frac{\textit{NIND}}{2}$ chromosomes are chosen and used to produce a new population ${\textit{POP}}_{s}$ . The ${CR}_{\text{bestselect}}$ (maximum fitness value) is chosen from ${\textit{POP}}_{s}$ and is stored in memory. Keeping only the best $\frac{\textit{NIND}}{2}$ chromosomes for the subsequent NGA operations is intended to improve the efficiency of NGA.
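The selection step amounts to a truncation of the sorted population, which can be sketched as follows. The function name `select_best_half` is an assumption, and larger fitness values are treated as better.

```python
def select_best_half(population, fitness):
    """Return (POP_s, CR_bestselect): the best NIND/2 chromosomes and the top one."""
    order = sorted(range(len(population)),
                   key=lambda i: fitness[i], reverse=True)
    keep = order[:len(population) // 2]
    pop_s = [population[i] for i in keep]
    return pop_s, pop_s[0]

# Hypothetical population of four labelled chromosomes.
pop_s, best = select_best_half(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.3])
```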

Step 3: Crossover operation (Procedure 4). In this paper, the crossover operator is used to incorporate diversity via gene rearrangement and an adaptive crossover probability. All chromosomes in ${\textit{POP}}_{s}$ participate in the crossover operation as follows. First, each chromosome in ${\textit{POP}}_{s}$ is separated into two parts as parent chromosomes at a random GPS location data point. Second, a single point crossover method $\textit{SPC}\left({CR}_{i},{CR}_{j},P_{c}\right)$ is used to generate two offspring chromosomes, and the parent chromosomes are removed from POP of the current iteration.

A single point crossover model is presented as:

$\displaystyle\textit{SPC}\left({CR}_{i},{CR}_{j},P_{c}\right)=\left\{\begin{array}{l}{CR}^{\prime}_{i}=P_{c}\cdot{CR}_{j}+(1-P_{c})\cdot{CR}_{i}\\ {CR}^{\prime}_{j}=P_{c}\cdot{CR}_{i}+(1-P_{c})\cdot{CR}_{j}\end{array}\right.$

where $P_{c}$ is an adaptive crossover probability between $\left(0,{1}\right)$ , which is defined as:

$\displaystyle P_{c}=\left\{\begin{array}{ll}\frac{f_{\text{max}}-f^{\prime}}{f_{\text{max}}-f_{\text{avg}}}&\text{if }f^{\prime}>f_{\text{avg}}\\ 1&\text{if }f^{\prime}\leqslant f_{\text{avg}}\end{array}\right.$

where $f_{\text{max}}$ , $f_{\text{avg}}$ , and $f^{\prime}$ denote the maximum fitness value of the current population, the average fitness value of the current population, and the larger of the fitness values of the chromosomes to be crossed, respectively. Once the crossover operations have been performed, the crossover population ${\textit{POP}}_{c}$ is produced and is used for the mutation operation.
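The adaptive crossover probability and the SPC model can be sketched as below. The sketch assumes two parents of equal length (after gene rearrangement) whose genes are coordinate tuples, and it applies the blend of the SPC model gene by gene; the function names are illustrative.

```python
def adaptive_pc(f_max, f_avg, f_prime):
    """Adaptive crossover probability P_c from the piecewise definition."""
    if f_prime > f_avg:
        return (f_max - f_prime) / (f_max - f_avg)
    return 1.0

def spc(cr_i, cr_j, pc):
    """Blend two equal-length chromosomes according to the SPC model."""
    child_i = [tuple(pc * b + (1 - pc) * a for a, b in zip(gi, gj))
               for gi, gj in zip(cr_i, cr_j)]
    child_j = [tuple(pc * a + (1 - pc) * b for a, b in zip(gi, gj))
               for gi, gj in zip(cr_i, cr_j)]
    return child_i, child_j
```

With $P_{c}=1$ the two offspring are simply the swapped parents, and as $f^{\prime}$ approaches $f_{\text{max}}$ the offspring stay close to their own parents, so fitter pairs are disturbed less.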

Step 4: Mutation operation (Procedure 5). The mutation operator is usually used to explore different solution spaces. Adaptive mutation is used here, with an adaptive mutation probability formula $P_{m}$ [3] for chromosome $i$ as follows in ${\textit{POP}}_{c}$ :

$\displaystyle P_{m}=\left\{\begin{array}{ll}{\xi}_{1}\times\frac{f_{\text{max}}-f_{i}}{f_{\text{max}}-f_{\text{avg}}}&\text{if }f_{i}>f_{\text{avg}}\\ {\xi}_{2}&\text{if }f_{i}\leqslant f_{\text{avg}}\end{array}\right.$

where ${\xi}_{1}={\xi}_{2}=0.5$ are mutation control coefficients; $f_{\text{max}}$ and $f_{\text{avg}}$ are the maximum and average fitness values of the chromosomes in POP, respectively; and $f_{i}$ is the fitness value of the $i$ -th chromosome.
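The adaptive mutation probability can be written directly from its piecewise definition. This is a minimal sketch; the function name `adaptive_pm` and the keyword defaults (taken from the stated value 0.5) are the only additions.

```python
def adaptive_pm(f_i, f_max, f_avg, xi1=0.5, xi2=0.5):
    """Adaptive mutation probability P_m for a chromosome with fitness f_i."""
    if f_i > f_avg:
        return xi1 * (f_max - f_i) / (f_max - f_avg)
    return xi2
```

Chromosomes at or below the average fitness mutate with the fixed probability $\xi_{2}$, while above-average chromosomes mutate less as their fitness approaches $f_{\text{max}}$.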

If the maximum and minimum fitness values of chromosomes in ${\textit{POP}}_{c}$ along the $i$ dimension are $f^{i}_{\text{max}}$ and ${f}^{i}_{\text{min}}$ , respectively, then the adjusted fitness value $f^{i^{\prime}}$ of the $i$ -th chromosome is calculated as follows:

$\displaystyle f^{i^{\prime}}=\left\{\begin{array}{l}f^{i}+P_{m}\times\left(f^{i}_{\text{max}}-f^{i}\right)\\ f^{i}+P_{m}\times\left(f^{i}-f^{i}_{\text{min}}\right)\end{array}\right.$

If the mutation operations have been performed, then the mutated population ${\textit{POP}}_{m}$ will be produced, and the best fitness ${CR}_{\text{bestselect}}$ is again updated at this stage.

Step 5: Niching operation (Procedure 6). To maintain population stability and diversity of the GA [1, 12], a sharing-based niche method is introduced to search for a global optimum solution in ${\textit{POP}}_{m}$ . Therefore, an initial niche population ${\textit{POP}}_{\text{new}}$ ( ${\textit{POP}}_{\text{new}}=\textit{POP}_{m}+\textit{POP}_{N\text{max}}$ ) is produced and used for this purpose. In particular, the sharing-based fitness $f_{\text{share}}\left(i\right)$ is used to generate niche numbers with the canopy-based partitioning method for generation $g$ , where $i$ denotes chromosome $i$ to be assigned, corresponding to niches. The chromosome $\textit{POP}_{\textit{best},g}$ is again obtained to retain the best fitness value in each generation, and the sharing-based niche fitness $f_{\text{share},\textit{nich}}\left(i\right)$ can be calculated as follows:

$\displaystyle f_{\text{share},\textit{nich}}\left(i\right)=\frac{f_{g}(i)}{s_{g,\textit{nich}}(i)}$

where $f_{g}(i)$ denotes the fitness value of chromosome $i$ in generation $g$ , and $s_{g,\textit{nich}}(i)$ is the sum of the sharing-based niche fitness values which relies on the canopy-based partitioning niche numbers within $\textit{POP}_{\text{new}}$ . The sum of the sharing-based niche degrees is calculated for each generation $g$ :

$\displaystyle s_{g,\textit{nich}}\left(i\right)=\sum^{\textit{nich}}_{i=1}{\textit{Share}\left(\textit{simi}\left(\textit{POP}_{\textit{best},g},{\textit{POP}}_{i}\right)\right)}$

where $\textit{simi}\left({\textit{POP}}_{\textit{best},g},\textit{POP}_{i}\right)$ is the cosine-based similarity (see Definition 8) between the best chromosome in $\textit{POP}_{\text{new}}$ and chromosome $i$ , and share stands for the sharing function, which measures the sharing degree between the two chromosomes. Following the work in [14], it is defined as:

$\displaystyle\textit{share}\left(\textit{simi}\left(\textit{POP}_{\textit{best},g},\textit{POP}_{i}\right)\right)=\left\{\begin{array}{ll}1-{\left(\frac{\textit{simi}\left({\textit{POP}}_{\textit{best},g},{\textit{POP}}_{i}\right)}{\frac{\sqrt{\left|{\textit{POP}}_{\text{new}}\right|}}{2*\sqrt[\left|\textit{POP}_{\text{new}}\right|]{\left|{CR}_{\text{best}}\right|}}}\right)}^{2}&\text{if }\textit{simi}\left(\textit{POP}_{\textit{best},g},\textit{POP}_{i}\right)\leqslant\frac{\sqrt{\left|\textit{POP}_{\text{new}}\right|}}{2*\sqrt[\left|\textit{POP}_{\text{new}}\right|]{\left|{CR}_{\text{best}}\right|}}\\ 0&\text{otherwise}\\ \end{array}\right.$

When niche operations have been executed, a niche population ${\textit{POP}}_{\text{niche}}$ can be produced and used to perform elitism operations in Step 6.
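The sharing computation of Step 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Procedure 6: chromosomes are treated as flat numeric vectors of equal length, plain cosine similarity stands in for Definition 8, and the helper names (`cosine_similarity`, `sharing_radius`, `share`, `shared_fitness`) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (assumed stand-in for Definition 8)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sharing_radius(pop_size, best_len):
    """Threshold sqrt(|POP_new|) / (2 * |CR_best| ** (1 / |POP_new|)) from the sharing function."""
    return math.sqrt(pop_size) / (2.0 * best_len ** (1.0 / pop_size))


def share(simi, radius):
    """Sharing degree: 1 - (simi / radius)^2 inside the radius, 0 otherwise."""
    if simi <= radius:
        return 1.0 - (simi / radius) ** 2
    return 0.0


def shared_fitness(fitness, best, niche_members, radius):
    """Raw fitness divided by the summed sharing degrees of the niche members."""
    niche_sum = sum(share(cosine_similarity(best, m), radius) for m in niche_members)
    # Assumption: fall back to the raw fitness when the sum is zero.
    return fitness / niche_sum if niche_sum > 0.0 else fitness
```

For example, `sharing_radius(4, 1)` evaluates the threshold for a niche population of four chromosomes whose best chromosome has a single gene, and `shared_fitness` then scales a raw fitness by the summed sharing degrees.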

Step 6: Elitism operation (Procedure 7). To improve the quality of the chromosomes in each generation [41], an elitist population $\textit{POP}_{\text{elitism}}$ is produced from $\textit{POP}_{\text{niche}}$ , and the best chromosome ${CR}_{\text{bestelitism}}$ in $\textit{POP}_{\text{elitism}}$ is obtained by ranking the chromosomes in descending order of their fitness values.
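One simple way to realize such an elitism step is sketched below. The representation of the population as (chromosome, fitness) pairs and the `elite_size` parameter are assumptions made for illustration; Procedure 7 of the paper may select the elite differently.

```python
def elitism(population, elite_size):
    """Return the elite sub-population and its best chromosome.

    population: list of (chromosome, fitness) pairs (assumed representation).
    elite_size: number of chromosomes to keep (hypothetical parameter).
    """
    if not population or elite_size < 1:
        raise ValueError("population must be non-empty and elite_size >= 1")
    # Rank chromosomes in descending order of fitness.
    ranked = sorted(population, key=lambda pair: pair[1], reverse=True)
    elite = ranked[:elite_size]
    best_chromosome = elite[0][0]
    return elite, best_chromosome
```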

Step 7: K-means clustering operation. When NGA terminates, it returns the best chromosome, ${CR}_{\text{bestelitism}}$ . The genes and the number of genes in ${CR}_{\text{bestelitism}}$ are then used as the initial cluster centers (seeds) and the number of clusters for K-means clustering, respectively. In addition, a termination condition for the K-means clustering must be specified; it is defined by the following formula:

$\displaystyle\textit{SSE}_{j}=\frac{f\left(j\right)-f(j+1)}{f(j)}<\varepsilon$

where $\textit{SSE}_{j}$ is the relative change between iterations $j$ and $j+1$ in the sum of squared error $f(j)$ over the GPS data points, and $\varepsilon$ is a given threshold for terminating K-means clustering. When K-means clustering meets this termination condition, the clustering results of NicheClust are displayed on the Auto Navi Map (Amap).
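Step 7 can be illustrated with a small pure-Python K-means that starts from given seeds and stops when the relative decrease of the sum of squared error falls below $\varepsilon$ . This is a sketch under assumptions: `kmeans_from_seeds`, `max_iter`, and the handling of empty clusters (the old center is kept) are not taken from the paper, and $f(j)$ is assumed to be the SSE after iteration $j$ .

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_from_seeds(points, seeds, eps=1e-6, max_iter=100):
    """K-means seeded with the genes of the best chromosome (here: `seeds`)."""
    centers = [tuple(s) for s in seeds]
    labels = []
    current_sse = 0.0
    previous_sse = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old center
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
        current_sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
        if current_sse == 0.0:
            break
        # Termination: (f(j) - f(j+1)) / f(j) < eps
        if previous_sse is not None and (previous_sse - current_sse) / previous_sse < eps:
            break
        previous_sse = current_sse
    return centers, labels, current_sse
```

The number of clusters is implied by `len(seeds)`, which mirrors the use of the number of genes in the best chromosome as the cluster count.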

4.4 Complexity of NicheClust

According to Section 3 and Algorithm 1, let $n$ be the number of data points/records in a given GPS location dataset, NIND the population size of NGA (denoted by $N$ in this section only), $\frac{3N}{2}$ the niche population size, MAXGEN ( $G$ ) the number of NGA iterations, and $v$ the number of iterations for K-means clustering. The complexity of NicheClust consists of the initial population operation, selection operation, crossover operation, mutation operation, niche operation, elitism operation, and K-means clustering operation. Therefore, the complexity of NicheClust is approximately equal to $O(n)$ .

Figure 2.

Distribution of GPS location data points: (i) Beijing, China; (ii) Shanghai, China; (iii) Nanjing, China.

5. Experimental results and discussion

To test the effectiveness of NicheClust, three GPS location datasets (Beijing taxicabs, Shanghai taxicabs [61], and Nanjing taxicabs) are used in the experiments, as shown in Fig. 2 and Table 2. Table 2 lists the number of GPS location data points and the data collection time for each dataset.

Table 2
The experimental GPS location datasets

GPS location dataset	Number of GPS location data points (longitude, latitude)	GPS location data collection time
Beijing	17,387	2016.03
Shanghai	11,364	[61]
Nanjing	6,978	2012.09

Figure 3.

GA-based clustering results of the three algorithms for the three GPS location datasets (i, ii, iii) over 20 independent trials, using (a) SSE, (b) DBI, (c) PBM, and (d) COSEC as the fitness function.

The outlier data points of the three GPS location datasets in Table 2 have been cleaned. The clustering results are assessed in Section 5.3 using typical evaluation criteria, which include separation (SP), silhouette coefficient (SC), stability (S), and running time (RT).

5.1 The parameters used in the experiments

All the algorithms (NicheClust, GenClust, and GAK) have the same initialization parameters. The population sizes of GAK, GenClust, and NicheClust for the GPS location datasets (Beijing, Shanghai, and Nanjing, China) are equal to 60, 40, and 32, respectively. The number of iterations used for NGA is 120. The termination threshold for K-means clustering is set to 10 ${}^{-6}$ (i.e., the distance values between GPS location data points are less than 10 ${}^{-6}$ ). The user-defined canopy thresholds are tslog $=$ 0.0925 and tslat $=$ 0.0925, and the range of step values for the improved canopy is set to [0, 0.3], with sv $=$ 0.01. In addition, the parameter values of GenClust and GAK are set according to [41] and [7], respectively. In NicheClust, the parameter values are determined automatically. All fitness values $f$ in the compared algorithms need to be normalized with the following formula so that they share the same comparison standard:

$\displaystyle f_{\text{norm}}=\frac{f-f_{\text{min}}}{f_{\text{max}}-f_{\text{min}}}$

where $f_{\text{norm}}$ denotes the normalized fitness value, $f$ denotes the current fitness value, and $f_{\text{max}}$ and $f_{\text{min}}$ denote the maximum and minimum fitness values, respectively.
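The normalization above is ordinary min-max scaling. A minimal sketch follows; the function name `min_max_normalize` and the choice to map a constant series to zeros (where the formula is undefined because $f_{\text{max}}=f_{\text{min}}$ ) are assumptions.

```python
def min_max_normalize(values):
    """Scale fitness values to [0, 1] via (f - f_min) / (f_max - f_min)."""
    f_min, f_max = min(values), max(values)
    if f_max == f_min:
        # Assumption: a constant series is mapped to zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(f - f_min) / (f_max - f_min) for f in values]
```

Applying it separately to the fitness curve of each algorithm puts SSE, DBI, PBM, and COSEC values on the same [0, 1] scale for comparison.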

5.2 Experimental results

This section compares the performance of NicheClust with the other algorithms. We first compare the speed and smoothness of the GA-based clustering algorithms (NicheClust, GenClust, and GAK) for the four fitness functions (see Section 4.2) on the three GPS location datasets. For each dataset, 20 independent trials are performed and their mean values are recorded. The average fitness values of SSE, DBI, PBM, and COSEC obtained by the three algorithms are shown in Fig. 3a–d.

In Fig. 3a–d, the average values of the fitness functions over the 20 independent trials are presented for the three GPS location datasets. We also compare the convergence speed and smoothness of the three algorithms in order to assess the effectiveness of NicheClust. It can be seen from the average scores that NicheClust obtains higher scores than GenClust and GAK for SSE, DBI, PBM, and COSEC, and that its convergence speed and smoothness are also better than those of GenClust and GAK. NicheClust also requires fewer iterations to reach convergence. In particular, the convergence curves of NicheClust are smooth and do not fluctuate. The results indicate that NicheClust obtains better clustering than GenClust and GAK for the four fitness functions without being trapped in local optima.

This indicates that NicheClust has better diversity than GenClust in every iteration. The clustering results indicate that the overall efficiency and performance of NicheClust are better than GenClust [41] and GAK [7].

To validate the applicability of NicheClust, the best clustering results of NicheClust for the three GPS location datasets are displayed in Amap according to [63]. NicheClust uses SSE, DBI, PBM, and COSEC to generate the clustering results, which are shown in Fig. 4a–d.

Figure 4.

a. Clustering results shown on Auto Navi Maps (Amap) for three GPS location datasets for the taxicabs (a, b, c). The figures plot the SSE obtained by NicheClust. b. Clustering results shown on Auto Navi Maps (Amap) for three GPS location datasets for the taxicabs (a, b, c). The figures plot the DBI obtained by NicheClust. c. Clustering results shown on Auto Navi Maps (Amap) for three GPS location datasets for the taxicabs (a, b, c). The figures plot the PBM obtained by NicheClust. d. Clustering results shown on Auto Navi Maps (Amap) for three GPS location datasets for the taxicabs (a, b, c). The figures plot the COSEC obtained by NicheClust.

Figure 4a shows that 53, 101, and 119 clusters were obtained for the Nanjing, Shanghai, and Beijing taxicab data, respectively, by NicheClust using SSE. Each cluster center can be used to reflect places with traffic flow information and high-density populations. From Fig. 4b, we obtain 54, 86, and 106 clusters for the Nanjing, Shanghai, and Beijing taxicab data, respectively, by the NicheClust algorithm using DBI. The resulting clusters are distributed in a similar manner to SSE, but the positions of the cluster centers are different. From Fig. 4c, we obtain 15, 38, and 25 clusters for the Nanjing, Shanghai, and Beijing taxicab data, respectively, by NicheClust using PBM. When PBM is used as the fitness function, it reduces the number of clusters, but still reflects the city’s population migration distribution. From Fig. 4d, we obtain 57, 90, and 92 clusters for the Nanjing, Shanghai, and Beijing taxicab data, respectively, by the NicheClust algorithm using COSEC. This clustering result also reflects traffic and population flow information.

5.3 Evaluation of clustering techniques

To evaluate and validate the performance of the GA-based clustering results (for NicheClust, GenClust, and GAK), several index models are used as evaluation criteria: separation (SP) [17], silhouette coefficient (SC) [3], stability (S) [17], and the run time (RT) in Table 3; the experimental results are given in Table 4.

Table 3
The evaluation criteria

Evaluation	Formula	Characteristics
SP [17]	$SP=\frac{2}{K(K-1)}\sum^{K}_{i=1}{\sum^{K}_{j=i+1}{\left\|C_{i}-C_{j}\right\|}}$	If $SP\to 0$ , the cluster centers $C_{i}$ and $C_{j}$ lie closer to each other.
SC [3, 31]	$SC=\frac{1}{K}\sum^{K}_{j=1}{\sum_{x_{j}}{\frac{a_{j}-b_{j}}{\text{max}\left(a_{j},b_{j}\right)}}}$	This denotes the clustering assignment for an observation vector $x_{i}$ and $a_{j},b_{j}\in x_{i}$ . The evaluation values of the SC-index lie between $-$ 1 and 1, and a higher SC-index value indicates a better clustering result.
Stability [17]	$S_{K}=\frac{2}{n(n-1)}\sum^{n-1}_{i=1}{\sum^{n}_{j=i+1}{S_{r}(R_{i},R_{j})}}$ where $S_{r}\left(R_{i},R_{j}\right)=\left\{\begin{array}{ll}1&\text{if }R_{i}(x_{i})\neq R_{j}(x_{j})\\ 0&\text{otherwise}\\ \end{array}\right.$	This can be used to explain the stability of seeds (cluster centers) $R_{i},R_{j}$ . It lies between 0 and 1, with 0 indicating that the clustering results between all pairs of $R_{i},R_{j}$ are completely different, and 1 indicating that they are identical.
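Two of the criteria in Table 3 can be sketched in Python as follows. The separation follows the SP formula with the Euclidean distance between cluster centers. The silhouette function uses the widely used per-point definition $(b-a)/\text{max}(a,b)$ averaged over all points, which is an assumption here and may differ from the table's notation in sign convention and averaging. The function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def separation(centers):
    """SP: mean pairwise Euclidean distance between the K cluster centers."""
    k = len(centers)
    if k < 2:
        raise ValueError("at least two cluster centers are required")
    total = sum(
        euclidean(centers[i], centers[j]) for i in range(k) for j in range(i + 1, k)
    )
    return 2.0 * total / (k * (k - 1))


def mean_silhouette(points, labels):
    """Mean silhouette, using the standard (b - a) / max(a, b) per point (assumed)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0.0 else 0.0)
    return sum(scores) / len(scores)
```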

The SP, S, SC, and run time (RT) values are shown in Table 4. The table reports these index values and run times, averaged over 20 runs, for the four fitness functions (SSE, DBI, PBM, and COSEC) and the three algorithms (NicheClust, GenClust, and GAK) on the three GPS location datasets in Table 2.

Table 4 shows that the results of the three GA-based clustering algorithms are acceptable with regard to the measures of SP and SC. It can also be seen that NicheClust provides the best clustering results based on SP and SC in comparison to GenClust and GAK. Firstly, it is seen that NicheClust often generates compact and dispersed clusters based on SP in comparison to the other algorithms. Secondly, Table 4 provides the average SC index scores, which indicate that NicheClust also has better clustering solutions than GenClust and GAK. NicheClust not only generates compact clusters, but also produces well-separated clusters. Thirdly, in order to compare the execution time of each algorithm for the different fitness functions, three GPS location datasets are used for each algorithm: 1) GAK runs for the longest time and GenClust runs for the least time with SSE; 2) NicheClust runs for the longest time and GAK runs for the least time with DBI; 3) GAK runs for the longest time and GenClust runs for the least time with PBM, but GenClust has only two clusters corresponding to the lesser run time; and 4) GenClust runs for the longest time and NicheClust runs for the least time with COSEC.

Table 4 also shows the stability (S) results obtained for NicheClust, GenClust, and GAK on the three GPS location datasets. Higher stability values indicate smaller changes in the output and are always preferable [17]. The table shows that, for the same GPS location datasets with the different fitness functions, the stability obtained by NicheClust is better than that obtained by GenClust and GAK, except for GenClust with PBM. A few of the stability values for GAK are good for small location datasets. However, Table 4 indicates that GAK has the lowest stability values for GPS location datasets with large amounts of data.

The experimental results show that the overall efficiency and performance of NicheClust are better than those of GenClust and GAK, and that it is easy to find cluster centers which reveal people’s activities and reflect the city’s traffic information. In particular, the whole clustering process of NicheClust is smooth and it is easy to obtain a global optimal solution. In addition, the NicheClust algorithm can be used to maintain population diversity and obtain global convergence.

Table 4

Comparison of three GA-based clustering algorithms (the average values of 20 independent experiments) using the four fitness functions on cluster evaluation criteria: SP, SC, S, and RT for the three GPS location datasets

Fitness	Algorithm	Beijing taxicab location dataset					Shanghai taxicab location dataset					Nanjing taxicab location dataset
		Evaluation criteria (Euclidean distance)
		Clusters	SP	Stability	SC	RT (min)	Clusters	SP	Stability	SC	RT (min)	Clusters	SP	Stability	SC	RT (min)
SSE	NicheClust	100	0.2674	0.02016	0.8414	12.2547	98	0.1832	0.02016	0.9084	4.9076	54	0.0313	0.02360	0.9025	1.3388
	GenClust	132	0.2476	0.01130	0.8601	7.3618	107	0.1790	0.01427	0.9124	3.3415	84	0.0305	0.01640	0.9194	1.1555
	GAK	67	0.2420	0.01994	0.7939	21.6382	35	0.1840	0.03840	0.8244	4.6430	30	0.0311	0.03952	0.8397	2.0761
DBI	NicheClust	108	0.2926	0.02561	0.8416	29.0765	87	0.1814	0.02525	0.8899	9.6187	50	0.0311	0.02979	0.8944	2.1439
	GenClust	132	0.3246	0.02360	0.8077	17.8876	107	0.2101	0.02485	0.8156	9.2778	84	0.0298	0.02168	0.8501	2.5044
	GAK	67	0.2140	0.02192	0.7806	10.8425	35	0.1762	0.04405	0.7950	2.2038	30	0.0312	0.04522	0.8233	1.2482
PBM	NicheClust	25	0.2670	0.06075	0.6499	5.3550	36	0.1982	0.04190	0.8267	2.9788	17	0.0346	0.07798	0.7790	0.6152
	GenClust	2	0.3612	0.87598	0.5832	0.7183	2	0.5085	0.96786	0.2887	1.5998	2	0.0484	0.74377	0.4422	0.2462
	GAK	67	0.2534	0.02138	0.7753	6.8029	35	0.2198	0.04437	0.7464	1.3520	30	0.0349	0.04430	0.7636	0.6360
COSEC	NicheClust	70	0.3390	0.05883	0.8103	157.0639	70	0.1993	0.03656	0.8793	76.1466	54	0.0319	0.02672	0.9031	15.7305
	GenClust	119	0.3070	0.02535	0.8441	311.2628	93	0.2048	0.02710	0.8565	103.4404	83	0.0306	0.01886	0.8877	24.4396
	GAK	67	0.2290	0.02357	0.7802	200.7065	35	0.1896	0.04763	0.8073	27.6160	30	0.0319	0.01886	0.8312	9.6090

To compare the dissimilarity between the NicheClust, GenClust, and GAK clustering algorithms in terms of SC, SP, and stability, a nonparametric statistical method, the Wilcoxon rank sum test (WRST), is used according to [63]. The test determines whether the differences between the algorithms are significant when the GPS location data samples are independent. It has two outputs, $p$ and $h$ , which are used to determine whether two GPS location data samples have different distributions between the clustering algorithms under the different fitness functions. The results are presented in Table 5.

Table 5

The WRST test results of SC, SP, and stability between NicheClust, GenClust, and GAK over 20 independent experiments. A significance level of $\alpha=$ 0.05 (5%) was used

Taxi GPS dataset	Fitness function	Evaluation criteria	NicheClust vs. GenClust		NicheClust vs. GAK
			$p$	$h$	$p$	$h$
Beijing	SSE	SC	0.0087	1	0.0034	1
		SP	0.3873	0	0.0088	1
		Stability	5.9068 $\times$ 10 ${}^{-4}$	1	0.1846	0
	DBI	SC	0.0099	1	0.0079	1
		SP	5.3890 $\times$ 10 ${}^{-4}$	1	0.0056	1
		Stability	0.0125	1	0.1876	0
	PBM	SC	0.0066	1	5.3589 $\times$ 10 ${}^{-4}$	1
		SP	4.5633 $\times$ 10 ${}^{-4}$	1	0.0091	1
		Stability	1.0025 $\times$ 10 ${}^{-4}$	1	1.0011 $\times$ 10 ${}^{-4}$	1
	COSEC	SC	0.0223	1	0.0486	1
		SP	0.0315	1	5.8820 $\times$ 10 ${}^{-4}$	1
		Stability	4.6984 $\times$ 10 ${}^{-4}$	1	2.1180 $\times$ 10 ${}^{-4}$	1
Shanghai	SSE	SC	5.6381 $\times$ 10 ${}^{-4}$	1	0.0011	1
		SP	0.0042	1	0.4851	0
		Stability	5.6271 $\times$ 10 ${}^{-4}$	1	5.9971 $\times$ 10 ${}^{-4}$	1
	DBI	SC	0.0258	1	0.0097	1
		SP	0.0389	1	0.2694	0
		Stability	0.0892	0	5.2251 $\times$ 10 ${}^{-4}$	1
	PBM	SC	3.9921 $\times$ 10 ${}^{-5}$	1	0.0027	1
		SP	6.3328 $\times$ 10 ${}^{-5}$	1	0.0096	1
		Stability	1.001 $\times$ 10 ${}^{-5}$	1	0.0086	1
	COSEC	SC	0.0057	1	5.5529 $\times$ 10 ${}^{-4}$	1
		SP	0.0458	1	0.0008 ${}^{4}$	1
		Stability	4.1850 $\times$ 10 ${}^{-4}$	1	0.0015	1
Nanjing	SSE	SC	0.0059	1	0.0064	1
		SP	0.2351	0	0.4112	0
		Stability	5.7648 $\times$ 10 ${}^{-4}$	1	5.2877 $\times$ 10 ${}^{-4}$	1
	DBI	SC	5.2651 $\times$ 10 ${}^{-4}$	1	5.9827 $\times$ 10 ${}^{-4}$	1
		SP	0.0768	0	0.0426	1
		Stability	0.0010	1	5.5841 $\times$ 10 ${}^{-4}$	1
	PBM	SC	0.0462	1	5.1002 $\times$ 10 ${}^{-4}$	1
		SP	0.1568	0	0.0358	1
		Stability	0.0022	1	5.6082 $\times$ 10 ${}^{-4}$	1
	COSEC	SC	0.0055	1	5.1025 $\times$ 10 ${}^{-4}$	1
		SP	0.0092	1	0.0511	0
		Stability	0.0021	1	0.0023	1

In WRST, $h$ denotes the decision of the nonparametric statistical test and takes the value 0 or 1. If $h=$ 1, the null hypothesis is rejected, and if $h=$ 0, the test fails to reject the null hypothesis at the significance level $\alpha$ (typically equal to 5%). The value $p$ is the $p$ -value of the test under the null hypothesis that the results of the two compared clustering algorithms (NicheClust versus GenClust, or NicheClust versus GAK) come from the same distribution; as $p\to$ 0, the differences between the algorithms are more evident. Therefore, it seems that the overall difference in terms of the evaluation results of SC, SP and stability between the algorithms is significant. For example, in this paper, SC, SP, and stability are used to evaluate the clustering results of the taxi GPS datasets. When the DBI fitness function is used to handle GA-based clustering operations, both the $p$ -values (0.0099, 5.3890 $\times$ 10 ${}^{-4}$ , 5.5325 $\times$ 10 ${}^{-4}$ , 0.0125) and $h$ values ( $h=$ 1, 1, 1, 1) between NicheClust and GenClust indicate the rejection of the null hypothesis at the default 5% significance level. In addition, both $p>$ 5% and $h=$ 0 indicate that there is not sufficient evidence to reject the null hypothesis (e.g., when SSE is used with NicheClust and GenClust for the Beijing dataset, the SP comparison gives $p=$ 0.3873 and $h=$ 0). Therefore, the nonparametric statistical test results based on WRST indicate that NicheClust behaves differently from GenClust and GAK as a GA-based clustering algorithm.
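The way $p$ and $h$ arise can be illustrated with a small two-sided rank sum test based on the normal approximation. This is a sketch under assumptions, not the procedure used to produce Table 5: the function name `rank_sum_test` is hypothetical, tied values receive average ranks, and no continuity or tie correction is applied to the variance, so its $p$ -values can differ slightly from those of an exact test or a statistics package.

```python
import math


def rank_sum_test(x, y, alpha=0.05):
    """Two-sided Wilcoxon rank sum test (normal approximation); returns (p, h)."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values and give them the average rank.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    h = 1 if p < alpha else 0
    return p, h
```

In this setting, `x` and `y` would hold the 20 per-trial values of one criterion (for example SC) for the two algorithms being compared.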

6. Conclusions

Clustering usually provides a common means of identifying structure in a given data set. However, estimating the proper number of clusters, finding better cluster centers, and obtaining an optimal solution have been recognized as the most difficult problems in cluster analysis. In this paper, we have presented a novel automatic K-means clustering algorithm based on a niching genetic algorithm with hybrid K-means for clustering problems. The proposed algorithm can determine the number of clusters and the selection of initial seeds, and can avoid local convergence. The evidence from the experimental evaluation and statistical analysis indicates that NicheClust provides solutions of higher quality and better performance with equivalent computational resources for three GPS location datasets. However, NicheClust still has a high computational cost (see Table 4). A novel clustering algorithm based on fuzzy operators with noise will be developed in the future to reduce this cost.

Acknowledgments

This paper was supported by the Sichuan Science and Technology Program (2018GZ0177, 2016ZR 0129), the Research and Innovation Team of Universities and Colleges in Sichuan Province of China (15DT0039), Scientific Research and Innovation Team of Sichuan Tourism University (18SCTUTD06), the Provincial Discipline Open Platform Project of Xi Hua University (SZJJ2015-060).

References

1. Abido, A niched Pareto genetic algorithm for multiobjective environmental/economic dispatch, Int. J. Elec. Power 25 (2003), 97–105.

2. Abul Hasan M.J. and Ramakrishnan, A survey: hybrid evolutionary algorithms for cluster analysis, Artificial Intelligence Review 36 (2011), 179–204.

3. Agustı, Salcedo-Sanz, Jiménez-Fernández, Carro-Calvo, Del Ser and Portilla-Figueras J.A., A new grouping genetic algorithm for clustering problems, Expert Syst. Appl 39 (2012), 9695–9703.

4. Akodjènou-Jeannin M.-I., Salamatian and Gallinari, Flexible grid-based clustering, in: Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, Springer, 2007, pp. 350–357.

5. Arthur and Vassilvitskii, k-means++: The advantages of careful seeding, in: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.

6. Bai, Cheng, Liang, Shen and Guo, Fast density clustering strategies based on the k-means algorithm, Pattern Recognition 71 (2017), 375–386.

7. Bandyopadhyay and Maulik, An evolutionary technique based on K-means algorithm for optimal clustering in RN, Inform. Sciences 146 (2002), 221–237.

8. Beal, Viroli and Damiani, Towards a unified model of spatial computing, in: Proceedings of the 7th Spatial Computing Workshop (SCW 2014), AAMAS, 2014.

9. Brinkhoff, A framework for generating network-based moving objects, GeoInformatica 6 (2002), 153–180.

10. Celebi M.E., Kingravi H.A. and Vela P.A., A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Syst. Appl 40 (2013), 200–210.

11. Chang D.-X., Zhang X.-D. and Zheng C.-W., A genetic algorithm with gene rearrangement for K-means clustering, Pattern Recognition 42 (2009), 1210–1222.

12. Chang D.-X., Zhang X.-D., Zheng C.-W. and Zhang D.-M., A robust dynamic niching genetic algorithm with niche migration for automatic clustering problem, Pattern Recognition 43 (2010), 1346–1360.

13. Davies D.L. and Bouldin D.W., A cluster separation measure, IEEE. Trans. Pattern. Anal (1979), 224–227.

14. Deb and Goldberg D.E., An investigation of niche and species formation in genetic function optimization, in: Proceedings of the 3rd International Conference on Genetic Algorithms, Morgan Kaufmann Publishers Inc., 1989, pp. 42–50.

15. Deng, Zhao, Zou, Yang and Wu, A novel collaborative optimization algorithm in solving complex optimization problems, Soft Computing 21 (2017), 4387–4398.

16. Deng, Zhu, Huang and Du, A scalable and fast OPTICS for clustering trajectory big data, Cluster Comput 18 (2014), 549–562.

17. Fahad, Alshatri, Tari, Alamri, Khalil, Zomaya A.Y., Foufou and Bouras, A survey of clustering algorithms for big data: Taxonomy and empirical analysis, IEEE. Trans. Emer. Topi. Comput 2 (2014), 267–279.

18. Han, Pei and Kamber, Data mining: concepts and techniques, Elsevier, 2011.

19. Hancer and Karaboga, A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for determination of cluster number, Swar. Evolu. Comput 32 (2017), 49–67.

20. Hasan and Ukkusuri S.V., Urban activity pattern classification using topic models from online geo-location data, Transport. Res. C-Emer 44 (2014), 363–381.

21. Hung C.-C., Peng W.-C. and Lee W.-C., Clustering and aggregating clues of trajectories for mining trajectory patterns and routes, Vldb. J. – The International Journal on Very Large Data Bases 24 (2015), 169–192.

22. Jain A.K., Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31 (2010), 651–666.

23. Krishna and Murty M.N., Genetic K-means algorithm, IEEE. Trans. Syst., Man, Cybern., Part B (Cybernetics) 29 (1999), 433–439.

24. Lin H.-J., Yang F.-W. and Kao Y.-T., An efficient GA-based clustering technique, Tamkang. J. Sci. Eng 8 (2005), 113–122.

25. Liu and Shen, Automatic clustering using genetic algorithms, Applied Mathematics And Computation 218 (2011), 1267–1279.

26. Liang, Wang and Yuan, Exploring OD patterns of interested region based on taxi trajectories, J. Visual 19 (2016), 811–821.

27. Luo, Zheng and Ren, An improved DBSCAN algorithm to detect stops in individual trajectories, ISPRS Int. J. Geo-Inf 6 (2017), 63.

28. Martín, Alcalá-Fdez, Rosete and Herrera, NICGAR: a niching genetic algorithm to mine a diverse set of interesting quantitative association rules, Inform. Sciences 355 (2016), 208–228.

29. Maulik and Bandyopadhyay, Genetic algorithm-based clustering technique, Pattern Recognition 33 (2000), 1455–1465.

30. McCallum, Nigam and Ungar L.H., Efficient clustering of high-dimensional data sets with application to reference matching, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2000, pp. 169–178.

31. Mukhopadhyay and Maulik, Towards improving fuzzy clustering using support vector machine: Application to gene expression data, Pattern Recognition 42 (2009), 2744–2763.

32. R.T. and Han, CLARANS: A method for clustering objects for spatial data mining, IEEE Trans. Knowl. Data. Eng 14 (2002), 1003–1016.

33. Pakhira M.K., Bandyopadhyay and Maulik, Validity index for crisp and fuzzy clusters, Pattern Recognition 37 (2004), 487–501.

34. Pang L.X., Chawla, Liu and Zheng, On detection of emerging anomalous traffic patterns using GPS data, Data & Knowledge Engineering 87 (2013), 357–373.

35. Parent, Spaccapietra, Renso, Andrienko, Bogorny, Damiani M.L., Gkoulalas-Divanis, Macedo and Pelekis, Semantic trajectories modeling and analysis, ACM Comput. Surv (CSUR) 45 (2013), 42.

36. Patwary M.M.A., Palsetia, Agrawal, Liao W.-k., Manne and Choudhary, Scalable parallel OPTICS data clustering using graph algorithmic techniques, in: Proceedings of the 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), IEEE, 2013, pp. 1–12.

37. Patwary M.M.A., Palsetia, Agrawal, Liao W.-k., Manne and Choudhary

38. Pelekis, Kopanakis, Kotsifakos E.E., Frentzos and Theodoridis, Clustering uncertain trajectories, Knowl. Inf. Syst 28 (2011), 117–147.

39. Rahman and Islam, Seed-detective: A novel clustering technique using high quality seed for K-means on categorical and numerical attributes, in: Proceedings of the Ninth Australasian Data Mining Conference-Volume 121, Australian Computer Society, Inc., 2011, pp. 211–220.

40. Rahman M.A. and Islam M.Z., CRUDAW: a novel fuzzy technique for clustering records following user defined attribute weights, in: Proceedings of the Tenth Australasian Data Mining Conference-Volume 134, Australian Computer Society, Inc., 2012, pp. 27–41.

41. Rahman M.A. and Islam M.Z., A hybrid clustering technique combining a novel genetic algorithm with K-Means, Knowl. Based. Syst 71 (2014), 345–365.

42. Redmond S.J. and Heneghan, A method for initialising the K-means clustering algorithm using kd-trees, Pattern Recognition Letters 28 (2007), 965–973.

43. Sareni and Krahenbuhl, Fitness sharing and niching methods revisited, IEEE. Trans. Evolut. Comput 2 (1998), 97–106.

44. Scholz R.W. and Lu, Detection of dynamic activity patterns at a collective level from large-volume trajectory data, International Journal Of Geographical Information Science 28 (2014), 946–963.

45. Selim and Ismail, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE. Trans. Pattern. Anal (1984), 81–87.

46. Shekhar, Feiner and Aref W.G., From GPS and virtual globes to spatial computing-2020, GeoInformatica 19 (2015), 799–832.

47. Shekhar, Feiner S.K. and Aref W.G., Spatial computing, Commun. ACM 59 (2015), 72–81.

48. Sheng, Tucker and Liu, A niching genetic k-means algorithm and its applications to gene expression data, Soft. Comput 14 (2010), 9.

49. Spaccapietra, Parent, Damiani M.L., de Macedo J.A., Porto and Vangenot, A conceptual view on trajectories, Data & Knowledge Engineering 65 (2008), 126–146.

50. Wang, Liu, Ranjan and Chen, IK-SVD: dictionary learning for spatial big data via incremental atom update, Computing In Science & Engineering 16 (2014), 41–52.

51. Wei and Zhao, A niche hybrid genetic algorithm for global optimization of continuous multimodal functions, Applied Mathematics And Computation 160 (2005), 649–661.

52. Wilson M.W., Location-based services, conspicuous mobility, and the location-aware future, Geoforum, Journal of Physical, Human, and Regional Geosciences 43 (2012), 1266–1275.

53. Wilson M.W., Geospatial technologies in the location-aware future, J. Trans. Geogr 34 (2014), 297–299.

54. and Wunsch, Survey of clustering algorithms, IEEE. Trans. Neural. Networ 16 (2005), 645–678.

55. Yang and Wu, 10 challenging problems in data mining research, Int. J. Inf. Tech. Dec., Mak 5 (2006), 597–604.

56. Yang and Shen, Between morphology and function: How syntactic centers of the Beijing city are defined, J. Urba. Manage 4 (2015), 125–134.

57. Yuan, Zheng, Xie and Sun, T-drive: Enhancing driving directions with taxi drivers’ intelligence, IEEE Trans. Knowl. Data. Eng 25 (2013), 220–232.

58. Yuan N.J., Zheng, Zhang and Xie, T-finder: A recommender system for finding passengers and vacant taxis, IEEE Trans. Knowl. Data. Eng 25 (2013), 2390–2403.

59. Zhang and Wang, POI recommendation through cross-region collaborative filtering, Knowl. Inf. Syst 46 (2016), 369–387.

60. Zhang and Zhou, A novel clustering algorithm combining niche genetic algorithm with canopy and K-means, in: 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), IEEE, 2018, pp. 26–32.

61. Zhao, Shi, Liu and Fränti, A grid-growing clustering algorithm for geo-spatial data, Pattern Recognition Letters 53 (2015), 77–84.

62. Zheng, Liu, Yuan and Xie, Urban computing with taxicabs, in: Proceedings of the 13th International Conference on Ubiquitous Computing, ACM, 2011, pp. 89–98.

63. Zhou, Shen, Miao, Zhang and Gong, An automatic k-means clustering algorithm of GPS data combining a novel niche genetic algorithm with noise and density, ISPRS Int. J. Geo-Inf 6 (2017), 392.
