Abstract
Fuzzy c-means is one of the most popular partitional clustering. However, it has the shortcoming that it is sensitive to initial centers and noises. Density-based clustering algorithm overcomes this shortcoming, but cannot obtain the better clustering results when the density of data space has uneven distribution. Grid-based method is advantageous to save computational time, but the clustering performance was unsatisfied. Based on the above analysis, the improved FCM algorithm based on initial center optimization method is proposed. First, the initial center optimization method based on density and grid is presented to avoid the sensitivity of FCM to initial centers. Then, improved FCM algorithm based on initial center optimization method is proposed. Finally, the performance and effectiveness of the proposed clustering algorithm is evaluated by 4 San Francisco taxi GPS cab mobility traces data sets, and the experimental results show that the proposed algorithm has better clustering results.
Introduction
In the traffic system, we analyze cab mobility traces to provide support for the gas station’s location, and optimize the traffic operation system. The cab mobility traces should be analysis by using the data mining technology, and according to obtained information the reasonable setting of the gas station’s location which make cab mobility has the least the time and cost. Due to the cab mobility traces is a continuous trajectory, the density-based clustering methods [1–8] will classify the whole trace into a cluster, therefore density-based clustering cannot find the reasonable cluster center (the possible positions of gas station). The Fuzzy c-means (FCM) [9–16] is based on objective function, and has the advantages of the low time complexity and easy to implement. Therefore, FCM has extensive application, especially in the field of traffic system. However, FCM clustering method has some shortcomings: The users should input the number of clusters c advanced. The clustering algorithm is sensitive to the initial cluster centers. The different clustering results by using FCM with different initial centers are shown as Fig. 1. The clustering algorithm is sensitive to the noise. The clustering results by using FCM and the real results are shown as Fig. 2.

The clustering results of different initial centers by using FCM.

The clustering results with a noise.
In order to overcome the shortcomings, the improved FCM algorithm based on initial center optimization method is proposed. First, the data space is divided into some scattered data sets according to the location of the data points by using grid-based method. Then, the data points are endowed with different weights based on the distances to the center. Thirdly, the cells are merged based on the density and location of cells until the number of cells reaches a specified number. The initial centers of FCM are obtained by calculating the centers of the final cells. The obtained centers as the initial centers overcome the drawback of FCM that is sensitive to the initial value. The performance and effectiveness of the proposed clustering algorithm is evaluated by 4 San Francisco taxi GPS cab mobility traces data sets, and the experimental results show that the proposed algorithm has better clustering results.
The rest of this paper is organized as follows. Section 2 introduces fuzzy c-means clustering. Section 3 proposed the initial center optimization method based on density and grid. Section 4 presents improved FCM algorithm based on initial center optimization method, and Section 5 reports the experimental results. Finally, Section 6 concludes this work.
Fuzzy c-means (FCM) classifies the dataset X = {x1, …, x N } into c data clusters as fuzzy sets (F i ). The objective of FCM is to accurately represent pattern matrix U = [μ ij ] and cluster centers V = [v i ] via a minimizer (U, V) of the evaluation function J m .
The algorithm applies an optimization iteration to minimize the evaluation function J
m
, while pattern matrix U and cluster centers V are updated. The update formulas of the values of U and V are derived easily by Lagrange multipliers as following equations,
An abstract procedure of this process is listed in Algorithm 1.
Dataset X = {x1, x2, …, x N }, the number of clusters c, a certain threshold ɛ, fuzzy factor m
A pattern matrix U and cluster centers V
1: Initialize fuzzy partition matrix U i = [μ ij ] for i = 1, 2, …, c such that Σμ ij = 1;
2: objective function objfcn = realmax;
3: error = realmax;
4:
5: Calculate the cluster centers v i by
6:
7: Update the fuzzy membership μ ij by
8:
9: Calculate new value of objective function newobjfcn by
10:
11: error = abs (newobjfcn - objfcn);
12: objfcn = newobjfcn;
13:
Density-based clustering put all density-connected data and other data which are within this range into a cluster so as to cluster. Due to the cab mobility traces is a continuous trajectory, the density-based clustering cannot be used directly to clustering the cab mobility traces. This kind of clustering methods will classify the whole trace into a cluster, therefore density-based clustering cannot find the reasonable cluster center (the possible positions of gas station).
Grid-based clustering [17–24] partitions the data space into a finite number of cells, and creates a grid structure. Then the cluster centers can be obtained according to the density of cells. The processing unit is no longer a data point but an operational data set, therefore the computational complexity is significantly reduced by using this kind of clustering methods. However, this kind of methods have a known shortcoming is that the quality and accuracy of clustering is low.
Based on the above analysis, this paper presents an initial center optimization method that combines the density-based method with grid-based method. The partition method overcomes the shortcomings of density-based method’s limitations and grid-based method’s low accuracy rate. First, the data space is divided into some scattered data sets according to the location of the data points by using grid-based method. Then, the data points are endowed with different weights based on the distances to the center. Thirdly, the cells are merged based on the density and location of cells until the number of cells reaches a specified number. The initial centers of FCM are obtained by calculating the centers of the final cells. The obtained centers as the initial centers overcome the drawback of FCM that is sensitive to the initial value. The next will be introduced some basic definitions of the method.
cell
i
∩ cell
j
= Φ, 1 ≤ i ≠ j ≤ m; cell1 ∪ cell2 ∪ … ∪ cell
m
= D.
Data set X = {x1, x2, …, x n }, the number of clusters c
A pattern matrix U and cluster centers V
1: Apply grid-based method to partition the data space into m cells;
2:
3: Calculate the center of the cell by Eq. (4);
4: Calculate the weight of data in the cell by Eq. (5);
5: Calculate the density of the cell by Eq. (6);
6:
7: Let the last density cells be the noise cells;
8: Calculate the similarity degree between arbitrary two adjacent cell by Eq. (7)-Eq. (9);
9:
10: Merge the two cells which have the maximum similarity;
11: Calculate the similarity degree between the new cell with other adjacent cell by Eq. (7)-Eq. (9);
12:
13: Calculate the centers of the final cells as c initial centers;
14: Apply the FCM algorithm to classify the given data set.
From the above we can see that the value of SD (cell i , cell j ) is bigger, the more density similar of the two cells, and the nearer the value of SL (cell i , cell j ) is to 1, the closer the cell is to one other. Therefore, the bigger the value of S (cell i , cell j ) is, the more similar of the two cells.
The main shortcoming of FCM is that it is sensitive to the initial value which caused the algorithm is easy to fall into local optimum. In order to overcome this shortcoming, the improved FCM algorithm based on initial center optimization method is proposed. The abstract procedure of the algorithm is listed in Algorithm 2, and the next will be described the detail steps of the algorithm.
Let [25] be the number of cells, and use equidistance parallel horizontal and vertical to divide the data space. Create the grid structure, i.e., partition the data space into m cells.
Experimental results
In this section, the performance and effectiveness of the proposed clustering algorithm is evaluated by the real data sets. The data sets include 4 San Francisco taxi GPS cab mobility traces data sets, named as Tra1, Tra2, Tra3, and Tra4. The cab mobility traces data sets includes 23495 data points, 5454 data points, 21962 data points, 22792 data points, respectively. The next will compare the proposed algorithm with FCM to cluster these data sets.
For the four data sets, FCM algorithm run 100 times with the number of clusters c equals 5, and calculated the average of the clustering results. The proposed algorithm classifies the same data sets, and compares the clustering results with the FCM. The following Tables 1–4 are comparisons of the clustering results of FCM and the proposed algorithm. Tables 1–4 select 5 set clustering results by FCM for data sets Tra1, Tra2, Tra3, and Tra4, respectively. Then, we compare the average of the 5 set clustering results with the proposed algorithm. From Table 1–4, we can conclude that the clustering centers find by the proposed algorithm is relatively stablethan FCM.
The clustering centers of FCM and the proposed algorithm for Tra1 data set
The clustering centers of FCM and the proposed algorithm for Tra1 data set
The clustering centers of FCM and the proposed algorithm for Tra2 data set
The clustering centers of FCM and the proposed algorithm for Tra3 data set
The clustering centers of FCM and the proposed algorithm for Tra4 data set
Figures 3–6 show the clustering results by using the proposed algorithm and FCM for data sets Tra1, Tra2, Tra3, and Tra4, respectively. The data points in figures are the cab location information at different times. The figure’s horizontal axis and vertical axis are the latitude and longitude of the cab location, respectively. Figure 3 shows that the clustering results of the proposed algorithm merge some segmentation clusters, and the connected data points are assigned to the same cluster. Figures 4–6 also show the same conclusion. Therefore, from Figs. 3–6, we can see that the clustering results more reasonable, and more in line with the actual situation than FCM. Therefore, we can conclude that the proposed algorithm has better clustering results than the FCM algorithm.

The clustering results by using the proposed algorithm and FCM for Tra1 data set.

The clustering results by using the proposed algorithm and FCM for Tra2 data set.

The clustering results by using the proposed algorithm and FCM for Tra3 data set.

The clustering results by using the proposed algorithm and FCM for Tra4 data set.
The experimental results show that the proposed algorithm has the following advantages: From Tables 1–4, we can see that the average of the clustering centers obtained by FCM and the centers of the proposed algorithm are nearly equal. Therefore, the proposed algorithm based on initial center optimization method is more stable than FCM. From Figs. 3–6, we can see that the proposed algorithm assigns the connected data to the same cluster. Therefore, the clustering centers find by the proposed algorithm more matches the actual situation. From Figs. 3–6, we can see that the clustering results is not affected by the noise. Therefore, the proposed algorithm is not sensitive to the noises. From Figs. 3–6, we can see that the clustering results of the proposed algorithm have higher clustering accuracy. Therefore, the proposed algorithm overcomes low quality and accuracy shortcomings of the grid-based method.
This paper presents an initial center optimization method that combines the density-based method with grid-based method. The partition method overcomes the shortcomings of density-based method’s limitations and grid-based method’s low accuracy rate. First, the data space is divided into some scattered data sets according to the location of the data points by using grid-based method. Then, the data points are endowed with different weights based on the distances to the center. Thirdly, the cells are merged based on the density and location of cells until the number of cells reaches a specified number. The initial centers of FCM are obtained by calculating the centers of the final cells. The obtained centers as the initial centers overcome the drawback of FCM that is sensitive to the initial value. The performance and effectiveness of the proposed clustering algorithm is evaluated by 4 San Francisco taxi GPS cab mobility traces data sets, and the experimental results show that the proposed algorithm has better clustering results.
