Abstract
With the rapid development of the urban-centered economy, urban areas have undergone strong but heterogeneous sprawl. In such complex urban systems, it is impossible to establish night-school teaching centers for continuing education programs in every district of a city. Part-time students tend to study at popular locations in the city because of convenience. Since the call logs and geographical nature of mobile phone data provide an opportunity to measure human behavior and social dynamics, we investigate how to infer popular urban locations with a large-scale quasi-social network, thereby avoiding the limitations of data collection and privacy problems. The large-scale quasi-social network model is built by measuring the number of shared users between zones, which differs from previous social network models. We first verify whether this model can also reveal the social structure of the given data by ranking the places in the model with the eigenvector centrality metric. To understand the connections between popular locations of human activity and spatial structure, we present a method to infer the core zones in a given region, and we then use a simple metric to evaluate the most popular locations of human activity.
Introduction
Since smart cities have been embraced by many countries in recent years, urban spatial structure [1, 2], measured by the degree of spatial concentration of population and employment, has meaningful applications in a wide variety of fields, including urban planning [3, 4], public transport planning [5–7], and location-based recommendation [8, 9]. Because insight into a city’s spatial structure allows the organization, functions, and roles of different areas of the city [10] in people’s lives to be described more accurately, the study of urban spatial structure offers great potential for location-based mobile applications such as WeChat and Groupon to improve the user experience of existing services. In addition, understanding the spatial concentration of population and employment helps urban planners optimize the value of existing infrastructure in the city [11]. In particular, understanding popular urban locations helps educators establish teaching centers for popularizing continuing education programs.
Two key features of urban spatial structure are core zones and hot locations, defined respectively as the sets of popular locations where people like to go and the most popular location among the different core locations in a given area. Consequently, inferring the core zones and hot locations in a city is an important research goal.
Since urban spatial structure, which consists of these places, represents the spatial concentration of various types of activities generated by individual people, a general approach to inferring core zones or hot locations could rely on simple statistical metrics. For example, the total number of individuals or human activities near a given place is a direct index for ranking the top destinations for individuals in the area. Nonetheless, because of the limitations of data collection and privacy problems, the available data lack completeness and credibility, which makes the ranking results appear subjective. Place entropy [12, 13], which represents the diversity of individuals at a given place, can also be used to rank popular places. However, because of the same lack of available data, whether place entropy can infer core zones and hot spots remains unknown. More specifically, one remaining challenge is to infer the core zones and hot locations in fixed partitions of an entire region by modeling limited data, thereby avoiding privacy problems. For example, telecom providers and location-based commercial application providers have strong incentives to partition the urban spatial structure and then discover the core zones and hot spots in the partitions for their business.
Another challenge is to infer the hot locations in different partitions dynamically. For example, a cellphone user may want to search on his mobile phone for the most popular place within a 1000-meter radius of his current location. However, public data generated from regional demographic surveys usually provide only a static perspective for ranking hot places, whereas service providers need a dynamic approach to calculate the hottest place in a selected dataset. Consequently, extrapolating the hot spots formed by human activity patterns across a given dynamic region requires quantitative models that shape local activity patterns from as few types of available data as possible.
A set of studies has examined the links between urban spatial structure and human activity patterns. In [14], the results imply that community formation between locations in a mobile phone communication network is related to geographic context, including social structure, wealth distribution, economic production, and land use. In [1, 15], the urban spatial-temporal structure was analyzed using activity-based travel survey data. These studies reveal that spatial proximity can promote community formation not only in networks of individual people but also in large-scale social networks in which each node represents a population of people at a given location. Moreover, individual attributes such as homophily and focus constraints in location-based social networks [16] and the communication interactions among mobile towers [17] have been used to shape the edges of large-scale social networks in which each node represents co-locations between users or mobile towers [18–20].
The methodology
Dataset
We take Wuchang, a district of Wuhan, as an example to begin the study of inferring hot urban locations. Wuhan is the most populous city in Central China and is sometimes referred to as the “Chicago of China”. Wuchang District is one of the seven central districts that were merged into Wuhan. In our research, an anonymous dataset was used to evaluate the shape of human activity patterns. This dataset, which accounts for approximately 25% of the population and was collected by a Chinese telecom operator, is composed of the lists of users who called from 219 mobile phone towers (Fig. 1) in Wuchang District, which covers about 20% of the total land area of Wuhan. As shown, the 219 mobile phone towers are distributed over most regions of Wuchang District, an area of about 3.5 square miles. There are five known central business districts (CBDs) in this area: 1 represents Zhongnan, 2 is Jiedaokou, 3 is Huquan, 4 is Luxiang, and 5 is Jinronggang. Many functional areas such as university campuses and residential settlements are located among these CBDs. As common sense suggests, heavy traffic can be expected among the mobile phone towers in this area.
The dataset contains about 36 million (35,610,100) mobile phone communications that occurred over 2 weeks between August 2012 and September 2012. For each call, the tower used by the phone initiating the call and the tower used by the phone receiving the call were recorded. The records also contain the ID of the mobile phone tower and the call time. As mentioned in the previous section, the communication traffic records of towers reflect, to some extent, human activity patterns in the spatial dimension. Since these records do represent some types of individual activity, we first consider two basic characteristics of the communication traffic record for each tower during the 2 weeks, namely the total number of communication records and the total number of mobile phone users. As shown in Fig. 2, there is a clear positive correlation between the two characteristics. The vertical axis presents the total number of communication records (C1, blue line) and the total number of mobile phone users (C2, red line). About 8 towers have higher levels than the other towers. More specifically, Table 1 shows that the top 8 towers by number of users overlap with the top 8 towers by number of calls, and these towers are located in different CBDs such as Zhongnan, Jiedaokou, and Luxiang. These characteristics are consistent with reason and common sense.
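The two basic characteristics can be sketched in a few lines. The following is a minimal illustration, assuming each record has been reduced to a hypothetical (tower_id, user_id) pair; the real records also carry the receiving tower and the call time, which are ignored here, and the toy data are invented for illustration only.

```python
from collections import defaultdict

def tower_activity(records):
    """Return (C1, C2): calls per tower and distinct users per tower."""
    calls = defaultdict(int)
    users = defaultdict(set)
    for tower_id, user_id in records:
        calls[tower_id] += 1
        users[tower_id].add(user_id)
    c1 = dict(calls)
    c2 = {tower: len(u) for tower, u in users.items()}
    return c1, c2

def top_k(values, k):
    """Set of the k towers with the largest value."""
    return set(sorted(values, key=values.get, reverse=True)[:k])

# Toy records (hypothetical), one pair per call.
records = [
    ("A", "u1"), ("A", "u2"), ("A", "u1"),
    ("B", "u3"),
    ("C", "u1"), ("C", "u4"), ("C", "u5"), ("C", "u4"),
]
c1, c2 = tower_activity(records)
overlap = top_k(c1, 2) & top_k(c2, 2)
```

Comparing `top_k(c1, k)` with `top_k(c2, k)` mirrors the overlap check reported for the top 8 towers.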
User diversity and Eigenvector Centrality of the locations
We also consider a potentially better metric to discover the popular locations in this area. In [21], the author proposes that the entropy of “venues” [22] can be used to capture the user diversity generated by the different visitors to those venues. We therefore calculate the entropy e of each mobile phone tower from the communication traffic records in a similar way.
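A minimal sketch of this calculation follows. It assumes the usual Shannon form of place entropy, e = -Σ_u P(u) log P(u), where P(u) is taken to be the fraction of a tower's records generated by user u; this estimate of P(u), the function name, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def place_entropy(user_ids):
    """Shannon entropy of the distribution of one tower's records over users."""
    counts = Counter(user_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log(n / total) for n in counts.values())

# Toy data (hypothetical): the user behind each record at three towers.
records_by_tower = {
    "tower_a": ["u1", "u2", "u3", "u4"],   # four equally active users
    "tower_b": ["u1", "u1", "u1", "u2"],   # dominated by one user
    "tower_c": ["u5", "u5"],               # a single user
}
entropy_by_tower = {t: place_entropy(u) for t, u in records_by_tower.items()}
ranking = sorted(entropy_by_tower, key=entropy_by_tower.get, reverse=True)
```

A tower visited evenly by many users receives a high entropy, while a tower dominated by a single user receives an entropy close to zero.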
Since the co-users of mobile phone towers can represent the social interaction among the places corresponding to those towers, we define a large-scale quasi-social network whose node set is the set c of all mobile phone towers. The edge between nodes i and j is determined by comparing their number of co-users with a threshold p.
The threshold p is chosen from the distribution of the co-user numbers over all pairs of nodes.
Therefore, this undirected binary network has been developed to infer the popular locations in this dataset via shared individuals (co-users). We chose the threshold p = 2000 and thus obtained the adjacency matrix of this binary network. We analyzed the basic properties of this undirected binary network under different values of p to assess the validity of the subsequent network-based analysis. When p = 2000, the mean degree (MD) is 58.174 and the maximum modularity is about 0.16-0.17 (when p = 500, 1000, and 1500, the corresponding MD is 137, 98, and 74.8, and the corresponding maximum modularity is about 0.07, 0.11, and 0.14). Furthermore, we also considered the undirected weighted network in which the edge weights are determined by the number of co-users between nodes. We found that the edge weights in the network structure did not have a strong influence on the performance of the community partition.
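The construction of the binary network can be sketched as follows. This is an illustration under stated assumptions: two towers are linked when their number of co-users is at least p (whether the comparison is strict is an assumption), and the function names and toy user sets are hypothetical.

```python
from itertools import combinations

def build_quasi_social_network(users_by_tower, p):
    """Undirected binary network: link two towers sharing at least p users."""
    towers = sorted(users_by_tower)
    adjacency = {t: set() for t in towers}
    for a, b in combinations(towers, 2):
        co_users = len(users_by_tower[a] & users_by_tower[b])
        if co_users >= p:  # assumption: non-strict threshold
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def mean_degree(adjacency):
    """Average number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adjacency.values()) / len(adjacency)

# Toy data (hypothetical): the set of users seen at each tower.
users_by_tower = {
    "t1": {1, 2, 3, 4},
    "t2": {2, 3, 4, 5},
    "t3": {4, 5, 6},
    "t4": {7},
}
network = build_quasi_social_network(users_by_tower, p=2)
md = mean_degree(network)
```

Raising p removes weak links, which lowers the mean degree, in line with the trend reported for p = 500 to 2000.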
Since there will always be some noise (e.g., the activity of delivery workers) and data limitations as mentioned above, we can use several well-known measures from social network theory to discover the hot locations in a given area based on the proposed quasi-social network model.
An intuitive metric is eigenvector centrality [23], an effective measure of the influence of a node in a network (especially an undirected network). It assigns relative scores to all nodes in the network based on the idea that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. Google’s PageRank algorithm is a variant of the eigenvector centrality measure. In our large-scale quasi-social network, the top-ranking nodes evaluated via eigenvector centrality represent the relatively popular locations that have more active calls [24, 25].
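Eigenvector centrality can be approximated by power iteration. The sketch below is a generic illustration, not the Gephi routine used for the reported results: it iterates with A + I rather than A (the eigenvectors are the same, but the shift avoids oscillation on bipartite graphs), scales the largest score to 1, and runs on a hypothetical toy graph.

```python
def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Power iteration on A + I for an undirected graph given as neighbour sets."""
    nodes = sorted(adjacency)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Each node keeps its own score and adds the scores of its neighbours.
        new = {n: score[n] + sum(score[m] for m in adjacency[n]) for n in nodes}
        largest = max(new.values())
        new = {n: v / largest for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score

# Toy graph (hypothetical): a hub linked to a, b, c, plus an edge between a and b.
graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
}
scores = eigenvector_centrality(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the hub ranks first and the node attached only to the hub ranks last, which illustrates how the score depends on the scores of a node's neighbours rather than on degree alone.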
Core zones and popular locations
Furthermore, a feasible approach to inferring the hot locations of different zones in the area attracts our attention. First, modularity maximization algorithms [26] for detecting communities in social networks are applied to divide the given area into a certain number of areas (about 3–7). Each area can be identified as a cluster of locations that have more social interaction among themselves, since they share a sufficiently large number of co-users in the large-scale quasi-social network.
The modularity Q can be written in the following form:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
where A_{ij} is the weight of the edge between nodes i and j, k_i is the degree of node i, c_i is the community membership of node i, δ(u, v) is a Boolean value equal to 1 if u = v and 0 otherwise, and m is the total number of edges in the network.
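For a binary network, the modularity of a given partition can be evaluated directly. The sketch below assumes the standard form Q = (1/2m) Σ_ij [A_ij − k_i k_j / (2m)] δ(c_i, c_j); it only scores a partition and does not perform the maximization itself. The toy graph and function name are hypothetical.

```python
def modularity(adjacency, membership):
    """Modularity Q of a partition of an undirected binary network."""
    m = sum(len(nbrs) for nbrs in adjacency.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    degree = {n: len(nbrs) for n, nbrs in adjacency.items()}
    q = 0.0
    for i in adjacency:
        for j in adjacency:
            if membership[i] != membership[j]:
                continue  # delta(c_i, c_j) = 0
            a_ij = 1.0 if j in adjacency[i] else 0.0
            q += a_ij - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)

# Toy graph (hypothetical): two triangles joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adjacency = {n: set() for n in range(6)}
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

q_two_groups = modularity(adjacency, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
q_one_group = modularity(adjacency, {n: 0 for n in range(6)})
```

Splitting the toy graph into its two triangles gives Q = 5/14, whereas placing every node in one community gives Q = 0.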
Then, to infer the core zones in each activity area under different partitions, we propose an improved modularity-based approach that repeatedly calculates the partitions of activity areas under various parameters. The approach can be described as follows:
Given a partition c_x for the x-th run, the total number of activity areas is k_x = |c_x|. After executing the SA method [27] X times, we obtain a membership matrix M = K × X, where K = max(k_x), x ∈ X. Its element M_ij is the membership assigned to tower i in the j-th run, and M_ij < K.
We further obtain a sequence of sets of core zones, where C is the total number of location sets. For a set of core zones,
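The exact rule that defines a core zone is not recoverable from the text here, so the following sketch rests on one plausible assumption: a core zone is a group of towers that are assigned to the same activity area in every one of the X runs. Under that assumption, grouping towers by their full row of the membership matrix is sufficient, because two towers share a row exactly when they are co-assigned in every run. The function name, the `min_size` parameter, and the toy matrix are hypothetical.

```python
from collections import defaultdict

def core_zones(membership_matrix, min_size=2):
    """Group towers whose community labels agree in every run (assumed rule)."""
    groups = defaultdict(list)
    for tower, labels in membership_matrix.items():
        groups[tuple(labels)].append(tower)
    zones = [sorted(g) for g in groups.values() if len(g) >= min_size]
    # Largest zones first, ties broken by tower names.
    return sorted(zones, key=lambda z: (-len(z), z))

# Toy membership matrix (hypothetical): rows are towers, columns are X = 3 runs.
M = {
    "t1": [0, 0, 1],
    "t2": [0, 0, 1],
    "t3": [0, 1, 1],   # leaves the group of t1 and t2 in the second run
    "t4": [1, 2, 0],
    "t5": [1, 2, 0],
    "t6": [1, 2, 0],
}
zones = core_zones(M)
```

Towers such as `t3`, whose assignment changes between runs, fall outside every core zone in this reading.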
Finally, we use a new metric called KCM, based on k-means and cosine similarity, to verify the above analysis and to identify the popular locations more clearly. Specifically, cosine similarity provides a reasonable measure of the similarity of human activity between two places.
Given any two towers i and j, let U_ij = U_i ∩ U_j denote their co-users. The cosine similarity between the two towers is computed from the probabilities that each user u_k is active on each tower.
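A minimal sketch of such a cosine similarity follows. It assumes that the probability of a user being active on a tower is estimated as that user's share of the tower's calls, and that each tower is represented by its vector of these probabilities; only co-users contribute to the dot product. The function names and toy call lists are hypothetical.

```python
import math
from collections import Counter

def activity_probabilities(calls):
    """Share of a tower's calls made by each user (assumed probability estimate)."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {user: n / total for user, n in counts.items()}

def cosine_similarity(p_i, p_j):
    """Cosine of the angle between two towers' user-activity vectors."""
    co_users = p_i.keys() & p_j.keys()
    dot = sum(p_i[u] * p_j[u] for u in co_users)
    norm_i = math.sqrt(sum(v * v for v in p_i.values()))
    norm_j = math.sqrt(sum(v * v for v in p_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

# Toy data (hypothetical): the user behind each call at three towers.
tower_x = activity_probabilities(["u1", "u1", "u2", "u3"])
tower_w = activity_probabilities(["u1", "u4"])
tower_z = activity_probabilities(["u4", "u5"])

sim_xx = cosine_similarity(tower_x, tower_x)
sim_xw = cosine_similarity(tower_x, tower_w)
sim_xz = cosine_similarity(tower_x, tower_z)
```

Towers with no co-users have similarity 0, and identical activity profiles have similarity 1.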
For each partition, the sequence of popular locations can be obtained by the following rules:
Results
Although the location of a subject was collected only when the subject was connected to the cellular network, location traces from mobile phone data have been shown to be a reasonable proxy for individual human mobility. We first check whether the entropy of each location, measured by the diversity of visitors to the corresponding mobile phone tower, can indicate the hot locations in this region. As shown in Fig. 3 and Table 2, the ranking of popular locations by venue entropy is similar to the rankings by the two basic characteristics C1 and C2, although there are slight differences between these measurements. Given the principle of place entropy, we hold that the level of place entropy depends to a large extent on the popularity of the corresponding location. However, the lack of data caused by the market share of our mobile phone operator and by subjects’ call plans inevitably limits the usefulness of place entropy.
Therefore, applying eigenvector centrality to this binary network in Gephi [28], we found that the top-ranking hot locations were almost exactly the same as those obtained via the previous three metrics. Furthermore, as Fig. 4 shows, the towers corresponding to the top-ranking hot regions are distributed mostly near three popular CBDs, namely Zhongnan, Jiedaokou, and Luxiang. However, as described in the above section, this region has five known CBDs that can be further identified. The different areas in the region have their respective hot locations.
Based on our large-scale quasi-social network, the different partitions of activity areas were calculated using modularity maximization algorithms. More specifically, we calculated the activity areas under various values of the parameter K; the results are shown in Fig. 5. The partitions of activity areas with K = 3, 5, and 7 effectively reflect the distribution of activity at different scales in this region. In Fig. 5.1, the green dots cover the activity area near Luxiang and Jinronggang, the red dots cover the activity zone near Huquan, and the blue dots cover the activity zone near Zhongnan and Jiedaokou. At higher scales, the partitions of activity zones presented in Figs. 5.2 and 5.3 also imply that, to some extent, the activity areas composed of different regions form around core regions such as CBDs.
Using the above approach, we calculate the core zones in this region, as shown in Fig. 6. The core zones are distributed essentially within the range of the five CBDs. In general, popular locations such as CBDs would be considered a popular spot in every activity zone; that is to say, the residents of an activity zone have some locations that hold their interest, for purposes such as shopping and entertainment.
Discussions
Since mobile phones provide an opportunity to measure human behavior and social dynamics, as previous works have shown, call logs and location traces allow researchers to undertake large-scale objective studies of social phenomena.
Here, we investigate how to infer popular urban locations with a large-scale quasi-social network. The large-scale quasi-social network model is built by measuring the number of shared users between zones, which differs from previous social network models. We therefore first verify whether this model can also reveal the social structure of the given data by ranking the places in the model with the eigenvector centrality metric. To understand the connections between popular locations of human activity and spatial structure, we present a method to infer the core zones in a given region, and we then use a simple metric to evaluate the most popular locations of human activity. As a case study, the results can be used to plan the teaching centers of night schools in any city.
The first limitation of our study is the correlation between the call logs of a connected cell tower and the human activities that occurred in the cell tower’s coverage region. Generally, this correlation is affected by the geographical location of the cell tower, since the market share of the given telecom operator differs across cell towers’ coverage regions. However, we hope that the considerable scale of our aggregation of call activities and the geographical nature of the collected data compensate for this.
Another inevitable limitation is the foundation of this research. Our original intention was to develop an acceptable method that uses limited types of individual activity data, thereby avoiding privacy problems, to assess popular urban locations. Such extreme cases are relatively rare, and many equally valid approaches can obtain the same results based on various types of activity data. However, we believe that the significance of the presented study will rise in the future, since it provides a new perspective for inferring popular urban locations by using a large-scale quasi-social network. More specifically, we believe that modeling the large-scale quasi-social network requires less stringent social interaction among nodes, which can improve the practical value of our approach.
