Abstract
Recognizing urban functions is crucial for understanding urban spatial structures and urban planning. Previous work has investigated urban functions based on human activities that were derived from mobile phone positioning data, check-in data, taxi data, etc. However, urban functions can only be comprehensively sensed from both human activities and the physical environment together. To do so, a deep learning method was proposed to predict urban functions by integrating social media data and street-level imagery. The verbs extracted from social media posts were taken as the proxy for human activities, and we identified urban physical environmental information from street-level imagery. Then urban functions were uncovered from both the verbs in terms of human activities and street-level imagery from the perspective of the physical environment. Twelve types of urban function were recognized by verbs in social media posts, which were then improved by integrating street-level imagery within the 5th Ring Road of Beijing, China. The experiment demonstrated that verbs as direct proxies for human activities can avoid noise, and the multi-source data integration eliminated biases caused by a single data source. This work provides a comprehensive understanding of urban structure and dynamics for urban management and planning.
Introduction
An urban space, where citizens gather, live, and socialize, presents functional dynamics as various human activities take place (Janowicz, 2012; Tu et al., 2017; Zhong et al., 2014). The functional dynamics are depicted by urban functions, which are the spatial aggregation of similar human activities (Crooks et al., 2015). Specific human activities result in particular urban functions of a place, such as work, entertainment, residential areas, industrial areas, etc. (Wu et al., 2018). The structure and physical environment of a city, including the streets, buildings, facilities, and even their layouts, enable and drive human activities. These activities, in turn, shape the urban form (Smith and Crooks, 2010). The in-depth understanding of urban functions will benefit traffic congestion management, public services, and smart urban planning (Ahas et al., 2015).
Before the arrival of the big data era, urban functions were extracted from field surveys and remote sensing (Jiang et al., 2012). Remote sensing is an important approach for capturing physical environmental information in which land use classification is one of the main tasks (Gong and Howarth, 1990; Hu and Wang, 2013). The emergence of social sensing brings unprecedented opportunities to investigate urban functions comprehensively, both from the perspective of human activities and the urban physical environment (Liu et al., 2015). Personal location trajectories have been employed for urban function recognition (Ahas et al., 2015; Calabrese et al., 2015; Pei et al., 2014). Similarly, urban land-use was classified from traffic source–sink areas based on taxi data (Liu et al., 2012). Functional regions were discovered by the co-occurrence patterns of POI types in POI data sets (Gao et al., 2017). Furthermore, the proportions of each kind of urban function were determined by decomposing the temporal signatures of POI check-ins (Wu et al., 2018). Spatiotemporal activities and activity potentials in urban spaces were also identified from social media and web data (Fu et al., 2018; Van Weerdenburg et al., 2019).
These studies recognized urban functions only from the perspective of human activities. However, both the built environment and human activities characterize and reveal urban functions. The former provides static properties of places, while the latter generates their dynamic behavior. Compared with the advantages of remote sensing images in large-scale earth observations, street-level imagery can capture finer grained information in the built environment (Zhang et al., 2019; Zhou et al., 2014). From street-level imagery, human perceptions were measured and human mobility was predicted (Zhang et al., 2019, 2018), but the same problem of not taking human activities into account arose.
Data fusion is a promising approach to overcome the biased representativeness of a single source of data. Semantics of human activities retrieved from mobile phone positioning data were enhanced by incorporating check-in data and social media data (Tu et al., 2017). Meanwhile, regions of different functions were discovered by fusing human mobility and points of interests located in a region (Jiang et al., 2012).
Compared with these data sources, social media has more advantages in representing human activities (Liu et al., 2015). The quantity of social media entries indicates the intensity of human activities, while the texts contain rich information about these activities and place semantics. Spatiotemporal urban activities were identified from notional words extracted from social media data (Blei et al., 2003; Fu et al., 2018). These methods are inefficient because not all notional words are related to activities. Indeed, in natural language, verb–noun phrases (e.g. “build a mall”) are the primary way to describe activities. However, complete verb–noun phrases are uncommon in social media texts due to the large volume of noun vocabulary, so verbs are a better choice as the proxy for human activities. Verbs express human activities more concisely and also prevent the noise caused by irrelevant notional words. A large number of verbs makes it possible to recognize urban functions.
The sparsity of social media data may lead to the unreliability of results. Inspired by place affordance, urban function recognition can be refined after capturing features of the urban physical environment (Jordan et al., 1998; Zhang et al., 2019). Street-level imagery is a good choice to mitigate the sparsity issue as it contains finer information about the physical environment at the street scale.
In this study, we present a method to recognize urban functions from two perspectives: human activities and the physical environment. Based on deep neural networks, the information about urban functions extracted from social media text is used to label street-level imagery. Urban functions are first inferred through aggregating human activities identified by verbs from social media posts. Then, the sparsity issue caused by social media is further mitigated by constructing the relationship between urban functions and the urban physical environment depicted by street-level imagery. The results demonstrate that verbs are appropriate as a proxy of human activities, and incorporating street-level imagery can largely improve the performance.
Data
Sina Weibo (https://www.weibo.com), a Twitter-like service, is one of the most popular social media platforms in China. It allows users to tag their mobile device coordinates on Weibo posts. Each post contains the post ID, user ID, post time, post coordinates, and text content. Sina Weibo posts from January 2016 to January 2017 in Beijing, China were collected by invoking the Sina Weibo API. To avoid the effect of irrelevant information on the results, we pre-processed the data set as follows: first, #Theme#, [emoji], @username, and http links were removed from the contents; second, blank posts were deleted from the data set; third, Weibo posts with repeated content were deleted if the length was greater than 10. Finally, 2,816,803 geotagged Weibo posts were retained, and their distribution is shown in Figure S1A in the Supplemental Material. The average number of records per grid cell is 997 and is much higher in hotspots. We believe that this amount of data is enough to support the study.
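The three pre-processing steps can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the regular expressions are assumptions about how topics, emoji shortcodes, mentions, and links appear in raw text, the length is measured in characters after cleaning, and the first copy of a repeated post is kept.

```python
import re

# Illustrative patterns (assumptions, not the study's actual rules).
URL = re.compile(r"https?://\S+")        # http links
TOPIC = re.compile(r"#[^#]*#")           # #Theme# topics
EMOJI = re.compile(r"\[[^\[\]]*\]")      # [emoji] shortcodes
MENTION = re.compile(r"@[^\s@:：,，]+")   # @username


def clean_text(text):
    """Strip links, topics, emoji shortcodes, and mentions from one post."""
    for pattern in (URL, TOPIC, EMOJI, MENTION):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def preprocess(posts, min_dup_len=10):
    """Clean posts, drop blank ones, and drop repeats longer than min_dup_len."""
    seen = set()
    kept = []
    for post in posts:
        text = clean_text(post)
        if not text:
            continue  # blank after cleaning
        if len(text) > min_dup_len:
            if text in seen:
                continue  # repeated long content
            seen.add(text)
        kept.append(text)
    return kept
```

Short repeated posts are kept on purpose, because brief texts can legitimately recur across users.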
Street-level imagery was requested and collected at intervals of 30 meters along the Beijing road network in 2016 by calling the Tencent street view image API (https://lbs.qq.com). In total, 256,089 locations served as sampling points. Each sampling point contains a point ID, record time, coordinates, and four street view images.
Land use, POI check-ins, and taxi trajectories were used to validate the results. The land use data were published in 2017 by the planning administration of Beijing, with a 30 meter spatial resolution, shown in Figure S2 in the Supplemental Material. Land use categories are shown in Table S1 in the Supplemental Material. POI check-in records (132,998) in Beijing in 2016 were collected, which included ID, name, coordinates, primary type, and the number of check-ins. POI categories are shown in Table S2 in the Supplemental Material. People often travel by taxi between various places in a city. The nature of places can be partially revealed by taxi flows. The trajectories of 20,000 taxis in Beijing in 2016 were collected, each of which recorded the locations and times of the pick-up and the drop-off. The spatial distributions of POI check-ins and street-level imagery are displayed in Figures S1B and S1C in the Supplemental Materials. The inner area of 5th Ring Road of Beijing was set as the study area because it has diverse urban functions. A grid with a spatial resolution of 500 meters was chosen as the unit for spatial analysis.
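Assigning records to the 500 meter analysis grid can be sketched as below. The sketch assumes that coordinates have already been projected to a metric coordinate system and that `x_min` and `y_min` (hypothetical names) mark the lower-left corner of the study area.

```python
import math


def grid_cell(x, y, x_min, y_min, cell_size=500.0):
    """Return the (column, row) index of the cell containing a projected point."""
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y - y_min) / cell_size)
    return col, row


def count_per_cell(points, x_min, y_min, cell_size=500.0):
    """Count how many points fall into each grid cell."""
    counts = {}
    for x, y in points:
        cell = grid_cell(x, y, x_min, y_min, cell_size)
        counts[cell] = counts.get(cell, 0) + 1
    return counts
```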
Methodology
A novel deep-learning-based method was presented to recognize urban functions from multi-source data. First, urban functions were initially recognized through aggregating human activities identified by verbs in social media texts from the perspective of dynamic behavior, named urban functions based on verbs (UFV). Then, the physical visual environmental information was extracted from street-level imagery, called urban functions based on street-level imagery (UFS). Finally, urban functions were inferred comprehensively by fusing urban functions based on dynamic behaviors (verbs) and static properties (street-level imagery).
UFV in social media texts
Following the “data–human activities–urban functions” stream (Tu et al., 2017), there are two issues to consider. The first issue is how to represent human activities accurately. As a more direct proxy for human activities than other language components like nouns and adjectives, verbs avoid the influence of possible noise in social media texts. It can be argued that verbs representing similar human activities have similar structural features in the corpus, and these features can be captured by word embeddings (Mikolov et al., 2013; Turney and Pantel, 2010). The second issue is how to recognize urban functions from human activities. Urban functions are defined as the aggregation of similar human activities (Crooks et al., 2015). A verb, being different from a verb–noun phrase, may belong to multiple urban function categories with different possibilities called urban function membership. Therefore, a soft cluster method is performed on verbs to define urban function categories. The quantities of different types of verbs indicate the proportions of different urban function categories. Based on urban function membership and quantities of verbs, urban functions are recognized initially. The workflow is shown in Figure 1.
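One plausible way to combine urban function membership with verb quantities is a membership-weighted sum of verb counts, normalized to proportions. The sketch below is an assumption made for illustration and not necessarily the exact formulation used here. `verb_counts` and `membership` are hypothetical inputs for a single grid cell.

```python
def function_proportions(verb_counts, membership):
    """Aggregate verb counts into urban function proportions for one grid cell.

    verb_counts: dict mapping verb -> number of occurrences in the cell
    membership:  dict mapping verb -> list of memberships, one per function
    """
    n_functions = len(next(iter(membership.values())))
    totals = [0.0] * n_functions
    for verb, count in verb_counts.items():
        weights = membership.get(verb)
        if weights is None:
            continue  # verb without a membership vector is ignored
        for k, w in enumerate(weights):
            totals[k] += count * w
    total = sum(totals)
    if total == 0:
        return totals
    return [t / total for t in totals]
```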

Figure 1. Recognizing UFV in social media texts.
First, verbs are extracted from Weibo posts after Chinese word segmentation and part-of-speech tagging. Verbs that are less related to human activities (e.g. “is,” “want,” “have,” and so on) are removed by manual selection based on their semantics. These verbs cannot distinguish the spatial heterogeneity of urban functions because their spatial distribution is similar to that of Weibo posts as a whole. Next, the remaining verbs are mapped to feature vectors by word embedding technology. Then, singular value decomposition (SVD) is performed to reduce the dimension of word embeddings for subsequent clustering.
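The verb filtering step can be sketched as follows, assuming segmentation and tagging have already produced `(word, tag)` pairs. The tag `"v"` and the stop list are illustrative assumptions, and English words are used only for readability.

```python
from collections import Counter


def activity_verb_counts(tagged_tokens, stop_verbs, verb_tag="v"):
    """Count verbs that remain after removing manually selected stop verbs."""
    return Counter(
        word
        for word, tag in tagged_tokens
        if tag == verb_tag and word not in stop_verbs
    )
```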
Fuzzy c-means clustering is used to allocate a verb to multi-urban function categories with different probabilities (Bezdek, 2013; Dunn, 1973). This is an unsupervised clustering approach in which the cluster number
The verb frequency vector
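For reference, the standard fuzzy c-means iteration alternates between updating cluster centers and updating memberships. The following is a minimal pure-Python sketch of that textbook procedure. The fuzzifier `m=2.0`, the random initialization, and the stopping tolerance are assumptions, not settings reported in this study.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length vectors."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = []
    for _ in range(n_iter):
        # Centers are means of the points weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total for d in range(dim)]
            )
        # Memberships depend on relative distances to all centers.
        shift = 0.0
        for i in range(n):
            dists = [max(math.dist(points[i], c), 1e-12) for c in centers]
            new_row = []
            for k in range(n_clusters):
                denom = sum(
                    (dists[k] / dists[j]) ** (2.0 / (m - 1.0)) for j in range(n_clusters)
                )
                new_row.append(1.0 / denom)
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:
            break
    return centers, u
```

Each row of the returned membership matrix sums to one, so a verb can belong to several urban function categories with different degrees.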
Enriching urban function by incorporating street-level imagery
The UFV-based urban functions may not be statistically significant due to the sparsity of social media data. Inspired by place affordance, UFV-based urban functions can be improved by incorporating street-level imagery. In places with enough verbs, the relationship between urban functions and the urban physical visual environment is built using a neural network model. Then, the neural network is applied to the places with insufficient verbs to predict their urban functions. The workflow is shown in Figure 2.

Figure 2. Recognizing urban functions based on the physical environmental feature of places using a neural network. DNN: deep neural networks; UFV: urban functions based on verbs.
ResNet18, a convolutional neural network with 18 weighted layers, is pre-trained on more than a million images from the ImageNet database to classify images into 1000 object categories such as plant, sport, person, and covering. It is employed to extract the visual feature of street view images as the network has learned rich feature representations for various scenes (He et al., 2016). This feature extractor has proven to be efficient in various computer vision tasks (Zhang et al., 2018; Zhou et al., 2017, 2014). Each street view image is processed and summarized by ResNet18 as a 512-dimensional numeric vector that contains the semantic and contextual information about the physical environment of a scene.
Similarly, the feature vectors of street view images are decomposed into a 108-dimensional numeric vector
The relationship between urban functions and the urban physical visual environment is built by training a fully connected neural network with two hidden layers. The training accuracy is measured by Spearman’s rank-order correlation coefficient and the Pearson correlation coefficient. After the neural network is trained, urban functions at all red shooting points are predicted by the neural network. The mean of the predicted proportions of an urban function at all shooting points in a grid cell is determined as the function proportion from the perspective of the urban physical visual environment.
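The two accuracy measures can be computed as in the sketch below, which follows the textbook definitions: the Pearson coefficient on raw values, and the Spearman coefficient as the Pearson coefficient of average ranks (ties share their mean rank). The function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))
```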
Comprehensive urban functions based on human activities and the physical environment
In the places with sufficient verbs, UFV-based urban functions are closer to reality because verbs directly represent human activities. However, the reliability of UFV-based urban functions deteriorates as the number of verbs decreases. When the number is less than the threshold (the median of the verb number), the performance from street-level imagery is better to some extent. A weighted fusion method of urban functions is presented as equation (3) to improve the UFV-based results by integrating street-level imagery
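To illustrate how a verb-count-dependent weighted fusion could behave, the sketch below uses a hypothetical linear weight: verb-based results are trusted fully at or above the threshold and proportionally less below it. This weighting is an assumption for illustration only and should not be read as equation (3).

```python
def fuse(ufv, ufs, n_verbs, threshold):
    """Blend verb-based (ufv) and imagery-based (ufs) function proportions."""
    # Hypothetical weight: 1 at or above the threshold, linear below it.
    w = min(1.0, n_verbs / threshold) if threshold > 0 else 1.0
    return [w * a + (1.0 - w) * b for a, b in zip(ufv, ufs)]
```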
Results
UFV in Weibo posts
Verbs were extracted from Weibo posts to represent human activities. After removing the verbs that were less related to human activities, 1909 verbs were ultimately retained. The top 20 verbs are shown in Table S3 in the Supplemental Material. Verbs were clustered by the fuzzy c-means approach. Figure 3 shows the results when the cluster number is 12 determined by the fuzzy partition coefficient. Each cluster indicates an urban function. Verbs with a membership value greater than 0.8 for each function were used to generate word clouds. According to the corresponding word clouds, the urban functions were named as work, recreation, leisure, relaxation, shopping, daily life, rest, study, exercise, traffic, housework, and other, respectively. Furthermore, the proportions of the urban functions in the study area were calculated. In the end, the urban function category other was left alone because it cannot be explained.
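The fuzzy partition coefficient is the mean of the squared memberships, ranging from 1/c for a completely fuzzy partition to 1 for a crisp one. A minimal sketch is given below. Selecting the cluster number with the highest coefficient is a common convention and is assumed here, and `best_cluster_number` is a hypothetical helper.

```python
def fuzzy_partition_coefficient(memberships):
    """Mean over samples of the sum of squared memberships."""
    n = len(memberships)
    return sum(v * v for row in memberships for v in row) / n


def best_cluster_number(fpc_by_k):
    """Pick the cluster number whose partition has the highest coefficient."""
    return max(fpc_by_k, key=fpc_by_k.get)
```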

Figure 3. The named urban functions by verb clouds. (a) Shopping (17.8%), (b) work (16.6%), (c) housework (16.6%), (d) exercise (14.7%), (e) recreation (11.7%), (f) rest (5.5%), (g) daily life (3.0%), (h) study (2.4%), (i) traffic (1.9%), (j) relaxation (0.8%), (k) leisure (0.7%), and (l) other (8.4%).
The three urban functions with the highest proportions were shopping, work, and housework. Urban functions that cannot be explained made up 8.41%. Urban functions about entertainment, which included shopping, recreation, leisure, and relaxation, made up 31.09%. Next, the proportion of daily life reached 25.01%, including daily life, rest, and housework. Then, the proportion of urban functions about work, which comprised work alone, was only 16.60%. The main urban functions based on social media texts were therefore entertainment, daily life, and work. These three urban functions probably have high proportions because people prefer to share topics about recreation and work.
UFS
The number of verbs within 250 meters of each shooting point was counted. The median number of verbs of all shooting points, which is 266, was selected as the threshold to determine whether a shooting point had enough verbs. In total, there were 124,806 shooting points with more than 266 verbs. After removing data dependencies (explained in the “Enriching urban function by incorporating street-level imagery” section), 48,177 shooting points were finally retained as training data. The feature vectors and labels of the training data were
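The selection of verb-intensive shooting points can be sketched as follows, assuming projected coordinates in meters. The brute-force distance loop is for illustration only, since a spatial index would be needed at the scale of this data set.

```python
import math
import statistics


def verbs_within_radius(shooting_points, verb_points, radius=250.0):
    """Count verb locations within the radius of each shooting point."""
    counts = []
    for sx, sy in shooting_points:
        n = sum(
            1 for vx, vy in verb_points if math.hypot(vx - sx, vy - sy) <= radius
        )
        counts.append(n)
    return counts


def split_by_median(counts):
    """Return the median count and the indices of points strictly above it."""
    threshold = statistics.median(counts)
    enough = [i for i, c in enumerate(counts) if c > threshold]
    return threshold, enough
```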
UFV-based urban functions are expected to be improved by integrating street-level imagery. The spatial distributions of urban functions should be similar to the results based on verbs in verb-intensive places, where the neural network can fix the misclassification caused by insufficient verbs. Taking entertainment as an example, the spatial distribution based on verbs and street-level imagery is shown in Figure S3 in the Supplemental Materials. Although some typical places were not detected, the overall distribution of the predicted urban functions was consistent with the results based on verbs in verb-intensive places. Typical entertainment places that were misclassified due to insufficient verbs were fixed by the street-level imagery. In general, street-level imagery achieved the expected effectiveness of inferring urban functions from the perspective of the physical visual environment.
Urban functions based on multi-source data
Finally, the values of the urban functions were calculated using equation (3). After fusing the urban functions, the spatial distribution patterns of all urban function categories became clearer, as shown in Table 1. These distributions can be further synthesized into three patterns including merging, diverging, and scattered.
Table 1. The spatial distribution of urban functions.
The first pattern of spatial distribution merging included work, shopping, recreation, and other. It reflected that related activities gather in the center of cities, and the intensity of central activities was much higher than the surroundings. Work (0–21.10%) was mainly distributed in the northwest of the study area. Typical places with a high proportion of shopping (0–30.86%) were concentrated in the west. The area with a high proportion of recreation (0–39.67%) was distributed in the center of the study area. This is in line with the real distribution of human activities for work and entertainment. Furthermore, there were mainly three parts including work, entertainment, and the unexplainable part. This reflects the phenomenon that work places and places of entertainment are mainly concentrated in the main urban areas of most metropolises. The distribution of other (0–20.05%) was similar to the overall distribution of Weibo posts. The proportion of other was the degree of inexplicability from this method.
The second pattern was diverging, which is contrary to the merging pattern. These were mainly some daily life and leisure activities. The proportion of urban functions about daily life including rest (0–30.32%) and housework (0–34.62%) was lower in the center and higher around the study area. The spatial distribution of rest was related to the distribution of residential areas. The separation of workplace and residence was obvious compared with the merging pattern of work and entertainment. The distribution of leisure activities, including leisure (0–6.17%) and relaxation (0–14.64%), was similar to the distribution of daily life activities, but different from entertainment activities.
Different from the above two patterns, the scattered pattern indicated the spatial discretization of these activities. Traffic activities (0–21.80%) were scattered in Beijing, even though they showed certain diverging characteristics. Activities about study (0–1.90%), exercise (0–20.00%), and daily life (0–10.20%) were scattered more significantly.
These three patterns can be divided into two categories. The first is aggregation including the merging and diverging patterns. The differentiation between the merging and diverging patterns may be determined by the relationship between the nature of different urban function categories and economic factors. The second is dispersion, i.e. the scattered pattern. These activities happen everywhere without an aggregation trend or in some small places.
The typical places of urban function categories
The spatial aggregations of high-proportion grid cells of an urban function were identified as its typical places. If typical places of urban functions were in line with reality, the recognition results can be considered reasonable and therefore the method is reliable. Work and entertainment related functions were chosen as examples because their typical places were easily identified in reality. Typical places were circled by red polylines, and some street view images in these typical places are shown in Figure 4.

Figure 4. Typical places of urban function: (A) work and (B) entertainment.
There were eight typical places about work that are marked in Figure 4(A) including (a) Peking University, (b) Tsinghua University, (c) University of Science and Technology Beijing, (d) Asian Sport Village, (e) Sanyuan Bridge, (f) Jiuxian Bridge, (g) Zhongguancun Technology Park, and (h) Financial Street. These typical places are all well-known working areas in Beijing, belonging to various work types, such as scientific research, Internet information technology, finance, fashion industries, and more. It was found in the street view images of such places that tall buildings without external billboards were the representative elements.
Urban functions about entertainment included recreation, leisure, relaxation, and shopping. Because of their similar semantics, these function types were investigated together. Nine typical places were discovered as marked in Figure 4(B) including (a) Dong Zhimen, (b) Wangjing, (c) Niujie, (d) Workers’ Indoor Arena, (e) Xidan, (f) Wudaokou, (g) JOY CITY, (h) Wangfujing, and (i) Asian Sport Village. All these typical places are entertainment regions where citizens go for relaxing, shopping, entertaining, and enjoying various foods. In the street view images of these typical places, the visual elements were quite different. However, in general, these places usually had distinctive styles of buildings that people can easily recognize as entertainment venues.
Street-level imagery had similar elements in typical places of the same urban function but differed considerably between typical places of different urban functions. This indicates that street-level imagery can help recognize urban functions from the perspective of the physical visual environment.
Most typical work and entertainment places could be effectively identified, but some places were recognized inaccurately. For example, Yizhuang, which is a typical workplace shown in Figure 4(j), could not be recognized because it is at the boundary of the study area. At the same time, Jianguomenwai Dajie, shown in Figure 4(A)(i), was incorrectly identified as a typical entertainment place due to the combination of mixed function and biases of social media data. Overall, mixed function and social media biases jointly affect the results of this method.
Verification of the results
The methodology was verified according to two aspects: (1) As the urban form and functions are highly interrelated, the final results were validated by checking their correlation with land use; and (2) places with a specific urban function will show distinct temporal characteristics due to their diverse activities. The effectiveness of the method was verified by checking the consistency between the temporal signatures of typical places based on taxi trajectories and the corresponding urban functions.
The correlation between urban functions and land use
Land use is the physical support of urban functions, and urban functions are the actual performance of land use. Therefore, urban functions can be validated by verifying the correlation with land use despite no ground truth for the urban functions. The correlations with urban function based on verbs, street-level imagery, and fusion were measured by Spearman’s rank-order correlation coefficient. The results are shown in Table 2.
Table 2. The correlation between land use and urban functions.
*p < 0.1, **p < 0.01.
POI check-in data, a good indicator of human activities, were also introduced to further verify the results. The POI types were quite different from the types of urban functions, so only POI types that correspond to urban functions were selected for result validation, including Entertainment, Scenic Spots, and Leisure. The number of check-ins at each category of POIs was counted in all grid cells and was then normalized by total check-in numbers. This represents the proportions of each category of land use (Wu et al., 2018). The results are shown in Table 3.
Table 3. The correlation between POI check-in proportion and urban functions.
POI: point of interests.
**p < 0.01.
The correlation with the UFV-based urban functions stayed at about 0.15. The low correlation might result from the sparsity issue of Weibo posts. The correlation with the UFS was slightly higher than that with the UFV, at about 0.18. After fusing the results of both street-level imagery and verbs, the correlation increased significantly, to about 0.45. This increase shows the improvement due to the fusion approach. The result indicated that urban functions can be recognized by human activities identified by verbs, and the sparsity of social media can be alleviated by incorporating street-level imagery.
The temporal signatures of taxi destinations
Human mobility retrieved from taxi trajectories can infer urban land use and function (Liu et al., 2012). To further verify whether these typical places correspond to their urban functions, the temporal signatures of taxi destinations were drawn after counting the number of taxi arrivals 24 hours a day within a year. Three typical places were chosen from work and entertainment (typical work places: (a) Peking University, (b) Sanyuan Bridge, and (c) Jiuxian Bridge; typical entertainment places: (d) Workers’ Indoor Arena, (e) Wudaokou, and (f) Wangjing). The time spectrum curves of taxi drop-offs in these typical places are shown in Figure 5.

Figure 5. Temporal signatures of taxi flows (destination). (a) Peking University, (b) Sanyuan Bridge, (c) Jiuxian Bridge, (d) Workers’ Indoor Arena, (e) Wudaokou, and (f) Wangjing.
There was always one peak between 8 am and 12 pm in typical work-related places. Typical entertainment places had two peaks, one at noon and another in the afternoon or at night. These functions have similar temporal sequences to the land uses that were recognized in Liu et al. (2012). In Figure 5(c) and (f), there were not only morning peaks representing work but also afternoon or night peaks representing entertainment in Jiuxian Bridge and Wangjing. These two places are adjacent to each other, therefore resulting in complex urban functions. This analysis showed that the proposed method could accurately distinguish between mixed urban functions.
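A temporal signature of this kind can be derived by counting drop-offs per hour of day over the whole year. The sketch below assumes drop-off times are available as `datetime` objects for one place. `peak_hour` is a hypothetical helper that returns the first hour with the most arrivals.

```python
from collections import Counter
from datetime import datetime


def hourly_signature(dropoff_times):
    """Return a 24-element list of drop-off counts, indexed by hour of day."""
    counts = Counter(t.hour for t in dropoff_times)
    return [counts.get(h, 0) for h in range(24)]


def peak_hour(signature):
    """Return the hour of day with the most drop-offs."""
    return signature.index(max(signature))
```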
Conclusions
Various human activities in the urban space result in varying functional dynamics. Urban functions reflect the capacity of urban space to carry human activities. An in-depth understanding of urban functions helps to understand urban structures and facilitate urban planning and management. Compared with traditional methods, the proposed method can reveal urban functions and detect their changes more quickly and in real time. Therefore, there are two potential applications. The first is to promptly verify the effect of urban planning, and the second is precise location recommendations for the coming era of the Internet of Things and Internet of Vehicles.
In this paper, a methodology of urban function recognition based on multi-source data was presented. In detail, verbs were extracted from social media texts as the proxy for human activities. Twelve urban functions were recognized by clustering verb embeddings, including shopping, work, housework, exercise, recreation, rest, daily life, study, traffic, relaxation, leisure, and other. Then, the sparsity issue of social media data was mitigated by incorporating street-level imagery. The validation of the results showed that urban functions were effectively recognized using verbs as a proxy for human activities and were largely improved by incorporating street-level imagery.
The contribution of this work is twofold: (1) Verbs were first used as the proxy for human activities on urban function recognition. Compared with notional words, verbs as the more direct proxy for human activities avoid the influence of noise. (2) Social media and street-level imagery were integrated into urban function recognition from both a semantic and visual perspective. This study helps to understand urban structures and man–land relationships in a better way.
Nevertheless, two issues need to be solved. One is that some urban functions are missing due to the bias of social media. For instance, the top 20 verbs extracted from Weibo posts were mostly related to human activities about entertainment, work, and daily life, but none about industries. This problem can be potentially addressed by incorporating more data sources, for example the number of industrial employees (industrial activities), traffic data (transportation activities), etc. (Yuan et al., 2019). The other is that some indoor activities are hidden when urban functions were detected mainly by the street-level imagery. Street-level imagery can identify indoor activities when the functions of buildings are obvious like restaurants or bars. However, some indoor activities are hidden when the building style and surrounding environment are similar between different urban functions. The percentage of POI check-in numbers was used to try and mitigate this issue, but relevant improvement was not obvious. Maybe more representative data, which could reflect indoor activities, can alleviate or solve this problem, for instance the type of companies inside buildings, the number of employees, and their output value. These data could be used to uncover the types and intensities of activities inside buildings.
Supplemental Material
sj-pdf-1-epb-10.1177_2399808320935467 - Supplemental material for Urban function recognition by integrating social media and street-level imagery
Supplemental material, sj-pdf-1-epb-10.1177_2399808320935467 for Urban function recognition by integrating social media and street-level imagery by Chao Ye, Fan Zhang, Lan Mu, Yong Gao and Yu Liu in Environment and Planning B: Urban Analytics and City Science
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the National Natural Science Foundation of China (41971331, 41625003).
Supplemental material
Supplemental material for this article is available online.
