Abstract
Air quality in densely populated urban areas is now recognized as an environmental crisis, and air pollution affects the sustainability of cities. Air quality data are essential for controlling air pollution and protecting people from its hazards. However, air quality monitoring infrastructure is expensive to build and maintain, and measurements recorded at one point do not generalize even to locations a few kilometers away. Integrating multiple data sources yields gains that can never be achieved by processing each source independently. Urban organizations in each city independently produce and record data relevant to their own goals and objectives, which creates separate data silos within an urban system. These data vary in model and structure, and integrating them provides an opportunity to discover knowledge that is useful in urban planning and decision making. In our previous research, we proposed a novel model that integrates urban big data resources to predict Particulate Matter (PM), the main factor of air quality, in regions of a city where air quality sensors are not available. This paper aims to show the generality of that model by extending the model and running experiments with various configurations for different smart city settings. This work extends the evaluation scenarios of the model with an extended dataset of the city of Aarhus, Denmark, and compares the model’s performance against several specified baselines. The paper also details how the Multiple Data Set Aggregator & Heterogeneity Remover (MDA&HR) removes the heterogeneity of multiple data sources and how the operation of the Train Data Splitter (TDS) part of the model is improved by finding more similar patterns of air quality. The acceptable accuracy of the results shows the generality of the model.
Introduction
Today, 64% of the world’s population lives in urban areas, and by 2050, it is expected to grow to two-thirds of the world’s population. In other words, more than six billion people might live in cities by 2050 [1]. Thus, cities will be the forerunners of the shift towards competitive and efficient economies with reduced air pollution [2]. Certainly, cities are considered among the major players in addressing the challenges of society and the economy, including developments with low air pollution, energy efficiency, renewable energy innovation, economic growth, and so on.
The rapid growth of urbanization has led to the creation of large cities and thus the modernization of people’s lives. It has also created new challenges such as poor air quality and pollution, high energy consumption, and heavy traffic. Today, thanks to widespread computing technologies, sensors, and infrastructure, huge volumes of data are being generated in urban areas, such as transportation, air quality, traffic pattern, and geographic data. Big data holds knowledge about cities and, if used correctly, can help overcome the challenges mentioned. For example, it is possible to identify city road problems by analyzing the mobility data of people at the city level. This discovery could be very useful in future urban planning [3].
Air quality affects human life and breathing. As the weather changes day by day and even hour by hour, the quality of the air can also vary. Air quality monitoring and management in large cities converts air quality data into air quality indexes and provides the public with the information they need. Therefore, the air quality index is a key tool for understanding air quality, how air pollution affects health, and practices for protection against air pollution.
Air quality in densely populated urban areas is now recognized as an environmental crisis, and air pollution affects the sustainability of cities. Therefore, measuring air pollution and exploiting the results is very important for predicting and discovering relationships between various urban issues. Despite the development of technology, generating, recording, maintaining, and processing data remain costly and complex for some city resources, so recording and analyzing data efficiently is crucial for these resources. For example, air quality data are essential for controlling air pollution and protecting people from its hazards. However, air quality monitoring infrastructure is expensive to build and maintain, and measurements recorded at one point do not generalize even to locations a few kilometers away.
Some of the gains come from the integration of multiple data sources, which could never be achieved through independent single-source processing [4]. Urban organizations in each city independently produce and record data relevant to the organization’s goals and objectives. These issues create separate data silos associated with an urban system. These data are varied in model and structure, and the integration of such data provides an appropriate opportunity to discover knowledge that can be useful in urban planning and decision making.
Today, a large amount of data is available on the characteristics of cities. Due to the development and advancement of ICT in smart cities, the sources of such data are expanding. However, production and maintenance of some of these resources, such as air quality sensors, are costly. Therefore, this study seeks to replace air quality recording sensors with other available data sources in order to predict, with acceptable accuracy, one of the most important elements of air quality: particulate matter. Accordingly, the focus of this research is the development of a particulate matter prediction system that integrates a variety of large and diverse data sources in the smart city and exploits their synergy, so that air pollution can be predicted without installing and maintaining air pollution sensors. This requires the integration of diverse data sources in smart cities to access hidden knowledge that cannot be extracted from standalone data sources.
By leveraging huge data opportunities and capabilities and integrating diverse heterogeneous data sources into smart cities, one can find the right solutions for the sustainable development of cities. In this paper, by combining different big data sources in the smart city, particulate matter is predicted as one of the main sources of air pollution without the use of costly air pollution recording sensors. In other words, the feasibility of predicting air quality levels, especially particulate matter, is examined using the economic resources of data collection in smart cities. This paper is an extended version of our previous research [5] with extended experiments to show the generality of the model.
Section 2 reviews previous works that used alternative data sources in urban computing applications, as well as research related to air quality prediction. Section 3 introduces the dataset used in the experiments. Section 4 details the proposed framework and model. Section 5 presents the experiments that were conducted to verify the generality of the model, together with an evaluation of their results. Finally, Section 6 provides conclusions.
Related work
This section first explores previous works that tried to use alternative data sources, instead of the main dataset in the domain of urban computing applications. Then, it investigates recent works that tackle issues which relate to air pollution.
Alternative dataset in urban computing
A number of investigations have shown that health-promoting centers (e.g., fitness and dance centers) can be more easily accessed in rich urban neighborhoods, whereas major health-damaging centers (e.g., fast-food outlets) are mainly located in poorer districts. The official land-use information was collected for these investigations. Nevertheless, some researchers used social media as an alternative data source to explore the connection between resources and neighborhood deprivation [6, 4, 7].
The likelihood of extracting free, recent, and reliable land-use information from Foursquare’s users, to an acceptable extent, can assist the discussion about the usability of “organic data” from geospatially-referenced social network data, in contrast to curated spatial data (i.e., the official land-use information) [7]. Some authors also declared that social media (such as Foursquare) can be used to observe the physical changes in a neighborhood at reasonable temporal resolutions, over what is possible with the official land-use data [8].
Ruiyun et al. [9] suggested a method to predict the Air Quality Index (AQI) of the whole regions in Shenyang (China), in accordance with the AQI which was reported by the air quality monitoring stations, the meteorological data of the weather stations, the road information and the real-time traffic data collected from Baidu Map and Google Maps, and the POI distributions provided by Baidu Map and Google Maps. Moreover, the random forest algorithm was applied to predict all of the uncovered regions in the downtown area.
Some differences exist between the abovementioned research and this work. This study employed a model based on the concept of transfer learning to predict the PM
The city-wide air quality, by using a small number of the AQI stations and in accordance with Granger causality analysis, was predicted in [10]. An extended spatial-temporal (S-T) Granger causality model was presented; it investigates all the causalities among urban dynamics and air pollution in a consistent approach. The inappropriate groups of urban dynamics were avoided by executing the non-causality test. In order to cope with time efficiency, a method was suggested to recognize the Region of Influence (ROI) and next, separate “big data” into “small data”. The city-wide air quality map can be obtained and visualized by integrating the training of the urban dynamics with the identification of the ROI. The results demonstrated that the causality-based method outpaced other interpolation or training techniques because it considered all the S-T correlations and processed only highly effectual urban data.
In [11] the global PM
Prediction of air pollution
Li et al. [12] developed and constructed PiMi air box, which is an economical and portable sensor and is capable of predicting PM
In [13], with the aid of air quality data for PM
Honarvar et al. [14, 15] applied the Spark platform to extract sequence patterns from appliances’ power usage in smart homes. Their findings show the importance of extracting sequence patterns from power usage data for various applications, such as decreasing CO2 and greenhouse gas emissions.
Dong et al. [16] proposed a model to extract real-time PM
Xia et al. [17] devised a technique to predict PM
Wang et al. [18] suggested a new method for assessing the environmental quality of an area using Chinese social media. They performed sentiment analysis and classification using support vector machine algorithm, which had an 85.64% accuracy rate. They defined a term called the Environmental Quality Index (EQI) and developed an environmental assessment model to assess and predict the environmental quality of the local area. The result of comparing environmental quality between different provinces of China is also very close to the air pollution rating data released by the Ministry of Environmental Protection (MEP) of the People’s Republic of China.
In [19], a machine learning framework is presented that includes multi-source heterogeneous data processing and real-time data processing. The fine-grained air quality estimates were based on data from five real-world sources, including image data provided by smartphone users. Data based on temporal and spatial distribution are divided into three categories. The results show that rational data classification can effectively ameliorate the accuracy of the assessment. On the other hand, the processing of any subclass can be more flexible. Aggregator design can effectively decrease the impact of scattered data on evaluation.
A method of estimating fine-grained PM2.5 based on random forest with data reported by meteorological departments and collected from smartphone users without any PM2.5 measuring device was presented in [20]. The researchers designed and implemented a context for real-world data collection, including user-presented imagery. By merging online learning and offline learning, the random forest-based approach performs well in terms of complexity and temporal accuracy.
Li et al. [21] proposed a new Bayesian-based kernel method to estimate fine-grained PM2.5 concentration. This model leverages heterogeneous data, which share image information, camera lens information, GPS information, and magnetic sensor information. To investigate the relationship between PM2.5 concentration and image information, a crowdsourcing system was created, which collected pictures for 16 consecutive months. The results show that compared to tripods, the proposed algorithm can reduce the forecast error by up to 35% on average.
Data
Today, a large amount of data is available on the characteristics of cities. Due to the spread of Information and Communication Technology (ICT) in smart cities, the sources of such data are expanding. However, production and maintenance of some of these resources, such as air quality sensors, are costly. Therefore, this study seeks to replace air quality recording sensors (such as ozone, carbon monoxide, sulfur dioxide, and nitrogen dioxide sensors) with other available data sources in order to predict, with acceptable accuracy, one of the most important elements of air quality: particulate matter.
The data sources used are meteorology, traffic, points of interest, and the structure of the street network. These data are known as metropolitan data and reflect the dynamics of the city. Data sources in the city are divided into two categories: static and dynamic. Static data sources cover land use, waterways, buildings, roads, facilities, and points of interest. Dynamic data sources cover meteorological, traffic, air quality, and parking data for the time period in question.
Aarhus, a city in Denmark, covers an area of about 91 square kilometers and has a population of approximately 270,000. Like any other city, it faces the problem of particulate matter air pollution. Figure 1 illustrates the study area. The points marked on the map depict the locations of the air quality stations.
Data source features
The location of air quality record stations.
The study period runs from August 1, 2014 to September 30, 2014. Five types of data were used in this study. The dataset is an expanded version of the City Pulse project data [22], to which data collected from other online sources have been added. Meteorological data were extracted from the Wunderground website [23], air quality and street traffic data from the City Pulse project [22], points of interest from Google Maps [24], and the road network structure from OpenStreetMap [25]. Table 1 shows the relevant characteristics of these data sources. The meteorological and traffic data were recorded 12 times per hour, at five-minute intervals, and produced a total of 14,124,555 data samples over the study period.
Part of the prepared dataset published in [26] is used for the experiments in this paper. The prepared data are organized according to three viewpoints of spatial granularity (grid-based, main road, and circular area around the sensors). Details of this dataset can be found in our previously published paper [26].
Framework
Figure 2 presents the framework of the predictive model introduced in our previous research [5]. It consists of five components: Multiple Data Set Aggregator & Heterogeneity Remover (MDA&HR), Train Data Splitter (TDS), Split Data Predictor (SDP), Test Data Cluster Finder (TDCF), and Split Data Prediction Aggregator (SDPA).
Framework of the proposed predictive model.
The MDA&HR collects and integrates multiple-source data, including meteorology, traffic, road network, POI, and pollution data, and removes their heterogeneity. The heterogeneity of this dataset has temporal, spatial, and application-domain origins. Temporal heterogeneity occurs because the sensors in the various data sources capture data over different periods. For example, the meteorological dataset is collected hourly, but traffic data are collected over shorter periods, such as 5- to 10-minute intervals. Spatial heterogeneity arises because the sensors in each domain cover different portions of the city at different granularities. For example, weather data show the weather conditions of a region, but each traffic sensor presents the traffic conditions of a single street. Domain heterogeneity is evident because each domain shows the condition of the city through its own lens and in its own format. Spatial heterogeneity is removed by dividing the city into regions or grids of various sizes, depending on the purpose of the study, and by mapping the features of the datasets onto these regions or grids. Temporal heterogeneity is removed by applying aggregation functions, such as min, max, average, and sum, to the measured data features over the intervals chosen for the study. For example, to map and study all the datasets in one-hour intervals, the average vehicle count over each one-hour period has to be computed. Domain heterogeneity cannot be removed because of the conceptual and technical issues that arise in capturing the status of the city from each domain’s perspective. Indeed, for weather conditions, humidity, wind direction, precipitation, and further elements have to be registered, whereas for traffic data, the number and speed of vehicles need to be captured.
Although domain heterogeneity cannot be removed, it is beneficial because it allows for synergy between the data sources. The remainder of this section describes in detail how the model removes temporal and spatial heterogeneity.
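The temporal homogenization step can be illustrated with a minimal sketch. It assumes that each raw record is a (timestamp, value) pair and that windows are aligned to whole hours within a day. The function name `aggregate_by_interval` and the sample vehicle counts are hypothetical and are not taken from the model's implementation.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_by_interval(records, hours=1):
    """Group (timestamp, value) records into windows of `hours` hours and
    summarise each window with min, max, average, and sum.

    Assumption: `hours` divides 24, so windows never span two days.
    """
    buckets = defaultdict(list)
    for ts, value in records:
        # Truncate the timestamp to the start of the window it falls in.
        start_hour = (ts.hour // hours) * hours
        window = ts.replace(hour=start_hour, minute=0, second=0, microsecond=0)
        buckets[window].append(value)
    return {
        window: {"min": min(v), "max": max(v), "avg": mean(v), "sum": sum(v)}
        for window, v in sorted(buckets.items())
    }


# Hypothetical five-minute vehicle counts from a single traffic sensor.
traffic = [
    (datetime(2014, 8, 1, 8, 0), 12),
    (datetime(2014, 8, 1, 8, 5), 18),
    (datetime(2014, 8, 1, 8, 55), 15),
    (datetime(2014, 8, 1, 9, 0), 30),
]
hourly = aggregate_by_interval(traffic, hours=1)
```

Calling the same function with `hours=3`, `6`, `12`, or `24` would produce coarser windows, which is one way the time parameter of the MDA&HR could be varied between experiments.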
The output of the MDA&HR is an Integrated Dataset (IDS) built from multiple data sources. The IDS is divided into two parts: the Train Integrated Dataset (Train-IDS) and the Test Integrated Dataset (Test-IDS). The train and test datasets of an IDS are created based on the purpose of the study. In this paper, they are created based on regions that are identified by considering the main roads of the city. For example, if the city is divided into 11 regions, the data of 10 regions are used for training and the data of the remaining region are used to test the proposed model. This process is repeated so that the data of every region are used once as the test set. The integrated features of all data sources are included in both Train-IDS and Test-IDS, implicitly for the road network and explicitly for the other urban data sources. Because the purpose of the paper is to predict particulate matter using urban data other than air quality sensors, the air quality features in Test-IDS are used only for evaluation and are not used as input to the proposed model. The IDS is temporally and spatially homogeneous. The TDS aims to segment the Train-IDS into multiple clusters, so that the most similar records of the Train-IDS are grouped in the same cluster. The rationale behind this segmentation is that urban air quality is affected by multiple complex factors, such as traffic flow, meteorology, and land use [27, 28], and these factors follow similar patterns of air quality for the regions of the city in each segment. Each cluster contains records of the Train-IDS that have similar patterns for urban air quality, according to the time interval parameter of the MDA&HR. These clusters are denoted by Train-IDS-Seg
The SDP is devised to create multiple predictive models for each segmented dataset (Train-IDS-Seg
The TDCF is devised to find the proximity of the test records to all Train-IDS-Seg
In the MDA&HR section, time homogenization is based on Eq. (4.2). The input parameter date specifies the desired date, and part specifies the timespan for time homogenization. Depending on the scope and problem, part values can include minutes, days, hours, multi-hour intervals, etc. The variable
After spatial and temporal homogenization, the independent data sources are merged based on the spatial and temporal homogenization parameters, so that multiple independent data sources can be used together.
Equation (4.2) shows this integration.
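As a rough illustration of this integration step, the sketch below joins homogenized sources on a shared (region, time window) key. The inner-join behaviour, the key layout, the function name `integrate_sources`, and the sample feature names are all assumptions made for illustration; the integration equation itself is not reproduced here.

```python
def integrate_sources(sources):
    """Join per-source feature rows on shared (region, window) keys.

    `sources` maps a source name to {(region, window): {feature: value}}.
    Assumption: only keys present in every source are kept (inner join).
    """
    common_keys = set.intersection(*(set(rows) for rows in sources.values()))
    integrated = {}
    for key in sorted(common_keys):
        row = {}
        for name, rows in sources.items():
            # Prefix each feature with its source name to avoid clashes.
            for feature, value in rows[key].items():
                row[f"{name}_{feature}"] = value
        integrated[key] = row
    return integrated


# Hypothetical homogenized rows for two sources (traffic and weather).
sources = {
    "TRA": {
        ("R1", "2014-08-01T08"): {"avg_count": 15, "avg_speed": 42.0},
        ("R2", "2014-08-01T08"): {"avg_count": 7, "avg_speed": 55.0},
    },
    "WEA": {
        ("R1", "2014-08-01T08"): {"humidity": 71, "wind_speed": 3.2},
    },
}
ids = integrate_sources(sources)
```

An inner join drops region and window pairs that lack data in some source. A different policy, such as filling missing features with a default, would also be plausible, depending on how complete each source is.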
After preparing the IDS, the data is divided into training and test data sets. This segmentation can be done in different ways based on the type of criteria used for modeling and testing. For example, 30% of the data can be selected as a test and the rest as train data or using cross-validation with different
Similarity computation based on temporal and spatial properties
The rationale for air quality data segmentation is that air quality is influenced by complex factors such as traffic, meteorology, land use, and transportation. The following Eq. (4.3) illustrates how similarity is calculated according to the proposed criteria. Similarities are computed according to Eqs (4.3)–(4.3). Values
The traffic of a region can be simulated based on speed and count of vehicles. In Eq. (4.3), the
Experiments and results
Previously, the proposed model was evaluated based on main-road zones, and the results showed that the approach can predict particulate matter with acceptable accuracy [5]. In addition to the previous scenarios, we evaluated the models for various viewpoints of the dataset in new scenarios to show the generality of the models. In these new scenarios, one region is considered as training data, and the selected region is considered for testing. In the current evaluation, the zones are defined by a radius of one kilometer around the traffic and air quality sensors. These zones are also used as the criterion for eliminating spatial heterogeneity.
Figure 3 shows the location of the sensors and the area around each sensor. In this case, all data sources that have a geographical coordinate attribute and fall within one kilometer of a sensor’s location are assigned to that sensor’s zone, which eliminates spatial heterogeneity. POI, traffic, air quality, and road structure data are the data sources that can be located within these areas using their geographic coordinates. This zoning method was chosen to evaluate the performance of the model at a finer spatial scale.
In this evaluation method, within the one-kilometer range, some data sources such as points of interest are treated differently from the regional assessment method based on the city’s main roads. Here, the number of points of interest within the given radius of each area is calculated and used as the feature of this data source. This is because points of interest are very diverse in terms of group and category, and in practice there is only a limited number of points of interest within a kilometer of each sensor. Using the type of points of interest as a feature would therefore lead to a sparseness problem that negatively impacts model performance. Hence, the number of points of interest is used as the feature of this source.
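The radius-based zoning and the POI count feature can be sketched as follows. The sketch assumes that great-circle (haversine) distance is an acceptable way to test membership in a one-kilometer zone. The coordinates are hypothetical points near Aarhus, and the function names are illustrative rather than part of the model.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def count_within_radius(sensor, points, radius_km=1.0):
    """Count the points that lie within `radius_km` of the sensor location."""
    return sum(
        1 for lat, lon in points
        if haversine_km(sensor[0], sensor[1], lat, lon) <= radius_km
    )


# Hypothetical sensor location and points of interest.
sensor = (56.1500, 10.2000)
pois = [(56.1510, 10.2010), (56.1550, 10.2050), (56.2000, 10.3000)]
poi_count = count_within_radius(sensor, pois)  # two of the three POIs qualify
```

The same counting function could be applied to street coordinates to obtain a street-count feature for each zone.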
The structure of roads and streets in a city has a direct impact on the traffic of an area as well as implicitly reflecting the performance and use of an environment [29, 30]. The number of streets in an area will be extracted as another feature of the data sources of the road network structure and will be considered as input features.
According to the available meteorological data sources, all areas within the one-kilometer margin of the air pollution sensors have the same feature values. Therefore, the features of this data source do not contribute to the spatial homogenization of the different datasets. To evaluate the performance of the model, the combinations of data sources that perform more efficiently in the evaluations are used. The combinations of data sources investigated are presented in Table 2, where POI, RDN, TRA, and WEA denote points of interest, road network structure, traffic, and weather, respectively.
Combining data sources
The location of the sensors and the area around each sensor.
Air quality data are used only to evaluate the accuracy of the model; therefore, in the SDP section of the model, the air quality attributes are not used as inputs in the training phase. Model performance is evaluated using the Root Mean Square Error (RMSE) criterion. The experiments use the same time intervals as the main-road-based assessment. The time parameter is an input to the MDA&HR section. It specifies the time interval over which samples of the different data sources are merged and aggregated to remove temporal heterogeneity.
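For reference, RMSE is the square root of the mean squared difference between observed and predicted values. The sketch below computes it for a handful of made-up PM readings; the numbers are purely illustrative and are not results from the experiments.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(sum(squared_errors) / len(observed))


# Hypothetical observed PM values (held out for evaluation) and predictions.
observed = [20.0, 25.0, 30.0]
predicted = [22.0, 25.0, 26.0]
error = rmse(observed, predicted)  # sqrt((4 + 0 + 16) / 3)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, which is worth keeping in mind when comparing configurations.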
In this evaluation, as in the previous method, the models are compared with a set of baseline methods, such as regression and neural networks, that take a combination of data sources as the input feature set. The results are presented as the average of different experiments in the presented regions. In experiments using cross-validation method with parameter
In the experiments, the
Estimation values over a 3-hour radius-based period
Estimation values over a 6-hour radius-based period
Evaluation values over a 12-hour radius-based period
Estimation values over a 24-hour radius-based period
Comparison of the evaluation over a 3-hour radius-based period.
Comparison of the evaluation over a 6-hour radius-based period.
Comparison of the evaluation over a 12-hour radius-based period.
Comparison of the evaluation over a 24-hour radius-based period.
As can be seen, at shorter intervals the model performs best in TOP1NN mode when the POI, RDN, TRA, and WEA data sources are combined. The best performance with this combination is obtained by TOP1NN for the six-hour period, with an RMSE of 22.035. However, in this timeframe, the simple regression model performs better than the proposed model with the regression kernel, and the simple neural network model performs better than the ordinary regression method. When the model performs best in TOP1NN, the TOP3REG method performs more poorly than TOP1REG. The weakest TOP1NN result in the six-hour timeframe occurs when the combination of POI, RDN, and TRA data sources is used. At larger time intervals, the overall performance of the proposed model is generally weaker than at the six-hour interval.
In Fig. 8, model evaluation results are shown for different combinations of data sources and main-road-based zones, where the data of one area are used for training and the data of the most similar area are used for testing. The most similar area is determined by taking into account the similarity of points of interest, meteorology, type of roads, and traffic. Based on these results, the best performance of the TOP3NN method is obtained when combining points of interest, traffic, and meteorological data sources. In Fig. 9, the results are compared with the region that is most dissimilar in terms of road structure, traffic, meteorology, and points of interest. It is evident that the performance and accuracy of the model decrease in this case, which supports the part of the model that uses the data with the pattern most similar to the intended area for predictions.
RMSE values for a 24-hour time interval (spatial heterogeneity 
RMSE values for a 24-hour time interval (spatial heterogeneity 
RMSE values for a 6-hour time interval (spatial heterogeneity 
Figure 10 shows the results of the model evaluation in different combinations of data sources and zones based on a one-kilometer radius. The data of one area is considered as training data, with data from the most similar non-neighboring area as test. Based on these results, the best performance of the TOP1NN method is obtained when combining the POI, TRA, WEA, and RDN data sources. In Fig. 10, the results are compared with the non-neighboring area data that has the most dissimilarity in terms of road structure, traffic, meteorology, and points of interest. It is evident that the performance and accuracy of the model are the weakest in TOP3REG mode and that the average accuracy of estimations and predictions has decreased.
Conclusions
This paper investigated additional evaluation scenarios for the proposed model, which measures air pollution using less expensive means (e.g., POI, meteorological, road network, and traffic data) and without expensive pollution sensors and facilities. It presented details of how the Multiple Data Set Aggregator & Heterogeneity Remover (MDA&HR) removes the heterogeneity of multiple data sources and how the operation of the Train Data Splitter (TDS) part of the model is improved by finding more similar patterns of air quality. We evaluated the models for various viewpoints of the dataset in new scenarios to show the generality of the models. In these new scenarios, one region is considered as training data and the selected region is considered as the test. In the current evaluations, in addition to the previous viewpoints, the zones are defined by a radius of one kilometer around the traffic and air quality sensors. The results confirm the generality of the model and support the part of the model that uses the data with the pattern most similar to the intended area for predictions. In future work, an extension of the proposed framework could be applied to healthcare systems that use fifth-generation (5G) network infrastructures in smart city environments, which could enhance the applicability of the proposed framework [30, 31, 32, 33].
