Abstract
This paper describes an open-source method for generating megacity-scale building height maps without proprietary software or Differential Global Navigation Satellite System surveyed ground control points (GCPs). We use the open-source Satellite Stereo Pipeline (S2P) software along with four scenes of 2.5m resolution Cartosat-1 data for Bengaluru to demonstrate this. Digital Surface Models (DSMs) of 5 m resolution are generated using S2P, and terrain removal is achieved using 30m SRTM data resampled to 5m. The resulting normalized DSM is calibrated and validated using 1270 GCPs. These were generated by counting the number of occupiable floors of buildings and marking zero elevation ground points. The final building height map of Bengaluru covers a total area of about 1420 km2, and across the four scenes, has RMSE values ranging from 2.8m to 3.9m—an error of approximately one floor. Furthermore, we implemented this workflow using stereo imagery for Mumbai, and the RMSE values obtained were comparable to those for Bengaluru. Hence the method we describe is a very cost-effective way of generating megacity-scale building height maps. The height maps generated using this method can be used to better understand numerous urban characteristics including land use intensity and population distribution and can play a crucial role in urban planning and policy making.
Keywords
Introduction
Advances in the field of remote sensing have helped improve our understanding of the structure and nature of cities and trends in urban expansion (Zhu et al., 2019). Despite this, the majority of the literature, which focuses on using satellite imagery to map urban structure and expansion, have restricted themselves to two-dimensional mapping and analysis of cities (Angel et al., 2016; Qiao et al., 2019; Taubenböck et al., 2012). Since cities are three-dimensional entities, an understanding of the third dimension is critical for most urban spatial analysis efforts. Building height data is important for understanding population distribution (Balakrishnan, 2020; Xie, 2006) and energy use within a city (Resch et al., 2016). Information about the third dimension can also help formulate effective urban planning regulations since a major concern of most urban planning regulations is managing the intensity of the land use (Alexander et al., 1988; Churchman, 1999). Understanding this land use intensity at the scale of cities requires detailed building height data, for example, to quantify the total residential built-up space available per hectare or sq. km. of land area.
Studies that focus on building height mapping from satellite data can be grouped into three categories based on scale and resolution. The first consists of studies that demonstrate methods for building height mapping using data for part of a city, at a resolution sufficient to understand intra-urban variations. While there are multiple studies that demonstrate the use of LIDAR for such building height mapping (Bonczak and Kontokosta, 2019; Madhavan et al., 2006; Malpica et al., 2013; Priestnall et al., 2000; Shan and Sampath, 2005), we do not discuss these since at present such methods are expensive and not easily scalable, especially in less wealthy countries where much of the future urbanization is expected to occur. Non-LIDAR based studies typically rely on stereo imagery (e.g., d'Angelo et al., 2010; Poli and Caravaggi, 2013), single-view satellite imagery (e.g., Liu et al., 2020) or high-resolution Synthetic Aperture Radar (SAR) data (e.g., Geiß et al., 2019; Gamba et al., 2000; Marconcini et al., 2014; Sun et al., 2017; Thiele et al., 2006). Of these, although high-resolution SAR data-based methods are scalable, the raw data are currently very expensive in comparison to optical satellite imagery of similar resolution. However, optical imagery-based methods all require either proprietary software, DGNSS surveyed Ground Control Points (GCPs), expensive LIDAR-derived training datasets, or some combination of the above (Bagheri et al., 2018; Xu et al., 2015). Perhaps due to these constraints, these methods are typically demonstrated for a small part of a city.
The second category of studies focuses on urban height analysis at a global scale, but with very coarse data, which has a resolution in the range of 1 km–5 km (Li et al., 2020; Frokling et al., 2013; Mahtta et al., 2019). Such resolutions are too coarse for understanding intra-urban patterns in building height distribution and the intensity of land use.
In comparison, very few publications fall in the third category, which focuses on methods for three-dimensional analysis and can be implemented at the city or regional scale, and with a resolution that is sufficient to understand intra-urban patterns in building height distribution. Existing studies of this nature typically rely on proprietary software (Warth et al., 2019; Wurm et al., 2014). The only study in this category that does not rely on proprietary software is by Frantz et al. (2021). They use LIDAR-derived building height data for more than 15 million individual buildings to train a machine learning model that can generate 10 m resolution building height maps from Sentinel-1 and Sentinel-2 data.
This gap in the current state of the literature is especially significant since satellite stereo imagery of sufficient resolution (SPOT5, Cartosat-1, IKONOS), which can be used for extracting building height information at a city scale, has been available since at least the early 2000s (ESA, n.d.). However, the software tools for processing stereo imagery to generate building height maps are typically proprietary and expensive. In addition, to generate maps of sufficient accuracy, these tools require Differential Global Navigation Satellite System (DGNSS) surveyed ground control points (GCPs), which is also an expensive proposition. Hence, the lack of three-dimensional analysis of entire cities is most likely due to the difficulty and expense involved in processing stereo imagery to generate building height maps. This paper addresses this problem by demonstrating an approach for generating 5 m resolution megacity-scale building height maps using an open-source tool called Satellite Stereo Pipeline (S2P). The method we describe does not require DGNSS surveyed ground control points or any high-resolution training datasets.
Advances in building height mapping
Initial approaches for building height extraction from remotely sensed data involved manual application of photogrammetric methods to single-view oblique aerial photographs. Gay (1956; 1957) demonstrated how this could be done using nomographic charts. Later, some studies attempted to extract building height data based on the shadow length from aerial photographs (Huertas and Nevatia, 1988) and satellite imagery (Cheng and Thiel, 1995; Hartl and Cheng, 1995). But the applicability of such an approach was limited to sparsely located buildings, since in other instances, the shadow may fall partly or entirely on other buildings leading to a “blocked shadow” problem (Cheng and Thiel, 1995). In addition, the process was manual and therefore nearly impossible to perform at a large scale. Since then, satellite stereo imagery has been used to perform building height mapping in a relatively more automated manner.
The term “satellite stereo imagery” refers to image pairs of the Earth’s surface generated by satellites that are equipped to capture multiple images of the same Earth surface region from different points in space (Poli and Caravaggi, 2013). If we have information about satellite location, camera sensor, and its orientation, it is possible to calculate the relationship between points on the Earth’s surface and pixels on the image. Once this is done, the shift or disparity that occurs for any point when viewed from two different locations in space can be computed using the stereo image pair. These disparity maps can be processed to calculate the elevation of each point, thereby yielding a model of the Earth’s surface that includes the terrain and all other objects like buildings and vegetation which may exist on it (d’Angelo et al., 2010). Typically, these models are referred to as Digital Surface Models (DSMs), and when all objects above the Earth’s surface have been removed from it, the resulting terrain model is called a Digital Elevation Model (DEM) (Heidemann, 2018).
Stereo datasets from numerous satellite series, including SPOT, Worldview, IKONOS, KOMPSAT, Quickbird, Cartosat, and Pléiades, have been used to generate DEMs and DSMs using a variety of proprietary software tools such as DLR’s Catena (Krauß et al., 2008; 2013), the Leica Photogrammetry Suite (Shaker et al., 2011), PCI-Orthoengine (Aguilar et al., 2019), opticalSCAPE (Poli et al., 2015), Geomatica and SocetSet (Barbarella et al., 2017), among others. In addition to these tools being expensive, these approaches typically need manual intervention at various stages and use DGNSS surveyed GCPs for calibration and accuracy assessment purposes. At present, there are only two fully automated open-source tools for DSM generation: Digital Automatic Terrain Extractor (DATE) and Satellite Stereo Pipeline (S2P) (de Franchis et al., 2014a; Di Rita et al., 2017). Of these, only S2P currently supports the use of Cartosat-1 datasets. Hence, we use S2P in this paper and describe an approach that can generate building height maps of sufficient accuracy for city-wide analysis without the use of DGNSS surveyed GCPs.
Apart from optical imagery, LIDAR (Light Detection and Ranging) and SAR (Synthetic Aperture Radar) datasets can also be used to understand the third dimension of cities. Both LIDAR and SAR are active remote sensing systems in which laser and radio/microwaves, respectively, are transmitted and reflections are collected by a sensor (NOAA, n.d.; NASA, n.d.). In recent years, several studies have used LIDAR datasets to derive high accuracy building height maps (Bonczak and Kontokosta, 2019; Madhavan et al., 2006; Malpica et al., 2013; Priestnall et al., 2000; Shan and Sampath, 2005). However, in comparison to satellite stereo imagery, LIDAR surveys are extremely expensive to carry out, especially at the scale of megacities in the developing world where there are severe resource constraints. Recent studies have also used high-resolution SAR data to generate building height maps (Geiß et al., 2019; Marconcini et al., 2014). But the requirement of expensive high-resolution SAR data makes such approaches more difficult to implement in resource constrained settings. Frantz et al. (2021) demonstrate a method for generating 10 m resolution building height maps from a combination of freely available SAR and optical data from the Sentinel-1 & 2 satellites. Their method uses LIDAR-derived building height data for more than 15 million buildings to train a machine learning model to generate the building height maps. However, as the authors point out, their model may require further adaptation and training before it can be used in countries that are different from Germany in terms of climatic and urban characteristics. Hence, at present, the method we describe in this paper is the most viable means of generating building height maps at a megacity scale, especially in the context of less affluent countries.
Contribution of this paper
This study describes a complete open-source workflow for generating such megacity-scale building height maps. We use a fully automated open-source tool called S2P for DSM generation (de Franchis, 2015; de Franchis et al., 2014a, 2014b, 2014c) and SRTM data for terrain removal. Our approach can generate building height maps with an RMSE of approximately one floor without the use of expensive DGNSS surveyed GCPs. We demonstrate this using Cartosat-1 data for Bengaluru, a city estimated to have a population of more than 12 million as of 2020 (World Population Review, 2020).
At present, there is no comparable method that can generate height maps at a similar scale, resolution and accuracy, without incurring major expenses in proprietary software, DGNSS surveyed ground control points or more advanced data collection systems like LIDAR or high-resolution SAR. This we believe is a major contribution, especially in the context of less affluent countries where much of the future urban expansion is expected to take place.
Data and methods
Description of Cartosat-1 stereo data
The Cartosat-1 satellite, operated by the Indian Space Research Organization (ISRO), has two 2.5 m resolution panchromatic cameras: one facing forward (Fore) by 26o from the vertical and another facing backwards (Aft) by 5o from the vertical as shown in Figure 1 (National Remote Sensing Centre [NRSC], 2015). As the satellite travels in its orbit at a height of approximately 618 km, first the Fore camera starts capturing a scene on the ground and after about 50 s of orbital travel, the Aft camera starts capturing the same scene from the new position of the satellite (NRSC, 2015). Together the Fore and Aft images comprise the stereo pair for a single ground scene. Schematic diagram showing stereo image acquisition by Cartosat-1 Fore and Aft cameras (based on Evans et al., 2008).
Details of the Cartosat - 1 2.5 m Pan Stereo scenes (Fore & Aft) used in the study.
The process of building height derivation consists of four main steps: DSM generation from Cartosat-1 Stereo data using S2P; nDSM generation by terrain subtraction from the DSM; calibration and validation; and height extraction for built-up features. The details of the workflow are provided in Figure 2. Detailed workflow for building height mapping.
S2P Algorithm
S2P is a fully automated open-source pipeline for stereo image processing. It receives a pair of optical images with their RPC camera models as input and produces a DSM of the area seen in both input images as an output. The algorithm relies on the principle of stereo vision: 3D information is derived from the relative positions of objects seen in two images taken from different viewpoints. The core idea of S2P is to chunk the large input images into small tiles (default size is 800x800 pixels), on which the rather complex RPC camera model can be accurately approximated with a much simpler affine camera model, which in turn allows application of standard computer vision tools for multiview stereo, such as stereo rectification, stereo matching, and triangulation (de Franchis, 2015).
As a first step, for each pair of corresponding image tiles, the RPC camera model of the second image is first corrected with respect to the first image, in order to correct the relative pointing error coming from small inaccuracies of the satellite attitude angles measured during image acquisition (de Franchis et al., 2014b). This is achieved by using a few keypoint matches obtained with SIFT (Otero, 2014). Corresponding pairs of image tiles are then stereo-rectified, that is, resampled in such a way that their epipolar lines all become parallel, evenly spaced and aligned (de Franchis et al., 2014c). Each stereo-rectified pair of image tiles is then matched using a standard stereo matching algorithm, such as MGM (Facciolo et al., 2015).
Before the final triangulation step, the local RPC corrections from all the processed tiles are combined to compute a global correction of the RPC camera model of the second input image relative to the first input image. Each pair of image tiles is then triangulated with the globally corrected RPC camera model, to obtain a 3D point cloud. As the camera model correction is the same for all tiles, there is no discontinuity between the 3D points computed from different tiles. All the resulting 3D point clouds are then projected on the user-selected CRS grid to produce the output DSM.
DSM generation using S2P and Cartosat-1 stereo
All four Cartosat-1 stereo pairs were processed to generate DSMs of 5 m cell size. The DSM provides height values (in meters), with reference to the mean sea level (MSL). The DSMs lacked alignment with known georeferenced layers such as street networks and Cartosat-1 panchromatic data. This is because the RPC files are not accurate enough to enable the precise geolocation of processed DSMs (de Franchis, 2015). All four DSMs were realigned by comparing clearly visible building edges and street intersections in the DSM with the street network layer and high-resolution Google Earth (GE) basemap available as an XYZ tile in QGIS 3.8.0 LTR software. We refer to the realigned DSMs as North 1, North 2, South 1 and South 2 as shown in Figure 3. Digital Surface Model generated from Cartosat-1 Stereo scenes covering the study area.
nDSM generation using DSM and DEM
To generate the nDSM (normalized DSM) representing the feature height values from the Earth’s surface, we need to subtract the terrain from the DSM. For this, we resampled the 30 m SRTM DEM to 5 m using bilinear interpolation and then subtracted this from the DSMs. The computed nDSMs exhibited a clear spatial bias due to RPC errors (de Franchis, 2015; Titarov, 2008). For instance, in Figure 4(a) there is a clear spatial trend visible in the north-west to the south-east diagonal direction. As described in the next section, this was rectified using a GCP dataset generated by counting the number of occupiable floors in buildings. Collected ground control points overlaid with (a) spatial bias and (b) bias-corrected nDigital Surface Model.
nDSM calibration for spatial bias elimination
The collection of GCPs
To calibrate and validate the nDSM, a total of 1270 GCPs were collected through field survey within and outside the municipal corporation boundary of Bengaluru, over a period of 4 days. Of this, about half the GCPs were collected in 2016 and the remaining in 2019. The GCPs were generated by counting the occupiable number of floors in buildings at these locations. Water tanks and other minor structures on the roof were ignored in this process. Since the Cartosat-1 imagery is from 2012, we used historical Google Earth imagery to make sure that only buildings that were existing in 2012 were used in the GCP field survey. The approximate height of the buildings was estimated by multiplying the number of floors assigned to each GCP by a floor height of 3m. Our results show that this is a sufficiently robust way of getting GCPs without measuring the exact heights of the buildings. Figure 3 shows the distribution of our GCPs.
During GCP collection, we focused only on residential structures as they possess an average height of 3 m, whereas, in the case of commercial structures, mainly factories, the warehouse kind of structures possess high ceilings in a single story structure itself. Some GCPs were also collected from open areas and roadways (through field survey as well as using Google Earth), and these were assigned a floor and height value of zero.
GCP calibration/validation split and outlier removal
Ground control points statistics when two SD was used as threshold for outlier removal (Bengaluru).
Cal: calibration; Val: validation.
Ground control points statistics when three SD was used as threshold for outlier removal (Bengaluru).
Cal: calibration; Val: validation.
A first-degree polynomial trend surface (TS) is then fit using Errorc1 values. Since there is a possibility of errors in the calibration and validation GCPs, we removed outliers in all GCPs based on this trend surface. Errors in GCPs can arise either from GCPs not being located on the correct building or because trees could be covering the roofs of some lower height GCPs. We found no significant pattern in the location of outliers, and they were found in diverse types of neighborhood settings ranging from low-rise high-density neighborhoods to those with tall high-rise buildings.
For outlier removal, we first computed the difference between the trend surface (TS) and Errorc1 as shown in equation (2) to calculate Errorc2.
We then computed the mean and standard deviation (SD) for Errorc2. Calibration GCPs for which Errorc2 values are more than two SDs from the mean were then removed. The process was repeated for the validation GCPs (GCPv) as per Equations (3) and (4) to generate a cleaned validation GCP set.
It is important to note that the trend surface above was computed only using Errorc1 derived as per equation (1) using only calibration GCPs. Hence, the validation GCPs do not influence the trend surface-based calibration process. In addition, although outlier removal of validation GCPs is conducted using Errorv2, the mean and SD used to identify outliers in the validation GCPs are computed only using Errorc2.
The cleaned calibration GCPs were then used to generate a new first-degree polynomial trend surface, using the difference between the cleaned calibration GCP heights and predicted heights given by the nDSM values. This second trend-surface was combined with the original nDSM to remove spatial bias. The resulting bias-corrected nDSM had several pixels with null values due to occlusion in the stereo imagery. These null value cells were converted to zero to complete the nDSM generation process.
After correcting the nDSM for spatial bias, we extracted the elevation values for built-up cells to generate the building height layer. The binary built-up layer was derived from the 5m LISS-IV data.
Accuracy assessment
The cleaned validation GCP set was used for accuracy assessment. The predicted height values for these GCPs were extracted from the building height layer. The difference between this predicted height and the actual height was then used to compute the root-mean-square error (RMSE) and mean absolute error (MAE). The RMSE and MAE values for all the four scenes are presented in Tables 2 and 3. A binned scatter plot for the cleaned validation points from all four scenes is shown in Figure 5. Binned scatter plot of predicted and actual building height with 1m bins. Graph combines values for validation points from all four scenes of Bengaluru. Dark color represents higher density of the points in the bin and vice-versa.
Results and discussion
Table 2 shows the number of calibration and validation outliers, cleaned calibration and validation points and the RMSE and MAE when using 2SD from the mean of Errorc2 as the threshold for outlier removal. Table 3 shows the corresponding values when using 3SD from mean of Errorc2 as the threshold for outlier removal. As evident from Table 3, even when using 3SD as the threshold for the outlier removal, the RMSE is still in the range of 3.3–4.2 m, which is not considerably more than a floor.
On investigating the distribution of the error, we found that the variance is higher for low-rise buildings (Figure 5). This is most likely arising from the use of SRTM data for terrain removal since the 30 m SRTM dataset has an error of approximately 3–4 m in urban areas (U.S. Geological Survey, n.d.). In future work, we intend to investigate if open-source ground point filtering algorithms can improve the accuracy of our building height maps. Another reason for higher variance for low-rise buildings could be because the disparity or shift that happens when these low-rise buildings are viewed from two points in space is often too subtle to be adequately captured by 2.5 m resolution stereo imagery.
Nevertheless, as indicated by our results, city-wide RMSE values are in the range of one floor and hence the method we describe is viable for city-scale analysis. For instance, as one of our co-authors has shown, the above building height map is a valuable input for generating high-resolution population distribution maps (Balakrishnan, 2020).
After accuracy assessment and extraction of values for built-up cells only, all the four resultant building height maps were merged to create the final building height map for the entire city (Figure 6). Figure 7 compares the raw Cartosat-1 Stereo scenes, the Fore and Aft images, with the GE image and the final building height map for the same area. Zoomed in images of the building height map and corresponding GE image for two more areas of Bengaluru are shown in Figure 8. The maximum building height value obtained for Bengaluru is 123 m. Building height map for area within the municipal limits of Bengaluru. Comparison of the Cartosat-1 Stereo (a) Aft & (b) Fore scene with (c) Google Earth image of 2011 and (d) final building height map for area highlighted with rectangle 2 in Figure 6. Comparison of Google Earth image of 2011 with the building height map for areas highlighted with rectangles 1 and 3 in Figure 6.


Building height distribution across Bengaluru
Figure 6 shows some distinct patterns in the building height distribution across Bengaluru. Large clusters of tall buildings (red) can be clearly seen towards the eastern and south-eastern edge of the city, and also in the central, southern, and north-western parts. Some clusters of high rise buildings can also be seen towards the southern part where a patch of urban growth extends beyond the municipal boundary. In general, the south-eastern quadrant of the map has considerably more high rise buildings than other parts.
The south-eastern quadrant, and especially the periphery of the city from south to east, has witnessed considerable construction activity since the early 2000s as a result of growth in the Information Technology sector in the city. As a result, this area has numerous large apartment buildings and office towers, and this is clearly visible in our building height map.
Several dense patches of medium rise (yellow) buildings can also be seen towards the center, and west, south-east and north-east of the center. While these are dense neighborhoods where most buildings are of 4–6 floors, there is a distinct patch of lower rise buildings (deep green and blue) just south of the center. This consists of relatively older and affluent neighborhoods, which mostly have independent homes of 2 or 3 floors.
Figure 7 shows part of the eastern edge of the city, which has numerous high rise buildings. As evident from these images, the individual buildings can be identified clearly in our building height map. Figure 8 shows a closer view of two different patches from the northern and central parts of the city. A comparison with GE imagery shows that our building height map is able to capture the diversity of building heights in these neighborhoods.
To examine the generalizability of our method, we implemented our workflow using data for Mumbai, which is a city of roughly 12.5 million inhabitants as of 2011 (Census of India, 2011). Figures 9 and 10 show the building height maps for the entire city of Mumbai and for closer views of two different patches of the city. Table 4 lists the number of calibration and validation outliers, cleaned calibration and validation points and the RMSE and MAE when using 2SD for outlier removal for all four Cartosat-1 scenes used for Mumbai. As evident from Table 4, the accuracy is comparable to Bengaluru and indicates that our method is generalizable to other Indian cities. Building height map for area within the municipal limits of Mumbai. Comparison of Google Earth image of 2011 with the building height map for areas highlighted with (a) rectangle 1 and (b) rectangle 2 in Figure 9. Ground control points statistics when two SD was used as threshold for outlier removal (Mumbai). Cal: calibration; Val: validation.

Future work
We intend to test our method in other cities with very different morphological characteristics so that the generalizability of the approach can be better established. For instance, in some cities, the presence of urban canyons, which may lead to more occlusion in stereo imagery, may be a concern. In such cases, more stereo scenes could be required to generate building height maps of acceptable quality. The method we describe is easily replicable since the software is open-source, Cartosat-1 imagery is readily available and the overall workflow is inexpensive to implement. Since S2P is compatible with stereo data from several satellites, many of which provide higher resolution data, the method we have outlined in this paper can also be used to generate higher resolution building height maps. We hope that the ease of replication of our method will encourage other researchers to test out our workflow in other types of urban settings.
Conclusion
Although satellite stereo imagery has been available for more than two decades, much of the literature focusing on the structure and nature of urban growth has focused on two-dimensional analysis of cities. This is most likely due to the need for proprietary software and DGNSS surveyed GCPs for accurate building height map generation from stereo imagery. In this paper, we described a workflow that is not dependent on proprietary software or DGNSS surveyed GCPs. We demonstrated an implementation of our workflow using 2.5 m resolution Cartosat-1 stereo imagery for Bengaluru and Mumbai. We used the open-source software called Satellite Stereo Pipeline for stereo image processing and a GCP dataset of floor counts of buildings for calibration and accuracy assessment. Since the RMSE error in our building height map is of the order of one floor, our workflow should be sufficient for most city-scale analysis. Given that much of the future urban expansion is expected to happen in developing countries with severe resource constraints, a fully open-source and economical solution to stereo image processing and building height mapping is critical to enable new types of three-dimensional urban spatial analysis that can help plan and manage these cities better.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was completed with support from the PEAK Urban programme, funded by UKRI’s Global Challenge Research Fund, Grant Ref: ES/P011055/1.
