Abstract
In this article, a new method called spatial amplifier filtering is proposed. The presented method is related to Moran eigenvector filtering and allows the accentuation of spatial structures in heterogeneous data sets. The spatial amplifier filtering technique is based on the inclusion of certain eigenvectors of a spatial weights matrix into a regression model. The application of this method can be seen as a pre-processing step prior to subsequent analyses, and as a means of separating different types of spatially correlated components in a data set. For this purpose, three different types of so-called spatial amplifiers are proposed, each consisting of different subsets of eigenvectors of the weights matrix. These amplifiers can emphasise positive spatial autocorrelation, negative spatial autocorrelation, or spatial structuring in general. In this way, it is possible to make desired spatial structures more visible, especially in spatially highly mixed data sets, whereby the focus here is on geosocial media data. In the empirical part of the article, it is first shown why georeferenced social media data are difficult to handle from a spatial analysis perspective, motivating the need for the method proposed. Subsequently, the technique of amplifier filtering is applied to two data sets: a census data set from Brazil and Twitter data from London. The results obtained show that the method is capable of strengthening existing spatial structures and mitigating potentially disturbing spatial randomness patterns and other nuisances. This facilitates the interpretation especially of the Twitter data used. While the analysis of the unfiltered Twitter data with established methods reveals little information about possible spatial structures in the tweets, the filtered data offer a much clearer picture with distinguishable clusters. In addition, the method also provides insights into the internal irregularity of spatial clusters and thus complements the toolbox for investigating spatial heterogeneity.
Introduction
Over the last decade and a half, the spatial, social and planning sciences have been confronted with a steadily growing number of georeferenced data sets, many of which are user-generated (Goodchild, 2007; Mocnik et al., 2019). There are many ways in which we now routinely leave our digital traces in everyday life, for instance, when we use public transportation, shop online and use cellular networks. Moreover, never before have so many people been able to produce and consciously publish digital content by mirroring parts of their lives into the so-called cyberspace. Novel platforms have thus emerged that allow users not only to use content but also to contribute their own data, turning users into ‘produsers’ (Coleman, 2009; Haklay et al., 2008; Ritzer et al., 2012). Among the spatial examples of those platforms are citizen science projects such as Geograph or eBird (Dykes et al., 2008; Sullivan et al., 2009) and platforms for the collection of largely topographic information such as OpenStreetMap or Wikimapia (Bennett, 2010; Ballatore and Jokar Arsanjani, 2019). Social media feeds like Twitter are related to these, but spatial information is often only a secondary feature and not necessarily at the core of the intended communication (Leetaru et al., 2013; Steiger et al., 2016c). The intrusion of all these digital tools and technologies into our everyday lives has had a profound impact on both society and academia: Our boundaries of privacy have shifted (Tasse et al., 2017), the ways we interact with and communicate places have changed (Kitchin and Dodge, 2011; Kitchin et al., 2017; Li et al., 2018; Poorthuis and Zook, 2013), some people have found new ways to create and curate a digital alter ego including its distinct spatial aspects (Saker, 2017), and researchers from a variety of disciplines have used the outlined novel, often easily accessible information for their research (cf. Sarker et al., 2015; Toivonen et al., 2019; Wang and Ye, 2018).
This article is a methodological contribution within the context outlined and focuses on the so-called ambient geospatial information (Stefanidis et al., 2013), that is, on data from platforms to which people may unconsciously contribute spatial information en passant, whereby special emphasis is placed here on Twitter data.
Data sets like those extracted from Twitter are not produced according to scientific principles. Instead, such data sets reflect everyday circumstances, which makes it difficult to work with them in scientific contexts. Several issues, including geographical ones, have been identified that are caused by the underlying data production mechanisms. A frequently cited critique of the use of social media data is its lack of representativeness. The average user of social media is young, affluent and educated above average, and thus not necessarily representative of large parts of society (Li et al., 2013; Longley et al., 2015). Another important challenge concerns the validity of the results obtained from social media data with regard to the geographical area from which the latter are taken. It has been shown that a significant number of contributions come from users who are not resident in the areas studied (Johnson et al., 2016), although the extent of these contributions varies depending on the social media platform and city investigated (Rzeszewski and Beluch, 2017). Another important geographical concern is the self-selection bias in the contribution patterns of users. Many users deliberately communicate their whereabouts selectively in order to improve their digital self-representation, resulting in an unrepresentative coverage of places represented in social media data sets (Barkhuus et al., 2008; Evans, 2011; Saker, 2017). In addition, a pronounced underrepresentation of rural areas is evident on many platforms (Hecht and Stephens, 2014; Mislove et al., 2011), as well as digital divides between the Global North and South, and also within cities (Kelley, 2013; Otioma et al., 2019). The use of geosocial media data in geographical, sociological and especially policy-relevant planning research therefore remains promising yet challenging.
All of the outlined societal and demographic factors have an impact on the spatial analysis of social media data. Moreover, there are also problems more technical in nature that occur with geosocial media data sets (see Steiger et al., 2015a for an overview). One such problem is that the scales of the phenomena reflected in social media data are unknown (Sester et al., 2014). For example, it has been shown how the scales suggested by Twitter data tagged with a hashtag indicating the victory of a basketball team have changed over time (Crampton et al., 2013). This lack of knowledge about scales leads to an overrepresentation of granular analytical scales (Westerholt et al., 2015), as urban topographies from which Twitter users send their messages induce this geometric pattern in the data sets. As different phenomena captured on possibly differing scales are reflected simultaneously in social media data sets, this has profound implications for the assumptions of stationarity and the inference mechanisms of spatial analysis techniques applied to such data (Westerholt, 2019; Westerholt et al., 2018, 2016a), as well as with respect to a general mismatch between analysis and phenomenon scales (Zhang et al., 2014). Many spatial analyses of geosocial media data are carried out in aggregated form (e.g. de Andrade et al., 2021; Yan et al., 2017). The effects of the technical problems outlined above are then less acute due to the averaging nature of such analyses, but this changes if these are to be conducted in a non-aggregated form. This article aims to mitigate the technical impact of some of the effects outlined especially for non-aggregated analysis of geosocial media data.
This article presents an approach to accentuate spatial structures in heterogeneous geosocial media data sets. The presented method is related to Moran eigenvector filtering (Getis and Griffith, 2002; Griffith, 1996, 2000). Eigenvectors of the spatial weights matrix are used in the proposed solution to filter a variable of interest with respect to different kinds of unwanted spatial effects, thereby enhancing the structures actually contained. In contrast to Moran eigenvector filtering, the constructed filters called ‘spatial amplifiers’ here use eigenvectors associated with spatial randomness and either negative or positive spatial autocorrelation. In this way, spurious patterns caused by different phenomena reflected simultaneously and in close spatial proximity in the data are attenuated, while the adjusted and thus more pronounced clustering or repulsion effects are captured in the residuals of the filtering model. In the following, the article first introduces the solution briefly summarised above. The approach is then tested with two real-world data sets including a Twitter sample covering one year of London tweets and Brazilian census data. The latter are used to evaluate the proposed methodology based on a known and therefore easier to interpret data set. Application of the proposed filtering to the highly autocorrelated census data intensifies apparent structures and reveals details about the internal spatial heterogeneity of clusters. In contrast, applying the same filters to the Twitter data set allows clearer identification of spatial structures in the otherwise spatially heterogeneous raw data.
Spatial amplifier filtering
The novel approach put forward in this manuscript is inspired by Moran eigenvector filtering (MEF) (Getis and Griffith, 2002; Griffith, 1996, 2000). The MEF technique builds on the eigenfunction representation of Moran’s I as described by Tiefelsdorf and Boots (1995, 1996). The eigendecomposition of the spatial weights matrix determines both the feasible range and shape of the distribution of possible Moran’s I values under random conditions, that is, assuming no systematic spatial structuring. A direct connection hence exists between the eigenvectors of the spatial weights matrix and the spatial structures that can be represented (and disclosed) using that matrix. The technical principle of MEF is to regress an attribute vector
The approach proposed in this paper is based on the reverse thinking of the MEF principle. Instead of using those eigenvectors associated with strong spatial autocorrelation, the idea put forward here is to construct selective filters in order to amplify certain structures of interest that may be contained but hidden in spatially heterogeneous data like that extracted from geosocial media. Following the proposal from Griffith (2006) to include both eigenvectors indicative of positive (
The amplifier
The amplifier
Analogously, the
Models with spatial amplifiers are complete in terms of the spatial effects to be controlled. The uncorrelated nature of the eigenvectors and their ability to collectively capture all possible spatial patterns that can be represented by the spatial weights matrix used make it possible to highlight certain spatial features of interest. The residuals thus represent the share of variability that cannot be explained by spatial lags associated with spatial randomness and either negative or positive spatial autocorrelation, depending on the type of amplifier. We therefore find in the residuals the unexplained variability that is either related to positive (negative) spatial autocorrelation or to possibly unknown non-spatial factors. Analogous to MEF, the latter could be included in the model specification in addition to the spatial amplifier if required. However, this is perhaps not as interesting as in the case of MEF, since the goal of using spatial amplifier filtering is not to explain a variable in terms of regressors, but to maximise the noticeability of certain types of spatial patterns for the exploratory detection of structures of interest.
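The filtering principle described above can be illustrated with a minimal Python sketch. It is not the implementation used in this article: it assumes the doubly centred, symmetrised weights matrix that is customary in Moran eigenvector filtering, and it selects eigenvectors by a hypothetical tolerance band `tol` around the expected value of Moran’s I rather than by the selection rule described in this section. The function names are likewise illustrative.

```python
import numpy as np


def moran_spectrum(W):
    """Return (Moran's I attained by each eigenvector, eigenvectors) of M W M."""
    n = W.shape[0]
    Ws = 0.5 * (W + W.T)                     # symmetrise so the spectrum is real
    M = np.eye(n) - np.ones((n, n)) / n      # centring projector
    vals, vecs = np.linalg.eigh(M @ Ws @ M)
    return vals * n / Ws.sum(), vecs         # rescale eigenvalues to Moran's I


def moran_i(y, W):
    """Global Moran's I of y under the symmetrised weights."""
    z = y - y.mean()
    Ws = 0.5 * (W + W.T)
    return len(y) / Ws.sum() * (z @ Ws @ z) / (z @ z)


def amplifier_residuals(y, W, keep="positive", tol=0.05):
    """Regress y on the eigenvectors that are NOT of interest; return residuals.

    keep="positive": filter out randomness and negative autocorrelation.
    keep="negative": filter out randomness and positive autocorrelation.
    `tol` is an assumed half-width of the band treated as spatial randomness.
    """
    n = len(y)
    morans, vecs = moran_spectrum(W)
    expected = -1.0 / (n - 1)                # E[I] without spatial structure
    if keep == "positive":
        mask = morans <= expected + tol
    elif keep == "negative":
        mask = morans >= expected - tol
    else:
        raise ValueError("keep must be 'positive' or 'negative'")
    X = np.column_stack([np.ones(n), vecs[:, mask]])   # intercept + amplifier
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # tolerant of collinearity
    return y - X @ beta
```

Because the eigenvectors are mutually orthogonal, Moran’s I of the residuals is a weighted average of the Moran’s I values of the eigenvectors that were left out of the model, which is why the retained type of structure becomes more pronounced in this sketch.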
A question that arises with all the amplifiers presented is the selection of the eigenvectors for inclusion in the respective filter. The aim of amplifier filtering is to eliminate certain parts of a spatial pattern in order to amplify another. To achieve this without inadvertently removing a potentially meaningful spatial pattern, the eigenvectors contained in
Empirical illustrations
This section provides empirical insights into the use of the amplifier filtering approach as proposed in this article. The method is applied to two real data sets, both of which are described in more detail in the following subsection. Afterwards, the functioning and behaviour of amplifier filtering are demonstrated.
Data sets
One of the two data sets used is a data set from the 2010 Brazilian census, which can be downloaded from the Brazilian statistical office (IBGE, 2010). The data represent the residents of the city of Recife and consist of 4234 census tracts (so-called enumeration areas) covering the city’s administrative territory including its rural peripheries (see Figure 1(a)). The variable used is the average total disposable monthly income of households measured in Brazilian reais. The census data set represents the case of systematic data collection according to a consistent methodology. Each urban tract covers between 300 and 350 households all of which were assessed by the same human enumerator. Similarly, each rural tract contains about 150 agricultural holdings. Using this data set, it is possible to understand the results obtained from applying amplifier filtering to relatively consistent data. The main purpose of using the census data is hence to serve as a counterpart to the tweets introduced below, which represent the opposite case of a heterogeneous data set.

Figure 1. Illustration of the data sets used in this article. Background map data copyrighted OpenStreetMap contributors and available from https://www.openstreetmap.org. (a) Census data from Recife, Brazil. (b) Tweets from Canary Wharf, London, UK.
The Twitter data set used is an extract from a collection of tweets that was already used in previous research. It is publicly available from Westerholt et al. (2016b), and a detailed description of all pre-processing steps that are briefly described in Table 1 is provided by Steiger et al. (2015b). The original full sample of 20 million tweets covers one year of georeferenced Twitter data from the Greater London area, UK. The sample was collected using the Twitter real-time Application Programming Interface and it includes mainly tweets positioned via Global Positioning System (GPS) and Wi-Fi. As a first step, the text messages of the tweets were converted into tokens before being cleaned of stop words, white space and punctuation marks. The final pre-processing step was to apply stemming to make the text messages more consistent for further semantic processing. The attribute value used in this paper is the association of the tweet messages with a latent work topic that reflects tweets about commuting, office life and related topics. These topic associations were obtained by using latent Dirichlet allocation (Blei et al., 2003), a commonly used machine learning algorithm for detecting latent semantic features in text corpora. In this article, a geographic subset is used to facilitate the interpretation of the results of the amplifier filtering. The subset covers the Canary Wharf business district in East London (671 tweets; mapped in Figure 1(b)).
Table 1. Description of the steps for pre-processing the Twitter data (based on Steiger et al., 2015b).
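The text pre-processing steps named above (tokenisation, removal of stop words, white space and punctuation, and stemming) can be sketched as follows. This is an illustration only: the stop-word list and the suffix-stripping stemmer are simplified, hypothetical stand-ins and not the tools documented by Steiger et al. (2015b), and the subsequent latent Dirichlet allocation step is not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "on", "the", "to"}

# Naive suffix stripping; a real pipeline would use an established stemmer.
SUFFIXES = ("ing", "ed", "s")


def stem(token):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lower-case, tokenise on letter runs, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())   # drops punctuation, digits, spaces
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("Commuting to the office, again!"))
```

Note that tokenising on letter runs also discards digits and splits hashtags and URLs, which may or may not be desirable for a given corpus.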
Spatial autocorrelation in the raw data
An application of spatial statistical methods using actual data is intended to motivate the amplifier filtering approach outlined. One way to statistically assess spatial structures is to quantify spatial autocorrelation, for which the spatially weighted covariance measure Moran’s I (Anselin, 1995; Cliff and Ord, 1969) is one of the most commonly used estimators. The correlogram for the tweets shown in Figure 2(a) is based on Moran’s I and illustrates the spatial autocorrelation within the tweets in terms of five-nearest-neighbour spatial lags. The correlation shows no noticeable trend or pattern across the lags tested, and none of the values shown is statistically significant. The line remains flat oscillating around zero, and this behaviour is also retained for higher order lags when the number of bins is increased. In contrast, the correlation calculated for the Recife census tracts also shown in Figure 2(a) shows strong positive spatial correlation at close vicinity, whereby the spatial weights are based on binary contiguity. This behaviour reflects clustering on the small-scale urban topography, but it is also retained for higher order lags. All Moran’s I values shown in the correlogram for the income variable are highly significant at

Figure 2. Correlograms and Moran scatterplots calculated for the work topic associations of tweets from Canary Wharf, London, and the income variable for Recife taken from the Brazilian census. Both correlograms are based on Moran’s I and the whiskers indicate twice its standard deviation in the null hypothesis. The lags are based on the five nearest neighbours for the tweets and binary contiguity for the income variable. The slopes of the red trend lines are indicative of the strength of spatial autocorrelation. (a) Correlograms for the work-topic associations of tweets and the census income variable. (b) Moran scatterplot for the tweet topic association; I = –0.0117, p = 0.651. (c) Moran scatterplot for the census income variable; I = 0.749, p < 0.01.
The lack of spatial structure in the tweets is further supported by the Moran scatterplots shown at the bottom of Figure 2. The scatterplot for the census tracts indicates a strong positive spatial autocorrelation (Figure 2(c)), which is strongly driven by the accumulation of high values. Therefore, most of the data points are grouped in the first quadrant, and the steep trend line visualises the high and significant Moran’s I of 0.749. The picture is less clear in the Moran scatterplot derived from tweets (Figure 2(b)). The data points are organised around the centre and occupy all quadrants. There are also no local outliers visible, which is an important observation that suggests that the results from the correlograms outlined above are not mere artefacts of the averaging character of global Moran’s I. The trend line remains flat, which is reflected in the low and non-significant I of –0.0117. Overall, it can be said that the sophisticated spatial constitution of geosocial media data caused by their complex production process, as found in the tweets analysed here, is difficult to investigate with the standard spatial statistical tools available.
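The relationship between the trend line of a Moran scatterplot and global Moran’s I can be made concrete with a short Python sketch. It assumes row-standardised k-nearest-neighbour weights, for which the slope of the regression of the spatial lag on the standardised variable coincides with Moran’s I. The helper names are illustrative, and the inference reported in Figure 2 is not reproduced.

```python
import numpy as np


def knn_weights(coords, k=5):
    """Row-standardised k-nearest-neighbour weights from point coordinates."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]       # indices of the k closest points
    W = np.zeros_like(d)
    W[np.arange(len(coords))[:, None], idx] = 1.0 / k
    return W


def moran_i(y, W):
    """Global Moran's I for an arbitrary weights matrix W."""
    z = y - y.mean()
    return len(y) / W.sum() * (z @ W @ z) / (z @ z)


def moran_scatter(y, W):
    """Return standardised values, their spatial lags and the trend-line slope."""
    z = (y - y.mean()) / y.std()
    lag = W @ z
    slope = np.polyfit(z, lag, 1)[0]         # least-squares trend line
    return z, lag, slope


rng = np.random.default_rng(1)
coords = rng.uniform(size=(40, 2))           # synthetic point locations
values = rng.normal(size=40)                 # synthetic attribute values
W = knn_weights(coords, k=5)
z, lag, slope = moran_scatter(values, W)
print(slope, moran_i(values, W))             # the two numbers agree
```

The quadrant of each point in the scatterplot follows from the signs of `z` and `lag`, which is also what the cluster maps later in the article are based on.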
Spatial amplifier filtered Recife census data
The application of the amplifier

Maps of standardised values and LISA (local indicators of spatial association) clusters for the raw and amplifier-filtered income variables in Recife, Brazil. All maps are focused on the central district of Jaqueira to improve their readability. The cluster maps are based on local Moran’s I and illustrate the nature of the significant spatial associations in terms of the quadrants of the corresponding Moran scatterplots. The Benjamini–Hochberg method (Benjamini and Hochberg, 1995) was used to correct for possible multiple hypothesis testing problems in drawing inferences. (a) The standardised income variable. (b) Clusters (income); I = 0.749, p < 0.01. (c) The standardised residuals from

Moran scatterplots calculated for the amplifier filtered income variable for Recife. The slopes of the red trend lines are indicative of the strength of spatial autocorrelation. (a) Moran scatterplot for the
Applying the filter
A noteworthy diagnostic feature of using both amplifiers in a combined fashion (that is, looking at both results outlined above simultaneously) is the observation of spatial units that show contradictory behaviour between the two filtering results. Some spatial units show equally positive or negative values for both types of filtered residuals. This behaviour often indicates variation that remains unexplained because the spatial neighbourhood relationships are not the driving factors. Instead, non-spatial factors can be the driving forces of the generated residuals in these cases. This happens, for example, when typical clusters of high-income values are interrupted by a different form of land use. The light-green low-high feature found in the northern part of the high value cluster in Figure 3(d) and (f) is one example of this. The corresponding census tract contains a large psychiatric hospital, which affects the local structure of the income variable in non-spatial ways. The detection of such conflicting behaviours using both types of spatial amplifiers can be helpful in interpreting the results of spatial analyses, as they point to potentially deviating and therefore technically problematic spatial units.
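The combined diagnostic described above can be sketched in a few lines of Python. The sketch assumes that two residual vectors, one per amplifier, are already available, and it flags units whose scaled residuals share the same sign and exceed a hypothetical magnitude threshold `min_abs`. The article does not prescribe such a threshold, which is introduced here only to suppress values close to zero.

```python
import numpy as np


def flag_conflicting_units(res_pos, res_neg, min_abs=1.0):
    """Flag units with same-signed, sizeable residuals under both amplifiers."""
    res_pos = np.asarray(res_pos, dtype=float)
    res_neg = np.asarray(res_neg, dtype=float)
    zp = res_pos / res_pos.std()             # scaling only, signs are preserved
    zn = res_neg / res_neg.std()
    same_sign = np.sign(zp) == np.sign(zn)
    sizeable = (np.abs(zp) >= min_abs) & (np.abs(zn) >= min_abs)
    return same_sign & sizeable


# Synthetic residuals for six spatial units (illustrative values only).
res_pos = [2.0, -2.0, 0.1, 1.5, -1.5, -0.1]
res_neg = [2.0, -2.0, 0.1, -1.5, 1.5, -0.1]
print(flag_conflicting_units(res_pos, res_neg))
```

Flagged units are candidates for closer inspection with local knowledge, as in the hospital example above.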
Applying both filters to the easily interpretable census data shows that they work as expected. The
Spatial amplifier filtered London Twitter data
The Twitter data are more heterogeneous than the census data set. This is illustrated in Figure 5(a), which shows a map of the standardised original topic association variables in the Canada Square area of Canary Wharf, London. High- and low-attribute values appear spatially mixed and it is almost impossible to identify clear structures. There are certain trends, such as a generally higher level of topic association in the central part near Canada Square Park and an area with generally lower levels around Jubilee Park. However, only a few of those structures are identified statistically as spatial clusters using local Moran’s I, as can be seen in Figure 5(b). Very few tweets are flagged as significant, and some of them show negative autocorrelation indicating the co-location of diverse, repelling tweets. Overall, it is hard to gain insights regarding possible spatial structures from applying Moran’s I to the plain work topic associations.

Figure 5. Maps of standardised tweet work topic associations and corresponding LISA (local indicators of spatial association) clusters in Canary Wharf, London. The focus is on the central area around Canada Square to improve readability. The cluster maps are based on local Moran’s I and show the nature of significant spatial associations by means of Moran scatterplot quadrants. The Benjamini–Hochberg method (Benjamini and Hochberg, 1995) is used in the inferences to correct possible multiple hypothesis testing problems. Background map data copyrighted OpenStreetMap contributors and available from https://www.openstreetmap.org.
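The Benjamini–Hochberg correction mentioned in the caption can be sketched as follows. This is a generic implementation of the step-up procedure applied to a vector of p-values, for instance those of local Moran’s I statistics. The function name and the significance level `alpha` are illustrative assumptions, as the level used for the maps is not restated here.

```python
import numpy as np


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                            # ranks p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m     # alpha * rank / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])             # largest rank under its threshold
        reject[order[: k + 1]] = True                # reject all smaller ranks too
    return reject


print(benjamini_hochberg([0.6, 0.039, 0.001, 0.036, 0.035]))
```

The step-up character matters: a p-value that fails its own threshold is still rejected if a larger p-value further up the ranking meets its threshold.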
Using

Moran scatterplots calculated for the amplifier filtered topic associations for Canary Wharf, London. The slopes of the red trend lines are indicative of the strength of spatial autocorrelation. (a) Moran scatterplot for the
The residuals obtained from applying filter
The results obtained with the amplifier filtering are promising, but they are also subject to limitations. One limitation of the proposed approach is that it can only filter out variability that can be explained by spatial configurations. In many practical cases, however, other factors will be important confounders as well, and it will often be necessary to identify these using local and domain-specific knowledge. Once identified, such confounders can be added to the filtering model as additional explanatory variables, as has been briefly outlined in the methodological section. Since such extra variables may be spatially structured, which would violate model assumptions, it may be necessary in practice to pre-filter them either by Moran eigenvector filtering or by one of the filters proposed in this paper. A general limitation of the approach put forward that arises in connection with geosocial media data is that such data are generally subject to a high degree of uncertainty. The text components of tweets are rather short, which makes the application of natural language processing algorithms a challenge. Therefore, despite extensive pre-processing, the highest work topic associations found in the data set used are still relatively low (in the middle 40% range). Of course, this problem is only indirectly related to spatial amplifier filtering as a technique, but it makes it difficult to interpret the results that can be obtained in the target application domain. Overall, however, the results obtained demonstrate the usefulness of spatial amplifier filtering as a possible pre-processing step in geosocial media analysis and beyond.
Conclusions
This article introduced spatial amplifier filtering, a technique for emphasising spatial structures of interest in spatially challenging data sets. The approach is based on the inverse principle of Moran eigenvector filtering by using eigenvectors associated with spatial randomness and, additionally, those associated with either positive or negative spatial autocorrelation. These are included in a regression model to account for the variability explained by uninteresting or spurious spatial configurations. In the residuals, an accentuated version of the original data remains in which the spatial structures of interest are more pronounced. In this way, it is possible to reduce potential confounding effects such as spatial randomness due to inadequate data acquisition protocols or negative spatial autocorrelation that may be caused by measurement errors. The method was tested using a census and a Twitter data set. In both cases, the results showed that the method works in the desired manner and leads to encouraging results. While the apparently positive spatial autocorrelation structures were reinforced in the case of the census data, it was possible to largely remove the disturbing spatial randomness from the Twitter data set. Furthermore, the census example has shown that the filtering process can also be used to reveal internal heterogeneity within spatial clusters. It has thus been shown that the method presented in this article can be a useful addition to the spatial analytical toolkit, especially for pre-processing complex spatial data sets such as those from geosocial media.
With regard to the intended main field of application, the proposed method can lead to a better understanding of the geospatial properties of social media data in particular and of human-generated data about everyday life in general. The proposed filtering approach helps to overcome a significant hurdle in the non-aggregated and thus fine-grained geographical analysis of geosocial media data. Many geosocial media analyses are conducted in aggregated form, but this poses the modifiable areal unit problem (MAUP) (Openshaw, 1984), a longstanding but still unsolved challenge in the spatial sciences (Wolf et al., 2020). The MAUP not only raises problems related to the arbitrariness of aggregation units. In the case of geosocial media data, the MAUP is also responsible for hiding much of the spatial complexity that is visualised in the illustrations throughout this article. These problems, however, do not vanish altogether by mere aggregation. They only become less obvious and, in the worst case, still have an unrecognised influence on the analysis results obtained (Westerholt, 2019; Westerholt et al., 2015, 2016a). The use of spatial amplifier filtering supports the non-aggregated analysis of geosocial media and other types of human-generated data. It can thus make an important methodological contribution to reducing the impact of the MAUP, to the better geospatial understanding of geosocial media data, and hence to a more meaningful use of the much criticised social media data in academic studies.
The method presented in this article is not limited to geosocial media data. The empirical demonstrations have shown that even the application to traditional data sets can reveal interesting structures that might otherwise have gone unnoticed. Furthermore, there are other crowdsourcing and user-generated data sets, some of which share certain characteristics with geosocial media data (Mocnik et al., 2019; See et al., 2016). It would be instructive to see how spatial amplifier filtering works with these closely related types of data sets, not only to possibly gain clearer insights into their spatial structuring but also to learn more about the behaviour of the proposed methodology in different contexts. The latter also includes further systematic investigations for a more in-depth interpretation of the results of the amplifier filtering, a topic related to the concept of grounding in the sense of linking the results to their actual environments (Harnad, 1990; Mocnik et al., 2018). Other areas of application include analysing data from georeferenced surveys and citizen sensing (Bluemke et al., 2017; Boulos et al., 2011), investigating complex human-geographical phenomena such as the spread of traditions (Caldwell and Eve, 2014; Mocnik, 2018), informal settlements in the Global South (Klemmer et al., 2020; Thomson et al., 2020) and many others. Another area to which the proposed filter for emphasising negative spatial autocorrelation in particular can contribute is spatial heterogeneity research (Ord and Getis, 2012; Westerholt et al., 2018; Xu et al., 2014). Thus, while focusing on geosocial media, the methodological contribution made in this article is by no means limited to these kinds of data sets.
Supplemental Material
sj-zip-1-epb-10.1177_2399808320987235 - Supplemental material for Emphasising spatial structure in geosocial media data using spatial amplifier filtering
Supplemental material, sj-zip-1-epb-10.1177_2399808320987235 for Emphasising spatial structure in geosocial media data using spatial amplifier filtering by René Westerholt in EPB: Urban Analytics and City Science
Footnotes
Acknowledgements
Thanks go to Franz-Benjamin Mocnik (University of Twente, the Netherlands) for his feedback, which has significantly improved the draft of this article. Furthermore, I would like to thank the anonymous reviewers, whose comments also contributed to the strengthening of the manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
