Abstract
In this paper, we contribute to the research on outlier handling, concentrating on economic statistical data, namely observations in housing statistics. Creating price-change indices requires both data cleaning and model optimization, and identifying outlying observations is crucial for both. By applying various techniques, such as distance-based and density-based outlier detection methods, we highlight the importance of dealing with outliers and discuss the difficulties one might encounter. Housing statistics is a special case, as there is a high correlation between the price and the area of the dwelling in question, but it still serves as a fine example of handling outliers in economic and transaction data. We show that identifying outliers is a nuanced task in which statisticians could benefit from using advanced algorithms such as the Local Outlier Factor (LOF) or Feature Bagging for Outlier Detection (FBOD).
Introduction
Identifying outliers has become a more nuanced area of statistics in the last 40 years: as more data becomes available, it has become essential to clean datasets by filtering out noisy, inadequate observations. According to [1], outliers are observations that differ so much from other observations that they raise the suspicion that a different mechanism generated them. Similarly, [2] defines outliers as observations that are inconsistent with other observations in the same data set. With their distance-based (“DB”) definition, Knorr and Ng describe outliers in the following way: “An object O in a dataset T is a DB (p, D)-outlier if at least fraction p of the objects in T lies greater than distance D from O.” [3, p. 393].
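The DB(p, D) definition quoted above can be illustrated with a minimal base-R sketch. The helper name `db_outliers`, the toy data and the parameter values are hypothetical, and the fraction is computed over the other objects in the dataset, which is an assumption about how the definition is read.

```r
# Hypothetical helper: flags DB(p, D)-outliers in a numeric matrix or data frame.
db_outliers <- function(x, p, D) {
  d <- as.matrix(dist(x))               # pairwise Euclidean distances
  n <- nrow(d)
  # fraction of the *other* objects lying farther than D from each object
  frac_far <- rowSums(d > D) / (n - 1)
  frac_far >= p
}

set.seed(1)
toy <- rbind(matrix(rnorm(100, sd = 0.5), ncol = 2),  # 50 clustered points
             c(10, 10))                               # one remote point
flags <- db_outliers(toy, p = 0.95, D = 5)
which(flags)                            # indices of the flagged objects
```

Both p and D have to be chosen in advance, which is the sensitivity discussed later in the paper.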
Outliers may be present in data sets due to various sources and mechanisms. In the classification of [4], two groups of outliers can be distinguished: they may either arise from mistakes made during the recording of the data, or they could occur naturally, due to the uncertainty coming from the measurement of the phenomena. Later, [5] provided a much more thorough classification of outliers. With its more nuanced view on anomalies, it becomes clear that analyzing outliers means more than just filtering out inadequate observations, as we can also identify systematic differences in the data.
In this article, outlying observations are considered primarily in cross-sectional or panel databases; however, there is a vast literature on handling outliers in the context of time series analysis. In time series, we can make a distinction between global outliers (which are conceptually different from the rest of the time series), contextual outliers (which are outlying observations in their respective time window), and even entire series can be considered outliers when compared to similar data [6]. Identifying these types of outliers, however, is not in the scope of the current paper.
Figure 1. Regression lines and the role of outliers.
Anomalies considered to be clear errors in the data (such as an error by an order of magnitude in one of the factors) are important: they show a strong difference from the rest of our observations, and thus may potentially harm our models. If we assume a simple linear regression between two (or more) factors, a couple of false observations may change the line fitted to the observations at hand. Such is the case in Fig. 1, a simple regression analysis based on R’s “cars” dataset: the x-axis depicts the speed of the car, while the y-axis is the braking distance needed for each car. The five outliers, in this case, can be identified easily, without any calculation; filtering these observations out would result in a much less steep regression line. This, however, opens a more philosophical discussion on justifying the exclusion of these observations, a question to which there is no clear answer, as it depends on the scope of the analysis.
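The effect described here can be reproduced in spirit with the built-in `cars` data. The five contaminating points below are hypothetical values chosen for illustration; they are not the observations shown in Fig. 1.

```r
# 'cars' ships with base R (package datasets): columns speed and dist.
fit_clean <- lm(dist ~ speed, data = cars)

# Hypothetical contamination: five fast cars with implausibly long stopping
# distances (illustrative values only).
bad <- data.frame(speed = c(22, 23, 24, 24, 25),
                  dist  = c(250, 260, 270, 280, 290))
fit_dirty <- lm(dist ~ speed, data = rbind(cars, bad))

# Compare the fitted slopes with and without the injected points
slopes <- c(clean = unname(coef(fit_clean)["speed"]),
            dirty = unname(coef(fit_dirty)["speed"]))
round(slopes, 2)
```

Because the injected points sit far above the trend at the high-speed end, they pull the fitted line upwards there, so the contaminated fit is steeper than the clean one.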
Due to their distinct properties, we expect the number of outliers (O) to be significantly lower than the number of inliers (I), that is, the non-outlying data:
$$O \ll I$$
In official statistics, with more and more data being available via the internet, data cleaning and filtering for anomalies are getting more and more attention. For example, in the case of advertising data on the internet, errors by orders of magnitude may arise due to human error when manually inputting the data. As noted before, when our database is small, identifying outliers poses no problem, but when dealing with a large number of observations, simple methods based on data visualization do not suffice. For this reason, mapping out our alternatives might prove beneficial.
The goal of our paper is to introduce newer, distance- and density-based methods that are not widely used in official statistics, and demonstrate their behavior when applied to housing statistics data. However, it is not in the scope of this paper to rank these. We believe that outlier detection is such a wide area of expertise that there cannot be a one-size-fits-all solution for all scientific areas. There have been attempts to rank various outlier-detection methods (for example [7, 8]); however, in an environment where we can only consider unlabeled observations (where we have no existing information on the properties of outliers), these evaluations are not straightforward. Furthermore, we would like to highlight and elaborate on some of the difficulties arising when applying these methods. Although some of the papers include recommendations on how to use these algorithms ([9, 10], for example), there are some issues that have not been addressed.
The paper continues in the following order: first, we will introduce various classifications and approaches of outlier-identifying algorithms. Then, we will introduce the more novel, distance- and density-based methods such as the Local Outlier Factor (or LOF) and its modifications (such as the Feature Bagging for Outlier Detection, or FBOD), emphasizing their advantages and disadvantages. Finally, we apply these methods to housing rent data used for official statistics in Hungary.
When dealing with outliers, there exist several simpler methods at our disposal, such as boxplots and bagplots, the Mahalanobis distance, etc. These methods, however, lack a more nuanced approach, thus either some outlying observations will not be identified, or we filter out too many of our observations. Using boxplots, for example, may result in the latter. These methods may also fail to reflect the importance of various stratifications in the dataset in question.
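As a minimal sketch of how two of these simpler tools behave, the following base-R example applies the univariate boxplot rule and the Mahalanobis distance to hypothetical area and price data. The data, the planted point and the chi-squared cut-off are assumptions made for illustration.

```r
set.seed(42)
# Hypothetical data: area (m^2) and a price that grows with area
area  <- runif(200, 30, 120)
price <- 2 * area + rnorm(200, sd = 10)
area[1]  <- 110                      # planted point: large but cheap,
price[1] <- 80                       # unusual only when both variables are seen
xy <- cbind(area, price)

# Univariate boxplot rule (1.5 * IQR fences), applied to each variable
box_flag <- area %in% boxplot.stats(area)$out |
            price %in% boxplot.stats(price)$out

# Squared Mahalanobis distance with a chi-squared cut-off (2 degrees of freedom)
md2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))
md_flag <- md2 > qchisq(0.975, df = 2)

c(boxplot_rule = box_flag[1], mahalanobis = md_flag[1])
```

The planted point lies inside the range of each variable taken separately, so the per-variable boxplot rule does not flag it, while the Mahalanobis distance, which accounts for the correlation, does.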
In the last 30 years, new techniques have been created for the purpose of outlier detection; these include various cluster-based or distance-based outlier-detection methods. As for classification, the relevant literature has different labels for these, depending on whether we put forward the filtering technique, the output of the algorithm, or the dimensions in which we consider outliers.
Based on [11], outlier detection methods can be divided into four groups. These are: statistical approaches, distance- or proximity-based approaches, profiling methods and model-based approaches.
Figure 2. Outliers in various dimensions, adopted from [13, p. 38].
Statistical approaches usually are based on the analysis of a stochastic distribution: in case a data point does not fit into these models, it is treated as an outlier. Examples of such methods are the boxplot and bagplot.
Many criticisms have been raised in the literature against simple outlier detection methods based on distances (such as the method proposed by [3]). For example, [12], as well as [13], raise the question of the sensitivity of choosing the distance parameter. First, it is hard to choose an accurate parameter a priori. Second, the more variables (or dimensions) we take into account during the outlier detection, the harder it is to pinpoint outliers, especially in the case of multi-dimensional analysis, where outliers are close to inliers in a lot of dimensions. Figure 2 demonstrates this phenomenon: in views 2 and 3, points “A” and “B” cannot be considered outliers. However, in view 1, “A” is clearly an outlier, and so is “B” in view 4.
Distance- or proximity-based approaches
Distance-based outlier detection methods use the distances between points inside an n-dimensional Euclidean space. The further one data point is from the rest of the observations, the more likely it is that the data point is an outlier. An example of distance-based methods is the KNN-classification method [12], where, after the clustering of the data, observations that are further away from the clusters created by the algorithm are omitted as outliers. However, deciding on the number of outliers, or deciding on a threshold for it, is not straightforward. One method could be to choose a cut-off value based on the top “n” or top “x percent” of the observations in the dataset with the highest distances (as suggested in [12]). As an alternative, [14] uses the sum of the distances to all “k” neighboring observations and takes the mean or median, thus weighting the distance-based evaluation.
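A simple distance-based score of this kind can be sketched in base R as follows. The helper `knn_dist_score`, the toy data, and the choices k = 5, n = 2 and 2 percent are hypothetical; the sketch uses the distance to the k-th nearest neighbor and is not a reimplementation of [12] or [14].

```r
# Hypothetical helper: distance to the k-th nearest neighbour as an outlier score
knn_dist_score <- function(x, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf                       # ignore each point's distance to itself
  unname(apply(d, 1, function(row) sort(row)[k]))
}

set.seed(7)
pts <- rbind(matrix(rnorm(200), ncol = 2),  # 100 points in one cloud
             c(8, 8), c(-9, 7))             # two remote points
score <- knn_dist_score(pts, k = 5)

# Variant 1: flag the top n scores; variant 2: flag the top x percent
top_n   <- order(score, decreasing = TRUE)[1:2]
top_pct <- which(score >= quantile(score, 0.98))
sort(top_n)
```

Both variants need a cut-off (n or x) that has to be justified by the analyst, which is the difficulty noted above.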
Profiling methods
In the case of the profiling methods, the database is divided into groups based on statistical methods and heuristics; inside these profiles, observations are similar in some metrics. Data points that differ significantly from these profiles are labeled as outliers. Examples of using profiling methods include ranking healthcare services using logistic regression [15], and profiling employment status [16], where they could assign the probability of the length of unemployment for each person registered for unemployment benefits. Based on the result of profiling methods, targeted policies could also be implemented [17].
To screen for rare cases, [18] used such a profiling approach when investigating bank fraud. Their goal was to be able to identify when someone becomes a victim of financial fraud and to determine if their credit card is stolen. To distinguish between these cases, they had to choose an unsupervised method: on the one hand, there was little information available as to which transaction was not carried out by the owner, and on the other hand, as theft techniques changed, they had to effectively identify new cases. The “Peer Group Analysis” used in [19] is also a local method based on a clustering process: it groups account holders according to payment habits (e.g., how much they usually pay with a credit card) and examines whether recent spending on a given account is significantly different from other account payments in the group.
Model-based approaches
Model-based approaches, as their name implies, are based on estimating the average phenomenon considered to be adequate by some predictive model: a significant deviation from the mean estimated by the model can be identified as an outlier. An example of this method is Cook’s distance [20], in which, after estimating an OLS-model, it is measured for each observation how much they affect the fit of the linear curve to the data point. The greater the influence of a given observation, the more likely it is to be an outlier. In the case of multidimensional outlier detection, the High Dimensional Influence Measure (HIM) [21] is also applicable, a method that is rather similar to Cook’s distance. However, a significant difference is that while Cook’s Distance is based on an examination of the coefficients of a linear estimate fitted to the data points, the HIM examines the influence of marginal correlations.
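For reference, the standard textbook form of Cook's distance for observation $i$ in an OLS model can be written as below. The notation is ours and is not taken from [20]: $\hat{y}_j$ is the fitted value for observation $j$, $\hat{y}_{j(i)}$ is the fitted value when the model is re-estimated without observation $i$, $p$ is the number of estimated coefficients, and $s^2$ is the estimated error variance.

```latex
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, s^2}
```

A large $D_i$ means that removing observation $i$ shifts the fitted values substantially, which is why highly influential observations are treated as outlier candidates.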
Besides the aforementioned types of outlier-detection, [22] distinguish their own “angle-based” approach, which is close to the statistical and profiling methods. This approach draws lines from each observation pointing to all the other points; outliers are being identified as points where the angles of the lines drawn strongly fluctuate. An example of the angle-based approach is the FastABOD (“Fast Angle-Based Outlier Detection”) [23], where the decision is made based on the variance of the angles.
Supervised and unsupervised machine learning methods have been examined for their efficiency, both on real-world and on artificial databases [24]. Supervised learning methods have been criticized due to their inability to detect new outliers [11], and their application in cases where the subject of classification is rare in the database [19]. Efficacy might come down to a lot of factors: [8] note that doing a comparative analysis on different methods is difficult due to the various dimensions, data structures of databases, and the choice and quality of parameters to define. Another difficulty is that there is no consensus on well-structured databases on which one could benchmark all outlier detection algorithms. For this reason, [7] created a database where unsupervised outlier detection methods could be analyzed and ranked. Our goal, however, is to show how various methods behave in the analysis of economic data; ranking these is not in the scope of this paper. Moving forward, due to the nature of outliers, we will only consider unsupervised learning methods.
Density-based outlier detection
Density-based spatial clustering of applications with noise – DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN for short) is an algorithm created by [10] and was developed to classify observations according to their spatial location in large databases; it is, in essence, a density-based, spatial clustering procedure. The idea behind the identification method is that “(…) for each point of a cluster the neighborhood of a given radius has to contain at least a minimum number of points, i.e. the density in the neighborhood has to exceed some threshold.” [10, p. 227]. As such, to apply the method, two parameters must be specified: the distance parameter “Eps” and the minimum number of observations in the neighborhoods (or clusters), “MinPts”. The former is intended to measure the size (or radius) of clusters, while the latter gives a threshold for density; the two together determine the baseline for the clustering procedure. It is important to note that increasing “MinPts” increases the chance of an observation not being classified into any cluster: the more points there must be within a distance “Eps” of a given point, the more the clustering is concentrated in dense areas. As a rule of thumb, the minimum number of points is usually set between 10 and 20 (however, this largely depends on the size of the database; a threshold lower than 100 observations is sufficient or necessary for finer estimates).
The problem with using DBSCAN is the non-trivial choice of the “Eps” parameter: marginal changes of this parameter, depending on the database, can significantly affect the outcome of the clustering procedure. One possible solution to this is the use of the KNN algorithm, which is also mentioned by the authors in [10]. According to the method, we calculate the distance of each point to its “k”-th nearest neighbor, then we sort and plot the results: the point where the resulting curve starts to grow exponentially gives the optimal value of “Eps”.
The method developed by [25] offers a more accurate solution for parameter selection. After the KNN algorithm, the local density function is calculated and then averaged for each point. With these values, a min-max normalization and binning are performed to filter out potential observation errors, and bins with a high number of outlying observations are not taken into account during further calculations. Using the remaining observations, the algorithm calculates an optimal “Eps” parameter for a given bin.
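The sorted k-distance heuristic can be sketched in base R as follows. The synthetic data and the value k = 4 are assumptions for illustration; reading off the bend of the curve remains a visual judgment.

```r
set.seed(3)
# Hypothetical data: two dense blobs plus scattered background noise
pts <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(200, mean = 4, sd = 0.3), ncol = 2),
             matrix(runif(40, -2, 6), ncol = 2))

k <- 4                                  # assumed value; plays the role of MinPts
d <- as.matrix(dist(pts))
diag(d) <- Inf                          # exclude self-distances
kdist <- sort(unname(apply(d, 1, function(row) sort(row)[k])))

# The sorted k-distances rise slowly for clustered points and sharply for
# noise; a candidate Eps is read off near the bend of this curve.
if (interactive()) {
  plot(kdist, type = "l",
       xlab = "points sorted by k-distance",
       ylab = paste0(k, "-NN distance"))
}
quantile(kdist, probs = c(0.5, 0.9, 0.99))
```

The quantiles give a rough numerical impression of where the curve starts to rise when no plot is drawn.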
LOF – Local Outlier Factor
The Local Outlier Factor Method (hereinafter: LOF) [9] builds on the foundations of the DBSCAN algorithm. Similarly, it tries to define outliers based on local density, i.e., within individual clusters. However, the authors highlight the comparison of the density of trained clusters: in DBSCAN, the minimum distance parameter and the minimum number of elements define a universal threshold for the clusters – where these clusters do not match, the observations there can be considered outliers. To identify density-based outliers, it is important to compare the density of clusters as well: for this, the area of the clusters must be treated dynamically. Therefore, for the LOF algorithm, only the “MinPts” parameter was left as an external parameter. During the procedure, we examine each data point and how far it is from the other points within their respective cluster trained with the same method.
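The formula for the local reachability density (lrd) of point “p”, with $N_{MinPts}(p)$ denoting the set of its MinPts nearest neighbors, is:
$$\mathrm{lrd}_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right)$$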
where the reachability distance is the distance from an adjacent observation at a distance of “MinPts” [9]. Note that the parameter is called “MinPts” only in the original article; in the versions implemented in R, it is referred to as the “k” parameter (based on the k-nearest neighborhood). Note also that the local reachability density can take on an infinite value if the “o” object has the same parameters as the “p” object, i.e., if the values of the examined parameters are duplicated.
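The Local Outlier Factor of a given point “p”, averaged over its neighborhood $N_{MinPts}(p)$, can be written as follows:
$$\mathrm{LOF}_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{\mathrm{lrd}_{MinPts}(o)}{\mathrm{lrd}_{MinPts}(p)}}{|N_{MinPts}(p)|}$$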
where “p” is the data point in question and “o” is an adjacent observation at the “k” (i.e.: fifth closest) distance from point “p” [9]. For LOF, similar to DBSCAN, the optimal choice of the “k” parameter is critical. Although [9] makes recommendations for specifying this parameter, their estimates were made for a much smaller database. The core idea is that the smaller the k parameter used in the calculation the more emphasized the “local” nature is, as it creates more and smaller clusters in this way. In contrast, with a larger parameter “k” we are moving more and more in the global direction – in essence, in the case of k
It is also important to mention the evaluation of LOF results. According to [9], values below 1 are certainly inliers, values around 1 are inliers, and values well above 1 are considered outliers; however, the authors do not recommend a universal threshold, as it is significantly influenced by the choice of the “k” parameter. For example, if the parameter is set too low, an observation in the middle of a dense point cloud may be marked as an outlier due to over-fitting.
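To make the computation concrete, the following base-R sketch implements the LOF score along the lines of the definitions in [9]. The helper `lof_scores`, the toy data and k = 10 are hypothetical; this is not the implementation in the “Rlof” package, and it includes ties at the k-distance in the neighborhood.

```r
# Hypothetical helper: LOF scores following the definitions in [9].
lof_scores <- function(x, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf                                   # exclude self-distances
  n <- nrow(d)
  kdist <- unname(apply(d, 1, function(row) sort(row)[k]))  # k-distance
  nbrs  <- lapply(seq_len(n), function(i) unname(which(d[i, ] <= kdist[i])))
  # reach-dist(p, o) = max(k-distance(o), d(p, o)); lrd is its inverse mean
  lrd <- sapply(seq_len(n), function(i) {
    o <- nbrs[[i]]
    1 / mean(pmax(kdist[o], d[i, o]))
  })
  # LOF(p) = mean over neighbours o of lrd(o) / lrd(p)
  sapply(seq_len(n), function(i) mean(lrd[nbrs[[i]]]) / lrd[i])
}

set.seed(11)
pts <- rbind(matrix(rnorm(200), ncol = 2),  # 100 points in one cloud
             c(7, 7))                       # one isolated point
lof <- lof_scores(pts, k = 10)
which.max(lof)                              # index of the most outlying point
summary(lof)
```

In this toy setting most scores lie close to 1 and the isolated point receives the largest score; where to place the cut-off between them is exactly the open question discussed above.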
As shown above, the LOF and DBSCAN algorithms can be used effectively to filter and identify outlying observations, but their common problem is parameterization and decisions about outliers, which can sometimes be highly subjective. Examining the LOF algorithm, there is no general rule of thumb for what proportion of observations should be considered as outliers, and it is difficult to specify optimal parameters for “MinPts” in the application. It should also be noted that although [9] performed testing on thousands of samples, as the sample size increased (and especially the dimensions expanded), the computation time could increase significantly. Thus, the literature has come up with several proposed amendments over the last 20 years that have made LOF estimation more automated, simpler, and faster.
One of the problems with LOF is that there is no simple threshold of the LOF score above which outliers could be identified. [26] modified the LOF algorithm so the output would be a probability measure, thus the name “Local Outlier Probabilities”. [27] implemented a faster calculation method for the LOF; using the Expectation Maximization approach, FastLOF could identify outliers 80 percent faster than the original algorithm.
In their implementation, [28] combined the KNN, DBSCAN, and LOF algorithms (KDBLOF) to effectively filter outlying observations in smart grid data. Following the example of [25], the KNN algorithm is used to achieve a more accurate clustering procedure using a modified DBSCAN algorithm. The DBSCAN filters out a set of potential outliers, that is, observations that have not been included in clusters or are located furthest from the central observation of each cluster. The Local Outlier Factor is then run only on this filtered set. This has two advantages: on the one hand, they were able to identify outliers with a smaller error during the tests, and on the other hand, because the scores had to be calculated on a smaller data set, the LOF algorithm ran significantly faster. A further advantage of using this smaller dataset is that, because of the reduced computational load, the KDBLOF algorithm is potentially more efficient for analyzing larger, multi-dimensional databases.
In their article, [11] highlighted the difficulty of outlier filtering, according to which it is difficult to filter out noise not only within individual variables but also between different dimensions in a multidimensional database. Classification procedures using the standard bagging technique can only be effective if the different variables considered are sufficiently different and accurate to prevent the same classification error from always occurring by repeating random sampling. It has previously been shown that bagging techniques are unable to improve the accuracy of standard K-Nearest Neighbors classifications because of the correlation between predicted values after multiple iterations [29].
The Feature Bagging for Outlier Detection (FBOD) algorithm [11] not only samples data points from a database according to a random distribution but also chooses the explanatory variables randomly at each run. With each iteration, the observations are evaluated according to the LOF algorithm, and a final outlier score is obtained by combining the LOF scores of the iterations. In their test, the analysis based on Receiver Operating Characteristic (ROC) curves showed that the simple LOF approach was significantly influenced by the noisy explanatory variable set, while the FBOD was able to perform significantly better in these cases. However, it is worth noting that with a good quality database, and with proper parameter tuning, the LOF method could still achieve better results.
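The idea of feature bagging can be sketched in base R as follows. This is a simplified illustration under stated assumptions: only the features are sampled (the rows are not), the per-iteration LOF scores are combined by averaging, and the helpers `lof_scores` and `fbod_scores`, the data and the parameter values are hypothetical. It is not the implementation in the “HighDimOut” package.

```r
# Compact LOF helper (hypothetical; follows the definitions in [9])
lof_scores <- function(x, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf
  kd <- unname(apply(d, 1, function(r) sort(r)[k]))
  nb <- lapply(seq_len(nrow(d)), function(i) unname(which(d[i, ] <= kd[i])))
  lrd <- sapply(seq_along(nb),
                function(i) 1 / mean(pmax(kd[nb[[i]]], d[i, nb[[i]]])))
  sapply(seq_along(nb), function(i) mean(lrd[nb[[i]]]) / lrd[i])
}

# Feature bagging: run LOF on random feature subsets and average the scores
fbod_scores <- function(x, iter, k) {
  x <- as.matrix(x)
  p <- ncol(x)
  stopifnot(p >= 2)
  sizes <- seq(ceiling(p / 2), p - 1)      # admissible subset sizes
  total <- numeric(nrow(x))
  for (t in seq_len(iter)) {
    size <- sizes[sample.int(length(sizes), 1)]
    cols <- sample.int(p, size)            # random feature subset
    total <- total + lof_scores(x[, cols, drop = FALSE], k)
  }
  total / iter
}

set.seed(5)
x <- matrix(rnorm(600), ncol = 6)          # 100 observations, 6 features
x[100, 1:2] <- c(8, 8)                     # unusual in two features only
s <- fbod_scores(x, iter = 20, k = 10)
which.max(s)
```

Averaging over subsets means that a point which is unusual in only a few features can still accumulate a high score, even when some subsets miss those features entirely.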
Tests on outlier detection – regional rental advertisement data
There are a lot of factors that need to be considered when looking for outliers in a data set. The first thing one needs to consider is the goal of anomaly detection: do we want to find all the non-viable data (including extreme values which actually might be relevant and can fit in our models), or just the data which were created by some other mechanism (such as a mistake when giving the parameters of the data)? The answer to this question might change with the objective of our research: if we want to create a price index for houses, which in essence takes the mean of the observations, then omitting luxurious houses might sound reasonable, as these are not necessarily relevant to the prices in general (in economics, Veblen goods tend to behave this way), but they can easily distort the mean of the distribution. As such, we might need to omit a bigger part of the data set than we would otherwise. If we want to analyze the housing situation for each decile in a country, and our database includes the whole population we want to analyze, then we ought to omit just those observations which are clearly errors made during inputting the data, for example, where the price is 100 times higher than what is predicted by our fitted model. Perhaps we are only interested in the outliers. For example, if we work at the Tax Authority, and would like to check suspicious transactions, we might concentrate on the lower and higher ends of the database.
The second thing we might consider comes from the nature of the data: there are various mechanisms behind the existence of outliers, as [5] showed. One might also consider these assumptions when approaching the definition of outliers in their respective case. A third one is the question of dimensionality, which coincides with the first point: in which (and how many) dimensions do we want to consider observations to be outliers? If, for example, one would like to perform preliminary data cleaning, then considering only a handful of the most important variables might suffice.
All the methods listed above have advantages and disadvantages: one might be easier to interpret, or easy to compute, while another might give better results, but at a cost of longer computation and difficulty with tuning parameters (as is the case with the Local Outlier Factor). We would like to demonstrate the implementation of a couple of the methods listed above in the R environment. In the code listings, we will include the libraries we have been working with. We will be making use of the tidyverse environment as well, as it makes a lot of the data transformation easier.
About the database
For the analysis, we use online rental advertisements for Pest region from the Hungarian website ingatlan.com. On ingatlan.com, both private individuals and real estate agencies can post advertisements on dwellings available for rent. Users can add a wide range of data to the advertisement, starting from the size of the dwelling to the number of toilets, and can indicate things such as the number of parking lots and whether there is an elevator in the building. Additionally, ingatlan.com has a method to identify duplicate advertisements, so we can avoid taking into account the same observation multiple times. According to the Hungarian Central Statistical Office: “The database fully contains the following variables: settlement (district in Budapest), county, type of property (apartment or family detached house), subtype of property (for example: prefabricated dwelling, condominium), advertiser (apartment or office ad), advertisement status, floor area, advertising price, type of heating, number of rooms, number of half rooms, condition of property, air conditioning, number of bathrooms. Additional, partially filled in information is also displayed: street, floor, district/neighborhood, plot area, terrace area, comfort level, renovation level, utility fees, rental price, and parking. The database contains the status of the given observation, i.e. whether the ad is currently available (active) or not (inactive, archived, deleted). In addition, advertisers had the opportunity to state the reason for the termination of the ad (rented out, ads elsewhere, entrusted a realtor, etc.) when removing the ad.” [30].
We chose to restrict our database to advertisements created in 2019, as it was a relatively stable year for rental prices, and time variation has a smaller effect on outliers than in the case of looking at a longer period. We chose to filter for rentals in the Pest region for the same reason, as we would like to get a database as homogeneous as possible. We ended up with a database consisting of 5,736 observations. Another benefit of using a smaller portion of a larger database is that it is easier to demonstrate the effects outliers have, while it is also easier to visualize inliers and outliers using scatterplots. For demonstration purposes, we will mostly use only the price and area variables. First, it is much easier to look at outliers in two dimensions, as we can look at different methods in a way that is easily displayable. As we have shown above, the more dimensions we add to outlier detection, the more abstract the outliers become. Looking at observations where outlying data points are easier to point out is better from a technical demo perspective. Second, and related to the data itself, the correlation between the area and the price of a house is really strong. Due to this, if an observation has a substantially bigger area for a building, while the price is at the lower end of the distribution, it can easily be considered an outlier.
Use of bagplot, LOF and FBOD on the rental advertisement database
We include a comparative graph for some of the outlier detection methods used for two-dimensional data. The methods we used are the bagplot, Local Outlier Factor, and Feature Bagging Outlier Detection. For each method demonstrated here, we used a 5% outlier filtering threshold (with the exception of the bagplot, which, by definition, identifies far more outliers). For calculating the bagplot, we can use the R package “aplpack” [31]. The code is shown below: we also set the optional approx.limit parameter to the size of the dataset. The result is a large ggplot object: accordingly, we can add labels and titles, assign themes, and combine it with other plots as well.
### Bagplot for duplicated data
library(aplpack)
Pest_bagplot <- compute.bagplot(Pest
For the Local Outlier Factor calculations, we used two different methods: first (what we labeled as “with duplicated data”), we included all the observations the data set had. However, for certain “k” parameters, our LOF score would result in being infinite. How can this happen? If we recall, “k” defines the parameter to which we calculate the density of neighborhoods. Say, for k
# adding random noise to coordinates Pest
for(h in 1:nrow(Pest)) noise=runif(n=1, min=0, max=0.0001) Pest
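The two ways of handling duplicated observations can be sketched in base R as follows. The toy data frame is hypothetical; the column names and the noise range follow the listings in this section. The background is that if a point has at least k exact duplicates, its k-distance is zero, the reachability distances in its neighborhood are zero, and the local reachability density becomes infinite.

```r
# Hypothetical toy data with repeated (area, price) pairs, as in listings
toy <- data.frame(area_size = c(50, 50, 50, 62, 75),
                  price_huf = c(150000, 150000, 150000, 180000, 210000))

# Option 1: add a tiny uniform jitter so that no two rows coincide
# (vectorised, so no explicit loop over the rows is needed)
set.seed(2019)
toy$area_with_noise01  <- toy$area_size + runif(nrow(toy), min = 0, max = 0.0001)
toy$price_with_noise01 <- toy$price_huf + runif(nrow(toy), min = 0, max = 0.0001)

# Option 2: drop exact duplicates before scoring
toy_distinct <- toy[!duplicated(toy[, c("area_size", "price_huf")]), ]

# 0 means that no duplicated rows remain after jittering
anyDuplicated(toy[, c("area_with_noise01", "price_with_noise01")])
nrow(toy_distinct)
```

The jitter is several orders of magnitude smaller than the measurement units, so it breaks ties without materially changing the distances between distinct observations.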
To calculate the LOF score, we used the package “Rlof” developed by [32]. The calculation takes all the variables included in the database and computes the LOF score for each data point with the desired “k” parameter. With our specific k parameter chosen, there seems to be some over-fitting, but it is just a couple of observations, negligible compared to the size of the dataset.
### calculating LOF for duplicated data
library(Rlof)
library(dplyr) # provides %>% and select()
Pest$lof_score <- Pest %>%
  select(area_with_noise01, price_with_noise01) %>%
  lof(k = 50)
Similarly, we did the LOF calculation for “non-duplicated data”, that is: we only calculated with distinct data, as we filtered out those observations which would cause the LOF score to end up being infinite in value. It is also important to note that due to the reduced number of observations, we had to change the “k” parameter to be lower. The results are very similar for both applications.
# calculating LOF for non-duplicate data
Pest_noduplicate <- Pest %>%
  select(price_huf, area_size) %>%
  distinct(price_huf, area_size, .keep_all = TRUE)
# calculate LOF; note that "k" has been changed, due to the reduced number of observations
Pest_noduplicate$lof_score <- lof(Pest_noduplicate, k = 30)
To calculate FBOD, we used the “HighDimOut” package developed by [33]. In this package, a number of other high-dimensional outlier detection methods are implemented, such as the Angle-based Outlier Detection (ABOD) [23] and the subspace outlier detection (SOD) [34]. For the FBOD to be reproducible, we need to set a seed first, as it is a bagging-based algorithm. In this specific case, we chose 100 iterations and a “k.nn” parameter (which is analogous to LOF’s “k” parameter) of 200. For the latter, there are methods from supervised learning to indicate the ideal number, but one can get relatively good results just by trial and error.
### Calculating FBOD for non-duplicate data
library(HighDimOut)
set.seed(123)
Pest_noduplicate$FBOD_score <- Pest_noduplicate %>%
  select(area_size, price_huf) %>%
  Func.FBOD(iter = 100, k.nn = 200) # k.nn corresponds to "k" in the LOF setting
Figure 3. Results of various outlier detection methods using two dimensions.
Figure 3 shows the results of different outlier detection methods. Not counting the bagplot, in all cases, we filtered out 5% of the data. One drawback of using the bagplot (or using multiple boxplots) is that a large part of the data is flagged as outliers. However, bagplots are useful to identify the core observations of the data, around which the rest of the data is centered. The FBOD – due to it being iterative – shows a somewhat similar pattern, and can also be used to identify the core observations, but with much less data loss. Finally, looking at the results of the LOF method, two things can be highlighted. First, the LOF implementations show more outlying data as inliers, and a linear connection between the two dimensions is much clearer to identify. Second, there are only slight differences between using non-duplicated or duplicated data (where in the latter case, we used the data points with random noise added to them). Again, we do not aim to propose a “best solution”, because we believe that no such thing exists. Our goal is to demonstrate some of the tools available for outlier detection. Overall, all methods managed to filter out the obvious outliers in the dataset.
Another way to look at outliers is to look at the residuals produced when constructing a model. The smaller the residuals are, the better the fitted model is, and observations that deviate from other data points due to their greater residuals can be considered outliers. In the following example, we first fit a model, and then we filter out outliers based on Cook’s Distance. That is, we filter out those observations which have the greatest effect on the coefficients of the model, thus deviating from other observations. The function “cooks.distance” is included in the “stats” package, which ships with base R, so it requires no installation. To simplify, we did not include the complete linear model specification in the syntax. First, we construct a model used for the construction of hedonic price indices, a technique commonly used in housing statistics. Then, we filter out the top 5 percent of the distances returned by the cooks.distance function. Finally, we fit the same model to the cleaned data.
# Get the Cook's distance of a model
model <- lm(log_pricehuf ~ [model_specification], data = Pest_model)
# add the distances to each observation
Pest_model$cookssd <- cooks.distance(model)
# look at the distribution of the values with quantiles
quantile(Pest_model$cookssd, probs = c(0.95, 0.99), na.rm = TRUE)
# based on that, we can filter out either the top 5% or the top 1%; here, the top 5%
Cooks_model_data <- Pest_model %>%
  mutate(cooksd_5 = case_when(cookssd >= 0.0003575904 ~ 1,
                              cookssd < 0.0003575904 ~ 0)) %>%
  filter(cooksd_5 == 0)
# Rerun the regression
Cooks_model <- lm(log_pricehuf ~ [model_specification], data = Cooks_model_data)
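The same workflow can be tried on simulated data. The sketch below is only an illustration: the data frame `toy`, its variables, and the five planted errors are hypothetical stand-ins for the housing records, and the model is reduced to a single predictor. It also checks the values returned by `cooks.distance` against the closed-form expression based on residuals and leverages.

```r
# Hypothetical simulated data standing in for the housing records
set.seed(42)
n <- 500
area_size <- runif(n, 30, 150)                        # dwelling area
log_price <- 15 + 0.012 * area_size + rnorm(n, sd = 0.2)
log_price[1:5] <- log_price[1:5] + 3                  # five planted recording errors
toy <- data.frame(area_size, log_price)

# Fit the (simplified) model and store Cook's distance per observation
fit <- lm(log_price ~ area_size, data = toy)
toy$cooksd <- cooks.distance(fit)

# Closed form: D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2)
e  <- residuals(fit)                  # raw residuals
h  <- hatvalues(fit)                  # leverages
p  <- length(coef(fit))               # number of coefficients
s2 <- sum(e^2) / df.residual(fit)     # residual variance estimate
manual <- e^2 * h / (p * s2 * (1 - h)^2)
same_values <- isTRUE(all.equal(unname(manual), unname(toy$cooksd)))

# Drop the top 5 percent of the distances and refit the same model
threshold <- quantile(toy$cooksd, probs = 0.95)
toy_clean <- toy[toy$cooksd < threshold, ]
refit <- lm(log_price ~ area_size, data = toy_clean)
```

Comparing `coef(fit)` with `coef(refit)` shows how much the planted errors pulled the estimates.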
We repeat this process using the LOF method introduced above.
# Outliers using LOF with duplicated data:
quantile(Pest$lof_score_100, probs = c(0.95, 0.975, 0.99))
lof_model_data_100 <- Pest %>%
  mutate(lof_outlier = case_when(lof_score_100 >= 1.451301 ~ 1,
                                 lof_score_100 < 1.451301 ~ 0)) %>%
  filter(lof_outlier == 0)
# additionally, you can select only those variables for which you want to run your regression
# use only complete cases; if viable, data imputation can also be used now that the outliers are filtered out
lof_model_data_100 <- lof_model_data_100[complete.cases(lof_model_data_100), ]
# we use the same model specification as before:
lof_model_100 <- lm(log_pricehuf ~ [model_specification], data = lof_model_data_100)
Comparison of the distribution of residuals in models using Cook’s Distance and LOF for outlier detection.
To get the plot for the residuals, we can use the package “lindia”, which allows us to export the residuals from the model objects and create a histogram. We can use the package “gridExtra” to include all the residual plots in one plot.
# get distribution plots on the residuals
library(lindia)
library(gridExtra)
res_cooks <- gg_reshist(Cooks_model, bins = 100)
res_lof <- gg_reshist(lof_model_100, bins = 100)
# arrange both histograms side by side in one plot
grid.arrange(res_cooks, res_lof, ncol = 2)
Comparison of residuals vs. fitted values in models using Cook’s Distance and LOF for outlier detection.
Comparison of outliers in two dimensions.
As Fig. 4 shows, the two methods produce similar outputs. However, some important points should be addressed. First, Cook’s Distance-based outlier detection was applied to a model specified in advance. That is, we factored all the information we had available into the decision-making on distances. In the case of the LOF method, we only factored in the dependent variable (price) and the area of the dwelling (which is also included in the model as an independent variable). This might not sound like much, but in the case of housing statistics, the area of the building is the most important factor when determining the price. This means that using the area alone is enough to obtain similar results in the model residuals.
Both methods have strengths and weaknesses in application. Cook’s Distance is faster to compute and relatively easy to interpret. However, in order to use it, we need to construct a model. For the Local Outlier Factor, we do not need to construct a viable model; we only need to choose the variables we want to control for. However, LOF needs much more time and processing power to compute (especially with larger databases), and choosing the ideal “k” for the calculations is not straightforward (although we can use some of the methods introduced above besides trial and error). Interpreting the LOF scores is not trivial either, and if we have many observations with the same values, we might get a LOF score that is infinite. That is why we need to either omit duplicate observations or use the technical solution of adding marginal noise to the variables in order to make them distinct. We also had a very strong predictor available, which made outlier detection easier in both cases.
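The noise-based workaround for duplicates can be sketched in base R as follows. The data frame `toy`, its values, and the noise scale of 0.01 percent of each column's standard deviation are assumptions made for illustration only; they are not the settings used in the analysis.

```r
# Hypothetical records: the first three rows are exact duplicates
set.seed(1)
toy <- data.frame(
  area_size = c(50, 50, 50, 62, 75),
  price_huf = c(20e6, 20e6, 20e6, 27e6, 31e6)
)

# Mark every row that has at least one identical twin
dup <- duplicated(toy) | duplicated(toy, fromLast = TRUE)

# Noise scale per column, computed on the full column so that it is not zero
noise_sd <- 1e-4 * sapply(toy, sd)

# Add marginal noise to the duplicated rows only
toy_noisy <- toy
for (col in names(toy)) {
  toy_noisy[[col]][dup] <- toy[[col]][dup] +
    rnorm(sum(dup), mean = 0, sd = noise_sd[[col]])
}

n_dup_before <- sum(dup)
n_dup_after  <- anyDuplicated(toy_noisy)   # 0 means no duplicated rows remain
```

The noise is kept several orders of magnitude below the spread of the data, so the coordinates become distinct without visibly moving any observation.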
Looking at fitted values plotted against residuals, it is clear that the starting model (using all data available) has some bias, which is indicated by the (marginally) positive regression slope. After controlling for outliers, in both cases the regression line becomes flat at zero. The difference between the two methods is that the residuals in the case of Cook’s Distance are clearly “cut” and are smaller in range, while in the case of LOF, one can still see some of the noisy observations. This is, again, due to the fact that with the former method, we controlled for many more variables. One could argue, however, that using the LOF algorithm results in a more natural dataset where some noise is still included. With the aim of official statistics being to observe the total population, this method might be a better choice for statistical offices.
One of the most difficult things about outlier identification is defining the threshold that separates inliers from outliers. In other words, we need to decide where we draw the line for outliers. Some methods, like the boxplot, do this by definition. However, for the scoring methods, like LOF, ODIN, or FBOD, this line becomes blurred, especially since we cannot make universal rules for the range of acceptance. In the case of the Local Outlier Factor, [9] noted that there is no universally applicable rule, except that the closer the LOF value of an observation is to 1, the more likely it is to be an inlier. This decision, however, is highly subjective, and because the method is so abstract, it is really hard to differentiate between a value of, say, 1.245 and 1.255.
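To make the reading of LOF values more concrete, the following is a minimal from-scratch sketch of the score in base R, applied to simulated points. It is not the implementation used in this paper; the function name `lof_scores`, the simulated data, and the choice of k = 5 are assumptions made for illustration, and the full pairwise distance matrix makes it unsuitable for large databases.

```r
# Minimal LOF sketch: X is a numeric matrix, k the neighbourhood size
lof_scores <- function(X, k) {
  D <- as.matrix(dist(X))
  diag(D) <- Inf                         # a point is not its own neighbour
  n <- nrow(D)
  # distance to the k-th nearest neighbour of every point
  kdist <- apply(D, 1, function(d) sort(d)[k])
  # neighbourhood: all points within the k-distance
  nbrs <- lapply(seq_len(n), function(i) which(D[i, ] <= kdist[i]))
  # local reachability density: inverse of the mean reachability distance
  lrd <- sapply(seq_len(n), function(i) {
    o <- nbrs[[i]]
    1 / mean(pmax(kdist[o], D[i, o]))
  })
  # LOF: mean density of the neighbours relative to the point's own density
  sapply(seq_len(n), function(i) mean(lrd[nbrs[[i]]]) / lrd[i])
}

# Hypothetical data: 50 clustered points plus one isolated point (row 51)
set.seed(3)
X <- rbind(matrix(rnorm(100), ncol = 2), c(10, 10))
scores <- lof_scores(X, k = 5)
round(summary(scores[1:50]), 2)   # scores of the clustered points
scores[51]                        # score of the isolated point
```

In this setup the isolated point receives by far the largest score, while the scores inside the cluster vary gradually, which is why a fixed cut-off between nearby values is hard to justify.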
Other aforementioned outlier detection methods, such as the Local Outlier Probability (LoOP) [26], try to address this by assigning each observation a probability of being an outlier. However, in essence, we face the same problem: above which probability do we consider an observation to be an outlier? Again, this can depend on many things, from the nature of the data to the scope of the research. Nevertheless, simple rules of thumb exist for determining outlier thresholds.
To demonstrate, we used Cook’s Distance and the Local Outlier Factor and filtered out 5 and 1 percent of the data. We used a simple method to define outliers: since both techniques assign a distance-like score to each observation, we can look at the top x percent of the scores and mark the corresponding observations as outliers. This method is widely used; however, some refinements can still be made. For example, the methodology of the Hungarian National Bank’s housing price index uses three different outlier detection methods; if an observation is marked as an outlier by at least two of the methods used, it is omitted from the model [35].
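A two-out-of-three rule of this kind is straightforward to express once each method has produced a flag. The sketch below uses made-up flags; the column names are purely illustrative and do not refer to the three methods actually used in [35].

```r
# Hypothetical outlier flags (TRUE = outlier) from three detection methods
flags <- data.frame(
  method_a = c(TRUE, FALSE, TRUE,  FALSE, FALSE),
  method_b = c(TRUE, TRUE,  FALSE, FALSE, FALSE),
  method_c = c(TRUE, TRUE,  FALSE, TRUE,  FALSE)
)

# Number of methods that flag each observation
votes <- rowSums(flags)

# Omit an observation if at least two methods agree
omit <- votes >= 2
kept_rows <- which(!omit)
```

Here the first two observations are omitted, while observations flagged by only one method are kept.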
Figure 6 depicts the results from different methods and different levels of outliers. As mentioned before, we determined the threshold by identifying the value of Cook’s Distance or the LOF score at the 95th and 99th percentile. Everything that exceeded these values was labeled as an outlier. At the one percent level, the outliers were obvious. Looking at the second graph, however, it would seem that some of the more inlying data points are being flagged as outliers. This comes from the nature of the technique: Cook’s Distance calculates the distance using a whole model. As a result, the observations flagged might not be outliers with regard to the area of the building, but they might be further away from other points in the other dimensions we controlled for.
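The percentile rule described above can be wrapped in a small helper that works for any score, whether Cook’s Distance or LOF. This is a sketch under stated assumptions: the function name `flag_top_share` and the simulated scores are hypothetical and only illustrate the mechanics of the rule.

```r
# Flag the observations whose score lies in the top `share` of the distribution
flag_top_share <- function(score, share) {
  score >= quantile(score, probs = 1 - share, na.rm = TRUE)
}

# Hypothetical scores: 995 ordinary values and five extreme ones at the end
set.seed(7)
score <- c(rexp(995), 50, 60, 70, 80, 90)

out_5 <- flag_top_share(score, 0.05)   # threshold at the 95th percentile
out_1 <- flag_top_share(score, 0.01)   # threshold at the 99th percentile

n_out_5 <- sum(out_5)
n_out_1 <- sum(out_1)
```

Every observation flagged at the 1 percent level is also flagged at the 5 percent level, so moving the threshold only changes how far into the ordinary observations the rule reaches.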
Summary
In this paper, we tried to further extend research on outlier detection by applying density-based and distance-based methods to data used in official statistics. We showed that these methods can be more meticulous and more precise when dealing with large databases, even in cases where only two variables were used. In a multi-variable environment, these novel methods can identify outliers more precisely, without omitting too many observations from the database, as happens in the case of boxplots and bagplots. Recently, there has been research on outlier detection with the use of M-estimation in official statistics [36]; however, to our knowledge, the Local Outlier Factor and other density-based outlier detection methods have so far been outside the scope of official statistical use. The analysis of extreme values could test whether these outlying observations are random or share a common mechanism that creates them, which would give a deeper understanding of the data collection process and the underlying market in question.
The methods introduced above, however, come with new challenges. These challenges include the choice of the parameter “k” or “MinPts”, finding the threshold for LOF values, duplicated data, and the complexity of the algorithm, which can lead to long computation times, especially with large databases. In the past twenty years, however, variations of as well as improvements to the original LOF algorithm have been introduced, which ease these problems. We also tried to highlight these potential difficulties during the application and made some recommendations on choosing the thresholds and on dealing with duplicated data by adding randomly assigned noise to the coordinates. Finally, we showed that with the appropriate parameters, we can use a naive LOF (using only two variables) for data cleaning just as effectively as Cook’s distance, with the difference being that Cook’s distance required a model constructed in advance. As such, we do not need to make as many assumptions to arrive at approximately the same results when using the Local Outlier Factor method.
This paper concentrated on housing data, which is special in the sense that the area of a dwelling is a strong determinant of the price. In future research, it would be worth exploring the use of density-based methods in other areas of official statistics and testing the identification of extreme values. The author hopes that this article will inspire a deeper analysis of outliers in official statistics.
