Constructing outlier-free histograms with variable bin-width based on distance minimization

Abstract

We propose a new method of constructing a variable bin width histogram that can accommodate an unbalanced distribution of samples while retaining, as a whole, the good aspects of both equal-width (EW) and equal-area (EA) histograms, which are widely used for data visualization and analysis. We formulate this as an optimal change point detection problem in which the bin boundaries are determined by minimizing the sum of the absolute errors or the squared errors in each bin. The former, Distance Minimization (DM), is new; the latter, Variance Minimization (VM), is considered the state of the art. The constructed histograms can effectively be used to detect and visualize hidden outliers/anomalies by applying the interquartile range method in each bin. The final histograms are obtained by adjusting bin boundaries and heights accordingly after removing the detected outliers/anomalies. We further propose a method to annotate the constructed bins, when annotation data are given for each sample as a set of nominal variables, using the $z$-score with respect to their distribution within each bin. We applied our method to real vinyl greenhouse datasets and to two different sets of three synthetic datasets, and confirmed that both DM and VM methods work as intended, that both can represent the sample distribution with a smaller number of bins than EW and EA methods, that the interquartile range method can detect anomalies as well as outliers, and that the terms selected for annotation are interpretable and reasonable. EW and EA methods have contrasting properties. DM and VM methods lie in between, but the former is closer to EA method and the latter to EW method. DM method runs substantially faster than VM method and performs slightly better than VM method in outlier detection and annotation tasks.

Keywords

Histogram, variable bin width, error minimization, change point detection, outlier detection

1. Introduction

Histograms have been widely used to visualize and analyze the distribution of data. They can highlight the properties of data and help improve their quality. They can also be used to pre-process data that are to be fed to other systems, e.g., machine learning systems. Histograms, which divide the data into separate bins (also called buckets), make it natural to detect and visualize outliers within each bin as well as the data themselves, and to characterize each bin by annotating it using other nominal features of the data. We focus on these aspects and propose a new method to construct outlier/anomaly-free histograms with variable bin width and to visualize the constructed histogram together with the detected outliers/anomalies.

The most popular standard histogram construction method uses a fixed bin width. It is important that this bin width is set properly and various methods and indicators have been proposed for this purpose [2, 16, 18, 15, 5]. Whether the equal width histogram is appropriate or not depends heavily on the sample distribution. Data with a smooth distribution is easy and the use of an equal width histogram does not cause a problem. However, it is not appropriate to apply it to non-smooth unbalanced data in which some parts of data are very dense and its changes are abrupt from the neighbors. A simple way to cope with this shortcoming is to construct a histogram with equal area, i.e., equal sample size [2, 16]. However, we may miss some bins with zero width when there exist a large number of samples of identical values. In short, the equal width (EW) histogram works well for many real world datasets and the equal-area (EA) histogram is an alternative remedy when it does not work well. It is natural to conjecture that histograms that have variable bin width capability which is more robust than EA histograms can accommodate unbalanced datasets and perform better. Fisher [4] was the first to propose the concept of such histograms in which the bin boundaries are determined by minimizing the variance within each bin, which is referred to as Variance Minimization (VM) method in this paper. This idea is followed and investigated by [8, 9] and is now considered as the state-of-the-art.

Minimizing the variance is equivalent to minimizing the squared error, which is the L2 metric. This choice has been taken as natural and no other metrics have been discussed. We note that the L1 metric is more robust to noise and outliers/anomalies, and conjecture that the L1 metric may work at least as well as, if not better than, the L2 metric. We thus propose to use the absolute error, which is the L1 metric and distance-based. This method is referred to as Distance Minimization (DM) method in this paper. We further propose to search for the best bin boundaries with a change point detection method, with which we have abundant experience [14] and which can be applied to both measures.

It is desirable that the constructed histograms still possess the good properties of both equal width and equal area histograms. By good properties we mean that the widths of the bins are almost the same and that the numbers of samples in the bins are almost the same across different bins. In fact, a bin with too small a width or too small a sample size is usually not worth analyzing. We investigate how DM method differs from VM method and how these two are related to EW and EA methods, and demonstrate that these two properties are simultaneously achieved by minimizing the absolute error or the squared error of the samples in each bin with respect to its median or mean. This is based on the idea of combining histogram construction and clustering, which is the most important contribution of this work. In other words, our method constructs a histogram with variable bin width that 1) has a large number of narrow bins where the samples are densely distributed and change drastically from their neighbors, and a small number of wide bins where the samples are sparsely distributed, while avoiding bins with too small a width or too few samples, and 2) behaves similarly to both equal width and equal area histograms where the changes are moderate.

The change point detection algorithm we use is applied to the samples arranged in ascending order of their numeric values. The algorithm fits a step function consisting of $K$ steps by minimizing the absolute or squared error criterion. A histogram with $K$ variable width bins is constructed from the change points found. With respect to error minimization, what the change point detection does is equivalent to one-dimensional $K$-median clustering for the absolute error criterion and one-dimensional $K$-means clustering for the squared error criterion. It is theoretically ensured that both measures construct an equal width and equal area histogram when the sample distribution is completely uniform and homogeneous.

In our previous paper [6], we proposed the basic idea of variable bin-width histogram construction with error minimization and visually compared the constructed histograms to EW histograms. In this paper, we further address the outlier/anomaly detection problem because outliers/anomalies hidden in the dataset can substantially distort the constructed histograms. Being unable to remove outliers/anomalies during the histogram construction process is one major limitation of the existing methods. Outlier/anomaly detection capability should be integrated into the histogram construction process. Evidently, we need some measure (score) to quantify their outlierness and anomality. To detect them, we apply the interquartile range method, which does not depend on the sample distribution, to the data samples within each bin of the histogram of this measure. When the samples are univariate, which is the case in this paper, the data themselves are used as the score. We then reconstruct the histogram after removing the detected outliers/anomalies. Thus, the final histogram is outlier/anomaly-free. One notable feature is that the detected outliers/anomalies and the reconstructed histogram are visualized simultaneously.

Here, outliers and anomalies are meant in general to be data points that lie outside the main body or group of the dataset that they are part of. In this paper, we use outliers to indicate the data points that lie in the tails of a sampled distribution, e.g., data points that lie more than 3$\sigma$ from the mean if we assume that the distribution is normal, and anomalies to indicate the data points that we know for sure are anomalous.¹

¹ We occasionally omit anomaly to indicate both for the sake of simplicity, e.g., outlier-free instead of outlier/anomaly-free.

Another contribution of the previous paper is a method to annotate each of the generated bins from a set of nominal data that may come with the numeric samples. For example, consider the case where each sample comes with a time stamp, temperature, and pressure that can characterize the sample. We can define a set of ranges/categories for each of these data and convert them to corresponding nominal data. We then compute the $z$-score based on their distribution in each bin, and use the top-ranked range/category as the term to characterize each bin. In this paper, we extend the analysis and report more detailed results.

We conducted experimental studies to test the validity of the proposed method using real datasets of humidity deficit (HD) and carbon dioxide concentration (CO2), both collected from a vinyl greenhouse in operation, as well as two sets of synthetic datasets, with each dataset generated from one of three sample distributions: uniform, exponential and normal. Each dataset in the first set uses a single distribution without anomalous data, and each dataset in the second set uses two well separated distributions of the same kind with an anomalous sample inserted in the middle. The latter is used to test both the anomaly and outlier detection capability. We show that our methods can construct histograms that simultaneously achieve the two good properties of equal width and equal area. We confirm that our proposed methods can construct histograms with appropriately varying bin widths that represent the sample distribution well with a smaller number of bins $K$, and can detect both the anomalies and the outliers more reliably and accurately than the basic EW and EA methods. We found that the histograms constructed with the absolute error criterion (DM method) are closer to the ones constructed by EA method, and the histograms constructed with the squared error criterion (VM method) are closer to the ones constructed by EW method. Both DM and VM methods have the same computational complexity in theory, but in practice DM method runs substantially faster. Further, the annotation experiment applied to the HD dataset shows that the terms selected for annotation by both DM and VM methods are interpretable and reasonable. Overall, DM method is comparable with VM method, but the former slightly outperforms the latter in outlier detection and annotation tasks.

Last, to make the difference from the conference paper clear, we summarize the major additions to this paper: 1) expansion of our framework to include outlier detection capability and to construct outlier-free histograms, 2) addition of datasets: another real-world dataset CO2, and three synthetic datasets, each with and without anomaly, 3) addition of the details of EA method, and reformulation of the proposed DM method and the state-of-the-art VM method as two instances of a minimization problem of a general distortion function, 4) expansion of the quantitative evaluations by adding entropy analysis for the equal-sample-size property, computational efficiency with complexity analysis, homogeneity property analysis, and in-depth comparative studies of EW, EA, VM, and DM methods, 5) experimental evaluations of outlier/anomaly detection using synthetic datasets, and 6) substantial expansion of the related work by summarizing it from four perspectives.

The paper is organized as follows. We describe related work in Section 2 and the conventional methods in Section 3. In Section 4, we give our problem setting and the proposed method for 1) constructing a variable bin width histogram, 2) detecting outliers within each bin, and 3) selecting terms for annotation. We report and discuss experimental results for both real world data and synthetic data in Section 5, and conclude the paper and address future work in Section 6.

2. Related work

We summarize related work from four different aspects: histogram construction, clustering, change point detection, and outlier detection.

2.1 Histogram construction

Histograms measure and visualize the density of numeric variables that take continuous values. The bin widths are often set equal, the horizontal axis represents the range of sample values (bin), and the vertical axis represents the density of samples included in that range (bin). Indices that determine the number of bins or the width of bins, that is, the number of intervals into which all samples are divided, include square root selection,² Sturges’s formula [18], Scott’s normal reference rule [15], Freedman-Diaconis’ choice [5] and more. In many cases, samples are assumed to follow the normal distribution. Thus, care must be taken for samples that heavily deviate from the normal distribution.

² Microsoft EXCEL uses this strategy.

Methods to construct histograms with variable bin width have been proposed [4, 16, 2], which bring the advantages of improving the robustness for noises and the accuracy of density estimation by assigning wider bins for low-density ranges and narrower bins for high-density ranges. Scott’s equal area histogram [16] adjusts the width of each bin so that each bin contains an equal number of samples. Denby and Mallows’ diagonally-cut histogram [2] is an intermediate histogram that generalizes an equal width histogram and an equal area histogram. It is obtained by dividing the samples by diagonally cutting the cumulative distribution of data. These histograms make it easier to identify peaks because the data is not over-smoothed, but at the same time, they are likely to include bins with too small widths and/or too few samples. On the contrary, the proposed method would rarely generate histograms with such extreme bins because it is based on the concept of clustering obtained by error minimization, and we can expect to have good properties of both equal width and equal area.

A series of works have been presented by Poosala and Ioannidis et al. who argued and emphasized that it is important to construct a proper histogram when estimating the size of query results in the query optimization problem in RDBS (Relational Database System). They pointed out the problem of classic histogram construction methods and defined the V-optimal histogram that minimizes the variance of the data in each bin in the histogram [8, 9]. Further, in their succeeding work [13, 12], they compared the V-optimal histogram and the histogram constructed by the existing method, and discussed the advantage and disadvantage of each histogram for artificial data with various distributions including a uniform distribution. Irpino and Romano noted that the V-optimal histogram method based on variance minimization of data in each bin is equivalent to the grouping method proposed by Fisher [4], and compared the Fisher’s algorithm with an algorithm for estimating the density distribution by piece-wise linear interpolation [10]. As mentioned in Section 1, our VM method is equivalent to the V-optimal histogram and Fisher methods in the spirit of the squared error minimization, and is implemented by using a different boundary search algorithm, i.e., change point detection algorithm. Thus, our proposed method provides a general framework that includes these two methods in that an optimal histogram can be constructed by minimizing either one of the two kinds of errors, the absolute or the squared.

2.2 Clustering

Clustering is the task of grouping a set of samples in such a way that samples in the same cluster are more similar to each other than to those in other clusters. This is a rich research field and has a long history. Many different clustering algorithms have been proposed and tested for their performance. These include hierarchical clustering, centroid-based clustering, distribution-based clustering, density-based clustering, subspace clustering, and more. Constructing a histogram is considered to be a special instance of clustering samples in one-dimensional consecutive bins. Our method, which is described in Section 4, uses the absolute or the squared error of samples in each bin for its median or mean as an objective function to be minimized, and the search is made of bin’s boundaries using the method developed for change point detection [14].

We notice that our method performs centroid-based clustering in one dimension, and is equivalent to $K$-median clustering in the case of the absolute error criterion and $K$-means clustering in the case of the squared error criterion. $K$-means/median clustering starts with $K$ initial clusters, computes a centroid (mean/median) in each cluster, reassigns each sample to the closest centroid, recomputes the centroid of each cluster, and repeats this until convergence. Our method starts with $K-1$ initial change points (bin boundaries), computes the respective error in each bin and sums them over all bins, searches for a better position for one boundary at a time while fixing the rest, recomputes the error in each updated bin and the sum over all bins, and repeats this until no bin boundary moves. An advantage of our method is that both the absolute and the squared error criteria lead to bins with equal width and equal area when the samples are completely uniform and homogeneous, and to variable-width bins in regions where the samples are non-uniform (dense, sparse, or changing abruptly).
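The boundary search described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `bin_error` and `refine_boundaries`, the equal-size initialization, and the `max_sweeps` cap are assumptions made for this illustration, and each bin error is recomputed from scratch for clarity rather than speed. Change points are stored as 1-based sample indices of the sorted data, with the first entry fixed at 1 and the last at $T+1$.

```python
from statistics import mean, median

def bin_error(seg, criterion):
    """Summed error of one bin: absolute error about the median (DM)
    or squared error about the mean (VM)."""
    if criterion == "DM":
        c = median(seg)
        return sum(abs(v - c) for v in seg)
    c = mean(seg)
    return sum((v - c) ** 2 for v in seg)

def refine_boundaries(samples, K, criterion="DM", max_sweeps=100):
    """Move one change point at a time, keeping the others fixed,
    until no change point moves (requires len(samples) >= K)."""
    x = sorted(samples)
    T = len(x)
    # Assumed initialization: bins of roughly equal sample size.
    G = [1 + (k * T) // K for k in range(K)] + [T + 1]
    for _ in range(max_sweeps):
        moved = False
        for k in range(1, K):              # interior change points only
            lo, hi = G[k - 1], G[k + 1]

            def local(g):
                # Error of the two bins adjacent to a boundary placed at g.
                return (bin_error(x[lo - 1:g - 1], criterion)
                        + bin_error(x[g - 1:hi - 1], criterion))

            best_g, best_err = G[k], local(G[k])
            for g in range(lo + 1, hi):    # both neighbours stay non-empty
                err = local(g)
                if err < best_err - 1e-12:
                    best_g, best_err = g, err
            if best_g != G[k]:
                G[k], moved = best_g, True
        if not moved:
            break
    return G
```

For example, `refine_boundaries([1, 2, 3, 4, 5, 100], 2)` moves the single change point from its initial position 4 to 6, isolating the value 100 in its own bin. Because only one boundary moves at a time, a search of this kind may stop at a local minimum for some initializations.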

2.3 Change point detection

We view histogram construction as a problem of detecting changes in data. There are abundant studies on detecting changes in social media data [22, 21, 3, 1, 11, 19, 20]. In the spirit, our problem setting is related to the work by Kleinberg [11] and Swan and Allan [20]. They noted a huge volume of the stream data, tried to organize it and extract structures behind it. This is done in a retrospective framework, i.e., assuming that there is a flood of abundant data already and there is a strong need to understand it. Kleinberg’s work is motivated by the fact that the appearance of a topic in a document stream is signaled by a burst of activity and identifying its nested structure manifests itself as a summarization of the activities over a period of time, making it possible to analyze the underlying content much easier. He used a hidden Markov model in which bursts appear naturally as state transitions, and successfully identified the hierarchical structure of e-mail messages. Swan and Allan’s work is motivated by the need to organize a huge amount of information in an efficient way. They used a statistical model of feature occurrence over time based on hypotheses testing and successfully generated clusters of named entities and noun phrases that capture the information corresponding to major topics in the corpus and designed a way to nicely display the summary on the screen (Overview Timelines).

Our framework is different from these studies. We arrange samples in ascending order of their numeric values and regard this ordering as if they are time-stamped documents in stream data. We then, by regarding these numeric values as the activity levels, detect a set of intervals each of which contains activities of similar levels. For this, we use the change point detection method [14] which was originally developed to efficiently detect bursts from the observed information diffusion data and found to perform better than Kleinberg’s method. The most noticeable difference is that we summarize our detection results as a histogram with variable bin width, rather than timelines of bursty activities.

The histogram itself can be a useful means to detect changes in a data stream. Sebastião and Gama [17] proposed a method to monitor and compare histogram distributions from two different time windows in an online setting. The method uses a two-layer structure that is well suited to constructing a histogram from a high-speed data stream (Partition Incremental Discretization algorithm), and it successfully detected changes in the distribution embedded in the streaming data, using the KL divergence to measure the difference.

2.4 Outlier detection

Since a histogram represents the distribution of data, it can also be used to detect outliers. Goldstein and Dengel [7] proposed a method called HBOS (Histogram-based Outlier Score) that computes anomaly score for each instance of multivariate data assuming the independence of each variable (feature). Basically, the anomaly score of an instance is the product of the weighted anomaly score of each feature of the instance which is the inverse of its bin height in the histogram. The score is larger when the height is smaller. HBOS can detect global outliers as reliably as the state-of-the-art algorithms but it performs poorly for local outliers. It runs much faster than the cluster-based and nearest neighbor based algorithms. Our proposed method is different in two ways. First, we are not proposing a new anomaly score for multivariate variables. We propose to use the histogram of the score for both analysis and visualization, assuming that the score is available. Second, our method detects outliers by considering the score distribution within each bin, i.e. it is detecting local outliers in each bin. Evidently, when the data is univariate, the histogram of the data themselves is used as the score.

3. Conventional methods

We briefly revisit the conventional methods for constructing a histogram from a given set of samples described by a numeric variable, ${\cal X}=\{x_{t}\mid t=1,\ldots,T\}$, where $T$ is the number of samples (the sample size). The values are arranged in ascending order so that $x_{t}\leqslant x_{t+1}$ for each $t$ $(<T)$, which forms the empirical cumulative distribution of the data. We first describe an algorithm for constructing the standard fixed bin width histogram for a predetermined number of bins $K$. For a given set ${\cal X}$, in order to assign each sample to one of the $K$ bins, we set the bin width to $\delta=(x_{T}-x_{1})/K$ and define the left boundary point of the $k$-th bin as $F(k)=x_{1}+(k-1)\delta$ for $k\in\{1,\ldots,K\}$. With an additional value $F(K+1)=x_{T+1}$, where $x_{T+1}$ denotes a value slightly larger than $x_{T}$, we obtain the set of boundary points $\{F(1),\ldots,F(K+1)\}$. Finally, since the samples belonging to the $k$-th bin are ${\cal X}_{k}=\{x_{t}\mid F(k)\leqslant x_{t}<F(k+1)\}$, we can construct an equal width histogram as a line (probability density) function defined by

$\displaystyle h_{K}(s;k)=\frac{|{\cal X}_{k}|}{w_{k}T},\quad\text{where}\quad F(k)\leqslant s<F(k+1).$ (1)

Here $w_{k}$ is the width of the $k$-th bin: $w_{k}=F(k+1)-F(k)=\delta$. Note that $\sum_{k=1}^{K}h_{K}(s;k)w_{k}=1$. Hereafter, this method is referred to as EW (Equal Width) method. We can naturally conjecture that EW method may perform quite poorly in terms of the equal sample size property for some data, although it is optimal in terms of the equal width property. Namely, EW method may have a severe limitation when the distribution of values in ${\cal X}$ contains both dense and sparse regions. We want to have a relatively large number of narrow bins for regions where the samples are dense and their values change substantially from their neighbors, and a relatively small number of wide bins for regions where the samples are sparse, while retaining the good properties of the conventional histograms as a whole.
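As an illustration of EW method, the following Python sketch computes the boundary points $F(k)$ and the densities of Eq. (1). The function name `equal_width_histogram` is hypothetical, and instead of introducing a value $x_{T+1}$ slightly larger than $x_{T}$, the sketch simply assigns the largest sample to the last bin, which is an assumption made for simplicity.

```python
def equal_width_histogram(samples, K):
    """Return (boundaries, densities) of a K-bin equal width histogram."""
    x = sorted(samples)
    T = len(x)
    delta = (x[-1] - x[0]) / K                 # common bin width
    if delta <= 0:
        raise ValueError("all samples are identical; bin width is zero")
    counts = [0] * K
    for v in x:
        # Index of the bin with F(k) <= v < F(k+1); the maximum value
        # is clamped into the last bin.
        k = min(int((v - x[0]) / delta), K - 1)
        counts[k] += 1
    boundaries = [x[0] + k * delta for k in range(K + 1)]
    densities = [c / (delta * T) for c in counts]   # |X_k| / (w_k T)
    return boundaries, densities
```

With `equal_width_histogram([0, 0.1, 0.2, 0.3, 10], 4)`, four of the five samples fall into the first bin and two bins are empty, which shows the weakness of EW method on data with both dense and sparse regions.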

A simple way to cope with this shortcoming is to construct a histogram with equal area, i.e., equal sample size [2, 16]. For this purpose, we introduce an array denoted by $G(\cdot)$ for storing a list of sample index numbers whose values are used as bins’ boundary-points. More specifically, by using two additional values, $G(1)=1$ and $G(K+1)=T+1$ , just like EW method, we can define each value as $G(k+1)=G(k)+\lceil(T-(G(k)-1))/(K-(k-1))\rceil$ for $k\in\{1,\ldots,K-1\}$ . Then, we can construct an equal area histogram as a line function defined by

$\displaystyle h_{K}(s;k)=\frac{G(k+1)-G(k)}{(x_{G(k+1)-1}-x_{G(k)})T},\quad\text{where}\quad x_{G(k)}\leqslant s\leqslant x_{G(k+1)-1}.$ (2)

Here the width of the $k$-th bin is $w_{k}=x_{G(k+1)-1}-x_{G(k)}$, and the set of samples belonging to the $k$-th bin is ${\cal X}_{k}=\{x_{t}\mid G(k)\leqslant t\leqslant G(k+1)-1\}$. Note that the definition of bin boundaries differs slightly from that of EW method: the right-side boundary of the $k$-th bin is not necessarily the same as the left-side boundary of the $(k+1)$-th bin. In general, this difference is negligible in drawing a histogram for the initial dataset. However, as will be made clear in Section 4.2, this definition is advantageous because we propose to redraw the histogram once the outliers have been detected, by removing them and adjusting the bin boundaries and heights accordingly, which allows an empty gap between consecutive bins. Of course, EW method cannot do this while maintaining the equal width property. Hereafter, this method is referred to as EA (Equal Area) method. We can naturally conjecture that EA method may perform quite poorly in terms of the equal width property for some data, although it is optimal in terms of the equal sample size property. EA method has a severe limitation when the number of identical values in ${\cal X}$ is greater than $|{\cal X}|/K$, which can produce an undrawn bin with zero width.
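The recursion for $G(k)$ and the densities of Eq. (2) can be sketched in Python as follows. The function names `equal_area_indices` and `histogram_from_indices` are hypothetical, and returning `None` as the density of a zero-width bin is an assumption made here to represent the undrawn bin mentioned above.

```python
def equal_area_indices(T, K):
    """1-based boundary indices [G(1), ..., G(K+1)] of EA method."""
    G = [1]
    for k in range(1, K):
        remaining_samples = T - (G[-1] - 1)
        remaining_bins = K - (k - 1)
        # Integer ceiling of remaining_samples / remaining_bins.
        step = (remaining_samples + remaining_bins - 1) // remaining_bins
        G.append(G[-1] + step)
    G.append(T + 1)
    return G

def histogram_from_indices(x_sorted, G):
    """Return (left, right, density) per bin, following Eq. (2)."""
    T = len(x_sorted)
    bins = []
    for k in range(len(G) - 1):
        left = x_sorted[G[k] - 1]          # x_{G(k)}
        right = x_sorted[G[k + 1] - 2]     # x_{G(k+1)-1}
        width = right - left
        n = G[k + 1] - G[k]                # number of samples in the bin
        density = n / (width * T) if width > 0 else None
        bins.append((left, right, density))
    return bins
```

For $T=10$ and $K=3$ the indices are `[1, 5, 8, 11]`, i.e., bins of 4, 3 and 3 samples. For the data `[1, 1, 1, 1, 2, 3]` with $K=3$, the first bin contains only the value 1 and has zero width.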

A more sophisticated way to cope with these shortcomings is to construct a histogram with variable bin widths by dividing samples into $K$ bins so as to minimize total variances for these bins, i.e., variance minimization [4, 8, 9]. Let ${\cal G}_{K-1}=\{G(2),\ldots,G(K)\}$ be a set of possible instantiations for the list $G$ such that $G(1)<G(k)<G(k+1)<G(K+1)$ for $k\in\{2,\ldots,K-1\}$ , then we attempt to compute ${\cal G}_{K-1}$ so as to minimize the following objective function,

$\displaystyle\ell^{2}_{K-1}({\cal G}_{K-1})=\frac{1}{T}\sum_{k=1}^{K}\sum_{t=G(k)}^{G(k+1)-1}(x_{t}-\mu(G(k),G(k+1)-1))^{2}.$ (3)

Then, from the obtained ${\cal G}_{K-1}$ , similarly to EA method, we can construct a variance minimization histogram as a line function defined by

$\displaystyle h_{K}(s;k)=\frac{G(k+1)-G(k)}{(x_{G(k+1)-1}-x_{G(k)})T},\quad\text{where}\quad x_{G(k)}\leqslant s\leqslant x_{G(k+1)-1}.$ (4)

Here $\mu(G(k),G(k+1)-1)$ stands for the mean of $\{x_{t}\mid G(k)\leqslant t\leqslant G(k+1)-1\}$. Hereafter, this method is referred to as VM (Variance Minimization) method. In this paper, we present a general framework that naturally derives this VM approach, and then propose a method based on a change-point detection algorithm.

4. Proposed method

We propose new methods for 1) constructing a histogram with variable bin width which is equipped with outlier detection capability for a given set of samples represented by a numeric variable, and 2) characterizing the numeric samples in bins by annotating the bin using terms in the nominal variables if each numeric variable comes with a set of nominal variables. These two tasks are separate. If only a numeric variable is given and no nominal variables accompany it, only the histogram is constructed from the numeric variables, outliers are detected, the final histogram is obtained by removing detected outliers and adjusting bin boundaries and heights, and the annotation part is skipped.

4.1 Histogram construction

For a given data set ${\cal X}=\{x_{1},\ldots,x_{T}\}$ and the number of bins $K$, we propose a method for constructing a histogram with variable bin width, which exploits a change-point detection algorithm based on an absolute or squared error criterion. Recall that $G(k)$ is a sample index such that $G(1)=1$, $G(K+1)=T+1$, and $1<G(k)<G(k+1)<T+1$ for $k\in\{2,\ldots,K-1\}$, and that the set of possible instantiations is ${\cal G}_{K-1}=\{G(2),\ldots,G(K)\}$, which is also regarded as a set of change points in this paper. We then present our general framework, formulated by the following objective function, i.e., a weighted sum of distortion terms to be minimized with respect to ${\cal G}_{K-1}$:

$\displaystyle\ell_{K-1}({\cal G}_{K-1})=\sum_{k=1}^{K}\phi_{k}d(G(k),G(k+1)-1),$ (5)

where $\phi_{k}$ and $d(s,t)$ stand for a positive weight and an arbitrary distortion term for $\{x_{s},\ldots,x_{t}\}$, respectively. In the case of VM (Variance Minimization) method mentioned earlier, we define these distortion terms as the variances as follows,

$\displaystyle d(G(k),G(k+1)-1)=\frac{1}{G(k+1)-G(k)}\sum_{t=G(k)}^{G(k+1)-1}(x_{t}-\mu(G(k),G(k+1)-1))^{2},$ (6)

where $\phi_{k}$ in Eq. (5) is instantiated to $(G(k+1)-G(k))/T$, i.e., the fraction of samples that fall in the $k$-th bin, and recall that $\mu(G(k),G(k+1)-1)$ stands for the mean of $\{x_{t}\mid G(k)\leqslant t\leqslant G(k+1)-1\}$. With this weight, the factor $1/(G(k+1)-G(k))$ in Eq. (6) cancels, so we can easily derive Eq. (3) by substituting Eq. (6) into Eq. (5). On the other hand, under this general framework, we propose a new method, named DM (Distance Minimization) method, that uses a different measure by defining these terms as the average distances based on the absolute error as follows,

$\displaystyle d(G(k),G(k+1)-1)=\frac{1}{G(k+1)-G(k)}\sum_{t=G(k)}^{G(k+1)-1}|x_{t}-\nu(G(k),G(k+1)-1)|,$ (7)

where the instantiation of $\phi_{k}$ is the same as above and $\nu(G(k),G(k+1)-1)$ stands for the median of $\{x_{t}\mid G(k)\leqslant t\leqslant G(k+1)-1\}$. Note that for a one-dimensional variable, Minkowski’s $p$-distance reduces to the absolute error, i.e., $(|x_{s}-x_{t}|^{p})^{1/p}=|x_{s}-x_{t}|$. We also note that the objective functions obtained by substituting Eqs (7) and (6) into Eq. (5) are equivalent to the mean absolute error (MAE) and the mean squared error (MSE), respectively, which are used as the evaluation criteria in our experiments shown later.
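To make the two objective functions concrete, the sketch below evaluates them directly in the MAE and MSE forms mentioned above for a given list of change points. The function name `objective` and its argument layout are hypothetical; `G` holds the 1-based indices $[G(1),\ldots,G(K+1)]$ with $G(1)=1$ and $G(K+1)=T+1$, and the data are assumed to be sorted already.

```python
from statistics import mean, median

def objective(x_sorted, G, criterion="DM"):
    """MAE about bin medians (DM) or MSE about bin means (VM)."""
    T = len(x_sorted)
    total = 0.0
    for k in range(len(G) - 1):
        seg = x_sorted[G[k] - 1:G[k + 1] - 1]    # samples x_{G(k)}..x_{G(k+1)-1}
        if criterion == "DM":
            c = median(seg)
            total += sum(abs(v - c) for v in seg)
        else:
            c = mean(seg)
            total += sum((v - c) ** 2 for v in seg)
    return total / T
```

For `x_sorted = [1, 2, 3, 10, 11, 12]`, the split `G = [1, 4, 7]` gives a value of $4/6$ under both criteria, whereas the single bin `G = [1, 7]` gives an MAE of $4.5$, so the split at the gap between 3 and 10 is strongly preferred.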

In what follows, we propose a unified solution method based on the change-point detection algorithm both for DM and VM methods. The overview of our algorithm is as follows.

Algorithm 4.1: Variable Bin Width Histogram

Input: A numeric data set ${\cal X}=\{x_{1},\ldots,x_{T}\}$, the number of bins $K$

Output: A variable bin width histogram $h_{K}(s;k)$

1: Initialize: Sort the elements of ${\cal X}$ in ascending order so as to satisfy $x_{t}\leqslant x_{t+1}$, and set $G(1)=1$, $G(K+1)=T+1$.

2: Find change points ${\cal G}_{K-1}=\{1<G(k)<T+1;\ 2\leqslant k\leqslant K\}$ by minimizing the objective function $\ell^{1}_{K-1}({\cal G}_{K-1})$ or $\ell^{2}_{K-1}({\cal G}_{K-1})$.

3: Construct a histogram $h_{K}(s;k)\leftarrow(G(k+1)-G(k))/((x_{G(k+1)-1}-x_{G(k)})T)$, where $x_{G(k)}\leqslant s\leqslant x_{G(k+1)-1}$.

In Algorithm 4.1, ${\cal G}_{K-1}$ is a set of change points, $\ell^{1}_{K-1}({\cal G}_{K-1})$ and $\ell^{2}_{K-1}({\cal G}_{K-1})$ are the objective functions used to find the change points. Note that the number of change points is $K-1$ . The details of the algorithm are as follows.

As mentioned above, we fit a step function to ${\cal X}$ so as to minimize the sum of errors. By employing either the absolute or the squared error as the error criterion, we derive two different change point detection methods, which are simply referred to as DM and VM methods, respectively. First, we consider the case in which there is no change point, which means that the sum of the errors is minimized by a single value. In the case of DM method, the absolute error $\ell^{1}_{0}=T^{-1}\sum_{t=1}^{T}|x_{t}-\nu(1,T)|$ is minimized by using the median value $\nu(1,T)$, where the median function $\nu(a,b)$ is defined by $\nu(a,b)=x_{(a+b)/2}$ if $a+b$ is even, and by $\nu(a,b)=(x_{\lfloor(a+b)/2\rfloor}+x_{\lfloor(a+b)/2\rfloor+1})/2$ otherwise. Note that $\nu(a,b)$ can be obtained efficiently because the values in ${\cal X}$ are sorted in advance. On the other hand, in the case of VM method, the squared error $\ell^{2}_{0}=T^{-1}\sum_{t=1}^{T}(x_{t}-\mu(1,T))^{2}$ is minimized by using the mean value $\mu(1,T)$, where the mean function $\mu(a,b)$ is defined by $\mu(a,b)=(b-a+1)^{-1}\sum_{t=a}^{b}x_{t}$.
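The median function $\nu(a,b)$ and the mean function $\mu(a,b)$ can be sketched in Python as below, using the 1-based inclusive indices of the text. The class name `SortedSegmentStats` is hypothetical, and the use of prefix sums to obtain $\mu(a,b)$ in constant time is an implementation choice assumed here, not something the text prescribes.

```python
from itertools import accumulate

class SortedSegmentStats:
    """Median nu(a, b) and mean mu(a, b) of x_a..x_b on sorted data."""

    def __init__(self, samples):
        self.x = sorted(samples)
        # prefix[i] = x_1 + ... + x_i, with prefix[0] = 0
        self.prefix = [0.0] + list(accumulate(self.x))

    def nu(self, a, b):
        m = (a + b) // 2
        if (a + b) % 2 == 0:
            return self.x[m - 1]                 # x_{(a+b)/2}: odd segment length
        return (self.x[m - 1] + self.x[m]) / 2   # average of the two middle samples

    def mu(self, a, b):
        return (self.prefix[b] - self.prefix[a - 1]) / (b - a + 1)
```

For the data `[5, 1, 3, 2, 4]`, `nu(1, 5)` is 3, `nu(2, 5)` is 3.5 and `mu(2, 4)` is 3.0. Because the data are sorted once in the constructor, each call takes constant time.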

Next, we consider the case that there exists only one change point expressed by a sample index $\tau$. Then, the absolute error $\ell^{1}_{1}(\tau)$ and the squared error $\ell^{2}_{1}(\tau)$ are respectively given by the following formulae:

$\displaystyle\ell^{1}_{1}(\tau)=\frac{1}{T}\sum_{t=1}^{\tau-1}|x_{t}-\nu(1,\tau-1)|+\frac{1}{T}\sum_{t=\tau}^{T}|x_{t}-\nu(\tau,T)|,$ (8)

$\displaystyle\ell^{2}_{1}(\tau)=\frac{1}{T}\sum_{t=1}^{\tau-1}(x_{t}-\mu(1,\tau-1))^{2}+\frac{1}{T}\sum_{t=\tau}^{T}(x_{t}-\mu(\tau,T))^{2}.$ (9)

Evidently, we also need to minimize $\ell^{h}_{1}(\tau)$ with respect to $\tau$ , where $h\in\{1,2\}$ . Here we discuss the computational complexity of our proposed method for obtaining the following optimal change point:

$\displaystyle\hat{\tau}=\mathop{\mathrm{argmin}}\limits_{\tau\in\{1,\ldots,T\}}\ell^{h}_{1}(\tau).$ (10)

In case of the absolute error, after computing $\ell^{1}_{1}(1)$ with the computational complexity of $O(T)$ , we can successively compute $\ell^{1}_{1}(\tau)$ from $\tau=2$ to $T$ by using the following update formula with the computational complexity of $O(1)$ .

$\displaystyle\ell^{1}_{1}(\tau+1)=\ell^{1}_{1}(\tau)+\frac{2x_{\tau+1}-\nu(1,\tau+1)-\nu(\tau+1,T)+\eta_{L}(1,\tau)-\eta_{R}(\tau+1,T)}{T},$ (11)

where $\eta_{L}(a,b)=\nu(a,b+1)-\nu(a,b)$ if $a+b$ is odd; otherwise 0; and $\eta_{R}(a,b)=\nu(a+1,b)-\nu(a,b)$ if $a+b$ is odd; otherwise 0. Therefore, we can see that the optimal $\hat{\tau}$ defined in Eq. (10) can be obtained with the total computational complexity of $O(T)$ . On the other hand, in case of the squared error, we can transform Eq. (9) as follows:

$\displaystyle\ell^{2}_{1}(\tau)=\frac{1}{T}\sum_{t=1}^{T}x_{t}^{2}-\frac{(\tau-1)\,\mu(1,\tau-1)^{2}+(T-\tau+1)\,\mu(\tau,T)^{2}}{T}.$ (12)

We can similarly compute $\ell^{2}_{1}(1)$ with the total computational complexity of $O(T)$ , and successively compute $\ell^{2}_{1}(\tau)$ from $\tau=2$ to $T-1$ by updating each pair of $\mu(1,\tau)$ and $\mu(\tau+1,T)$ with the computational complexity of $O(1)$ . Therefore, we can see that the optimal $\hat{\tau}$ defined in Eq. (10) can be obtained with the total computational complexity of $O(T)$ , as in the case of the absolute error.
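The linear-time search for a single change point in the squared error case can be sketched as follows. This is a minimal illustration of Eq. (12) using running sums, not the authors' C implementation. The function name `best_split_sq` is hypothetical, and we assume the search is restricted to $\tau=2,\ldots,T$ so that both segments are non-empty.

```python
def best_split_sq(x):
    """Return (tau, error) minimizing Eq. (12) over tau = 2..T (1-based).

    tau is the first index of the right segment, so the left segment is
    x_1..x_{tau-1} and the right segment is x_tau..x_T.
    """
    T = len(x)
    total = sum(x)
    total_sq = sum(v * v for v in x)
    left = 0.0
    best_tau, best_err = None, float("inf")
    for tau in range(2, T + 1):
        left += x[tau - 2]                 # move x_{tau-1} into the left segment
        right = total - left
        n_left, n_right = tau - 1, T - tau + 1
        # Eq. (12): n * mu^2 equals (segment sum)^2 / n, an O(1) update per tau.
        err = (total_sq - left * left / n_left - right * right / n_right) / T
        if err < best_err:
            best_tau, best_err = tau, err
    return best_tau, best_err


# Hypothetical sorted toy data with an obvious gap between 3 and 10.
tau_hat, err = best_split_sq([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
print(tau_hat, err)
```

Each step updates only the two segment sums, so the whole scan costs $O(T)$, in line with the complexity stated above.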

Now, we generalize these error functions, $\ell^{1}_{1}(\tau)$ and $\ell^{2}_{1}(\tau)$. Namely, when the number of change points is $K-1$, i.e., when we have $K$ bins, let $G(k)$ be the sample index that corresponds to the $k$-th change point. Again, by using two additional indices, $G(1)=1$ and $G(K+1)=T+1$, we can consider a set of sample indices defined by ${\cal G}_{K-1}=\{G(1),\ldots,G(K+1)\}$. Then, we can express the generalized error functions $\ell^{1}_{K-1}({\cal G}_{K-1})$ and $\ell^{2}_{K-1}({\cal G}_{K-1})$ as follows:

$\displaystyle\ell^{1}_{K-1}({\cal G}_{K-1})=\frac{1}{T}\sum_{k=1}^{K}\sum_{t=G(k)}^{G(k+1)-1}|x_{t}-\nu(G(k),G(k+1)-1)|,$

$\displaystyle\ell^{2}_{K-1}({\cal G}_{K-1})=\frac{1}{T}\sum_{k=1}^{K}\sum_{t=G(k)}^{G(k+1)-1}(x_{t}-\mu(G(k),G(k+1)-1))^{2}.$

Therefore, we can formalize our change point detection problem as the minimization of $\ell^{h}_{K-1}({\cal G}_{K-1})$, $h\in\{1,2\}$, with respect to ${\cal G}_{K-1}$. In order to obtain ${\cal G}_{K-1}$, we employ an efficient local improvement algorithm described in [14]. After obtaining the set of sample indices, ${\cal G}_{K-1}$, as a line function, we can construct the following histogram with variable bin width:

$\displaystyle h_{K}(s;k)=\frac{G(k+1)-G(k)}{(x_{G(k+1)-1}-x_{G(k)})T},~~\text{where}~~x_{G(k)}\leqslant s\leqslant x_{G(k+1)-1}.$ (13)

Here the width of the $k$-th bin is $w_{k}=x_{G(k+1)-1}-x_{G(k)}$, and the set of samples belonging to the $k$-th bin is obtained as ${\cal X}_{k}=\{x_{t}~|~G(k)\leqslant t\leqslant G(k+1)-1\}$.

We now turn to the computational complexity. By fixing the $K-2$ change points in ${\cal G}_{K-2}$, we consider optimally adding only one change point to one of the $K-1$ intervals defined by $\{\{G(k),\ldots,G(k+1)-1\}~|~k\in\{1,\ldots,K-1\}\}$, where we recall that $G(1)=1$ and $G(K)=T+1$ in this setting. We note that the optimal change point within each interval can be obtained with the computational complexity of $O(G(k+1)-G(k))$, and thus we can easily see that the total computational complexity becomes $O(\sum_{k=1}^{K-1}(G(k+1)-G(k)))=O(T)$. Let $N$ be the number of iterations for repeatedly examining the optimality with respect to each of the $K-1$ change points by fixing the remaining $K-2$ points. Then, the computational complexity of our proposed method to optimize $\ell^{h}_{K-1}({\cal G}_{K-1})$ based on the local improvement strategy is $O(NKT)$ for both the absolute and the squared error cases.
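The following sketch shows one possible reading of the local improvement strategy together with the histogram construction of Eq. (13). It is not the algorithm of [14], whose details are not reproduced here: we assume an equal-count initialization, re-optimize each change point in turn with its neighbours fixed, and evaluate segment errors with prefix sums (a technique we swap in for the update formula of Eq. (11)). Indices are 0-based and half-open, and all function names are hypothetical.

```python
from itertools import accumulate


def make_cost(x, squared):
    """Return cost(i, j): unnormalized error of the sorted slice x[i:j]."""
    p1 = [0.0] + list(accumulate(x))                  # prefix sums of x
    p2 = [0.0] + list(accumulate(v * v for v in x))   # prefix sums of x^2

    def cost(i, j):
        n = j - i
        if squared:   # VM: sum (x - mean)^2 = sum x^2 - (sum x)^2 / n
            s = p1[j] - p1[i]
            return (p2[j] - p2[i]) - s * s / n
        h = n // 2    # DM: sum |x - median| = (upper half sum) - (lower half sum)
        return (p1[j] - p1[j - h]) - (p1[i + h] - p1[i])

    return cost


def local_improvement(x, K, squared=False, max_iter=100):
    """Refine K-1 change points on sorted data x (assumes len(x) >= K)."""
    T = len(x)
    cost = make_cost(x, squared)
    g = [k * T // K for k in range(K + 1)]            # equal-count start (assumption)
    for _ in range(max_iter):
        changed = False
        for k in range(1, K):
            lo, hi = g[k - 1], g[k + 1]
            best = min(range(lo + 1, hi),
                       key=lambda c: cost(lo, c) + cost(c, hi))
            if cost(lo, best) + cost(best, hi) < cost(lo, g[k]) + cost(g[k], hi):
                g[k], changed = best, True
        if not changed:                               # no change point moved
            break
    return g


def histogram(x, g):
    """Bins (left, right, height) following Eq. (13).

    Bins of zero width get height None (our choice, not from the paper).
    """
    T = len(x)
    bins = []
    for k in range(len(g) - 1):
        left, right = x[g[k]], x[g[k + 1] - 1]
        width = right - left
        height = (g[k + 1] - g[k]) / (width * T) if width > 0 else None
        bins.append((left, right, height))
    return bins


# Hypothetical sorted toy data with two well-separated groups.
data = [1.0, 2.0, 3.0, 101.0, 102.0, 103.0, 104.0, 105.0]
g_dm = local_improvement(data, K=2, squared=False)
g_vm = local_improvement(data, K=2, squared=True)
print(g_dm, histogram(data, g_dm))
print(g_vm, histogram(data, g_vm))
```

One sweep over all change points touches each sample a constant number of times, so $N$ sweeps cost $O(NT)$ cost evaluations in this sketch; the objective never increases, so the loop stops at a local optimum.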

4.2 Outlier detection

By virtue of our variable bin width histogram, we can consider adjusting the bin width by detecting the outliers and separately plotting them within each bin. The idea of this adjustment can be viewed as an implementation of our assumption that the samples within each bin should be distributed as uniformly as possible. As a means to detect the outliers, we use the interquartile range method, which does not assume that the samples follow the normal distribution, and apply it to the data samples within each bin separately.

Let ${\cal V}=\{v_{u}~|~u=1,\ldots,U\}\subset{\cal X}$ be a set of samples within a bin, where $U=|{\cal V}|$ is the number of samples which fall in this bin and we assume $v_{u}\leqslant v_{u+1}$ for each $u<U$. We then compute the interquartile range, denoted by $\textit{IQR}$,

$\displaystyle\textit{IQR}=v_{(U-\lceil U/4\rceil)}-v_{(\lceil U/4\rceil)}.$ (14)

Thus, we can identify a set of outlier samples ${\cal V}_{O}$ based on the interquartile range method with a parameter $\alpha$, which we set to 1.5 as is widely used in IQR-based box plots. When the samples follow the normal distribution, the quartiles lie at about $0.6745\sigma$ from the mean and $\textit{IQR}\approx 1.349\sigma$, so this setting corresponds to regarding samples that are more than about $0.6745\sigma+1.5\times 1.349\sigma\approx 2.70\sigma$ away from the mean as outliers. We have used the same value throughout all bins in this study. The use of domain knowledge may help assign different values to different bins, but whether this is possible and works well needs additional investigation and is beyond the scope of the current study.

$\displaystyle{\cal V}_{O}=\{v_{u}\in{\cal V}~|~v_{u}<v_{(\lceil U/4\rceil)}-\alpha\,\textit{IQR}~\vee~v_{u}>v_{(U-\lceil U/4\rceil)}+\alpha\,\textit{IQR}\}.$ (15)

Namely, in our proposed method equipped with the above outlier detection method, we adjust the bin width and the height from $\max({\cal V})-\min({\cal V})$ to $\max({\cal V}_{N})-\min({\cal V}_{N})$ and from $|{\cal V}|$ to $|{\cal V}_{N}|$, respectively, and plot each outlier sample in ${\cal V}_{O}$ separately as a dot, where ${\cal V}_{N}$ denotes the set of normal samples obtained by ${\cal V}_{N}={\cal V}\setminus{\cal V}_{O}$. In other words, we set the boundary values of the bin to $\min({\cal V}_{N})$ and $\max({\cal V}_{N})$.

Note that we can also apply the above outlier detection framework to EW and EA methods, and plot each outlier sample in ${\cal V}_{O}$ separately as a dot. However, in the case of EW method, we cannot adjust each bin width due to its intrinsic equal-width property, while in the case of EA method, in addition to the adjustment of each bin width, we also need to change the bin height from $|{\cal V}|/(\max({\cal V})-\min({\cal V}))$ to $|{\cal V}_{N}|/(\max({\cal V}_{N})-\min({\cal V}_{N}))$ to ensure its intrinsic equal-area property. Finally, it should be emphasized that we can replace the above interquartile range method with any other advanced outlier detection method in our framework.
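The per-bin outlier rule of Eqs (14) and (15) can be sketched as follows. The function name `iqr_outliers` and the toy bin are our own assumptions. We also assume that a bin with fewer than two samples is returned unchanged, since the index $U-\lceil U/4\rceil$ is not defined for $U=1$.

```python
import math


def iqr_outliers(values, alpha=1.5):
    """Split the samples of one bin into (outliers, normal) per Eqs (14)-(15)."""
    v = sorted(values)
    U = len(v)
    if U < 2:                      # assumption: too few samples to apply the rule
        return [], v
    q = math.ceil(U / 4)
    q1 = v[q - 1]                  # v_(ceil(U/4)), 1-based in the text
    q3 = v[U - q - 1]              # v_(U - ceil(U/4))
    iqr = q3 - q1                  # Eq. (14)
    lo, hi = q1 - alpha * iqr, q3 + alpha * iqr
    outliers = [s for s in v if s < lo or s > hi]     # Eq. (15)
    normal = [s for s in v if lo <= s <= hi]
    return outliers, normal


# Hypothetical bin containing one isolated sample.
outliers, normal = iqr_outliers([10, 11, 12, 13, 14, 15, 16, 17, 50])
print("outliers:", outliers)
print("adjusted boundaries:", min(normal), max(normal))
print("adjusted width and count:", max(normal) - min(normal), len(normal))
```

The adjusted boundaries, width, and count printed at the end are the quantities used to redraw the bin after the outliers are removed.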

4.3 Annotation generation

Once the histogram has been constructed, we assign annotation terms to the obtained bins from a set of nominal variables associated with the numeric variable. Let ${\cal Y}^{(i)}=\{y_{t}^{(i)}~|~t=1,\ldots,T\}$ be a set of nominal values for the $i$-th variable, where $i\in\{1,\ldots,I\}$ and $I$ denotes the number of variables. Thus, each sample with a numeric value $x_{t}$ has the corresponding nominal values $\{y_{t}^{(1)},\ldots,y_{t}^{(I)}\}$. We further assume that each nominal variable takes exactly one of $J^{(i)}$ categories identified by a positive integer from $1$ to $J^{(i)}$, i.e., $y_{t}^{(i)}\in\{1,\ldots,J^{(i)}\}$.

For a pair of the $i$-th variable and its category $j\in\{1,\ldots,J^{(i)}\}$, we can define the set of samples such that $y_{t}^{(i)}=j$ by ${\cal T}^{(i,j)}=\{t~|~y_{t}^{(i)}=j\}$, and compute the empirical probability $p^{(i,j)}$ by $p^{(i,j)}=|{\cal T}^{(i,j)}|/T$. Let ${\cal T}_{k}=\{t~|~G(k)\leqslant t\leqslant G(k+1)-1\}$ denote the set of sample indices in the $k$-th bin of the obtained histogram. For this bin, we can compute the expected number of samples such that $y_{t}^{(i)}=j$ and its standard deviation by $p^{(i,j)}|{\cal T}_{k}|$ and $\sqrt{p^{(i,j)}(1-p^{(i,j)})|{\cal T}_{k}|}$, respectively. Thus, we can compute the following $z$-score $z_{k}^{(i,j)}$ for the samples with $y_{t}^{(i)}=j$ in the $k$-th bin:

$\displaystyle z_{k}^{(i,j)}=\frac{|{\cal T}^{(i,j)}\cap{\cal T}_{k}|-p^{(i,j)}|{\cal T}_{k}|}{\sqrt{p^{(i,j)}(1-p^{(i,j)})|{\cal T}_{k}|}}.$ (16)

When the $z$-score $z_{k}^{(i,j)}$ is substantially large, we can consider that the $i$-th variable with the $j$-th category characterizes the $k$-th bin. In our proposed method, for a predetermined number $H$, we output the top-$H$ pairs of the $i$-th variable and the $j$-th category as annotation terms for the $k$-th bin, according to the $z$-score $z_{k}^{(i,j)}$.
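A minimal sketch of the annotation step based on Eq. (16) is given below. The data layout (a list of bin indices plus one dictionary of nominal values per sample), the function name `annotate`, and the decision to skip terms with $p^{(i,j)}=1$ (for which the denominator vanishes) are our own assumptions.

```python
import math
from collections import Counter


def annotate(bin_of, nominal, H):
    """Return {bin: top-H [(z, (variable, category)), ...]} using Eq. (16).

    bin_of[t]  : bin index k of sample t
    nominal[t] : dict mapping variable name -> category for sample t
    """
    T = len(bin_of)
    bin_size = Counter(bin_of)                 # |T_k|
    total, joint = Counter(), Counter()        # |T^(i,j)| and its overlap with T_k
    for k, row in zip(bin_of, nominal):
        for term in row.items():               # term = (variable, category)
            total[term] += 1
            joint[(k, term)] += 1
    result = {}
    for k, size in bin_size.items():
        scored = []
        for term, n in total.items():
            p = n / T                          # empirical probability p^(i,j)
            if p >= 1.0:                       # assumption: skip degenerate terms
                continue
            z = (joint[(k, term)] - p * size) / math.sqrt(p * (1 - p) * size)
            scored.append((z, term))
        scored.sort(key=lambda s: s[0], reverse=True)
        result[k] = scored[:H]
    return result


# Hypothetical toy input: bin 0 is entirely "high" humidity, bin 1 entirely "low".
bin_of = [0, 0, 0, 0, 1, 1, 1, 1]
nominal = [{"Humi": "high"}] * 4 + [{"Humi": "low"}] * 4
print(annotate(bin_of, nominal, H=1))
```

In this toy case each category is expected in 2 of the 4 samples of a bin with a standard deviation of 1, so observing all 4 gives $z=2$ for the matching term.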

5. Experimental evaluations

We have conducted experiments using both the real-world datasets (distribution unknown) and the synthetic datasets generated with known distributions to evaluate the proposed method, in particular, from the following five perspectives: 1) Computational efficiency (Section 5.2), 2) Equal width and equal area properties of constructed histograms (Section 5.3), 3) Homogeneity property within each bin (Section 5.4), 4) Appropriateness of visualized histograms and outlier detection (Section 5.5), and 5) Appropriateness of annotation (Section 5.6). In particular, perspectives 2) and 3) are the keys to achieving the desirable properties of the constructed histograms.

5.1 Datasets and settings

In our experiments, we used the real environmental datasets obtained from a vinyl greenhouse of a rose farmer in Japan. In this paper, we employed the humidity deficit (HD) [$g/m^{3}$] and the CO2 concentration (CO2) [ppm] as the numeric variables. The HD is an indicator of how much water vapor can be contained at a particular temperature and humidity. Controlling HD within a specific range is considered important for the growth of agricultural products; therefore, the values tend to concentrate within a specific range. The IoT device we used does not have a sensor that measures HD directly, so it was calculated using the following formula:

$\displaystyle\text{HD}=(100-\textit{Humi})\times\frac{217\times\frac{6.1078\times 10^{\frac{7.5\times\textit{Temp}}{\textit{Temp}+237.3}}}{\textit{Temp}+273.15}}{100},$ (17)

where Humi and Temp represent the relative humidity and the temperature in Celsius, respectively, both of which are measurable, and we used their recorded values. HD values of 3 to 6 $g/m^{3}$ are considered to be optimal.
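Equation (17) can be evaluated directly, as in the following sketch. The function name and the intermediate variable names are ours. We read the inner fraction as a saturation vapor amount obtained from a Tetens-type saturation vapor pressure, which is our interpretation rather than a statement from the source.

```python
def humidity_deficit(temp_c, humi_pct):
    """Humidity deficit [g/m^3] from Eq. (17).

    temp_c   : temperature in degrees Celsius
    humi_pct : relative humidity in percent (0..100)
    """
    # Innermost term of Eq. (17); our reading: saturation vapor pressure.
    e_sat = 6.1078 * 10 ** (7.5 * temp_c / (temp_c + 237.3))
    # Middle term of Eq. (17); our reading: saturation vapor amount.
    vapor_sat = 217 * e_sat / (temp_c + 273.15)
    return (100 - humi_pct) * vapor_sat / 100


# Hypothetical readings: 25 degrees Celsius at 80 % relative humidity.
print(round(humidity_deficit(25.0, 80.0), 2))
```

With these hypothetical readings the result falls inside the 3 to 6 $g/m^{3}$ range mentioned above, and the deficit is zero whenever the relative humidity is 100%.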

The Temp and Humi data are measured every 5 min, i.e., 288 data points per day. We used the data measured from 00:00 on March 27, 2018 to 23:55 on May 7, 2018, i.e., data for 42 days, to compute HD, which amounts to 12,096 values each. The CO2 data are measured every 10 min, i.e., 144 data points per day. We used the data measured from 00:00 on June 22, 2019 to 23:50 on October 16, 2019, i.e., data for 117 days, which amounts to 16,848 data points. We confirmed that there was no device failure, and thus there are no missing values in any of the above data. We did not do any pre-processing to clean the data. Thus, the data may include measurement noise, but we know it is small and ignored its effect.

We employed the humidity (Humi), the temperature (Temp), and the time-window (Hour) at each time point as the nominal variables to annotate HD histograms. Therefore, the data used for annotation consist of the numeric value $x_{t}=\textit{HD}$ and the three nominal values ($I=3$), i.e., $y_{t}^{(1)}=\textit{Humi}$, $y_{t}^{(2)}=\textit{Temp}$, $y_{t}^{(3)}=\textit{Hour}$. Each nominal variable was discretized into several intervals according to its value, i.e., 10 values for Humi from “Humi0-10” to “Humi90-100”, 10 values for Temp from “Temp0-5” to “Temp45-”, and 8 values for Hour from “Hour0-3” to “Hour21-24”, where each postfix means the range of values, i.e., $J^{(1)}=10,J^{(2)}=10,J^{(3)}=8$.

In addition to the above real datasets, we used two sets of synthetic data based on probability distributions. The first set is based on unimodal distributions, which includes the uniform distribution on the interval [20, 80], the exponential distribution with the parameter $\lambda=0.1$ supported on the interval $[20,\infty)$, and the normal distribution with the mean $\mu=50$ and the standard deviation $\sigma=10$, which we call Uni-Unif, Uni-Expo, and Uni-Norm, in this order. In our experiments, we generated 10,000 sample values for each dataset to make the size comparable to that of the real datasets.

The other set is based on bimodal distributions, which includes a mixture of two uniform distributions on the intervals [20, 80] and [120, 180], a mixture of two exponential distributions with the parameter $\lambda=0.1$ supported on the intervals $[20,\infty)$ and $[120,\infty)$, and a mixture of two normal distributions with the means $\mu_{1}=50$, $\mu_{2}=150$, and the standard deviations $\sigma_{1}=\sigma_{2}=10$, which we call Bi-Unif, Bi-Expo, and Bi-Norm, in this order. In the bimodal datasets, each mode has 10,000 sample values. In addition, we added a sample with the value 100 as an anomaly sample. Therefore, the number of samples including the anomaly value in each bimodal dataset is 20,001.
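For readers who wish to reproduce datasets of this kind, the following sketch generates samples with the stated parameters using Python's standard `random` module. The function names, the random seed, and the reading of "exponential distribution supported on $[20,\infty)$" as an exponential variate shifted by 20 are our assumptions; the samples will of course differ from those used in the paper.

```python
import random


def make_unimodal(kind, n, rng, shift=0.0):
    """Draw n samples from one of the three base distributions, shifted by `shift`."""
    if kind == "unif":      # uniform on [20, 80]
        return [shift + rng.uniform(20, 80) for _ in range(n)]
    if kind == "expo":      # exponential with lambda = 0.1, starting at 20
        return [shift + 20 + rng.expovariate(0.1) for _ in range(n)]
    if kind == "norm":      # normal with mean 50 and standard deviation 10
        return [shift + rng.gauss(50, 10) for _ in range(n)]
    raise ValueError(f"unknown kind: {kind}")


def make_bimodal(kind, n, rng, anomaly=100.0):
    """Two modes of n samples each (the second shifted by 100) plus one anomaly."""
    first = make_unimodal(kind, n, rng)
    second = make_unimodal(kind, n, rng, shift=100.0)
    return first + second + [anomaly]


rng = random.Random(0)      # fixed seed for repeatability (arbitrary choice)
uni_norm = make_unimodal("norm", 10000, rng)
bi_unif = make_bimodal("unif", 10000, rng)
print(len(uni_norm), len(bi_unif))
```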

According to Sturges’ formula, the appropriate number of bins for the fixed bin-width histogram becomes $\lceil\log_{2}T+1\rceil\approx 15$ for $T$ between 10,000 and 20,000. Thus, we chose to use the number of bins $K\in\{4,8,16,32\}$ in our experiments.
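The value suggested by Sturges’ formula for the dataset sizes used here can be checked with a few lines; the function name `sturges_bins` is ours.

```python
import math


def sturges_bins(T):
    """Number of bins suggested by Sturges' formula: ceil(log2(T) + 1)."""
    return math.ceil(math.log2(T) + 1)


# Sizes of the datasets described in this section.
for T in (10000, 12096, 16848, 20001):
    print(T, sturges_bins(T))
```

The formula gives 15 for $T=10{,}000$ and 16 for $T=20{,}001$, so $K=16$ is the tested value closest to its suggestion.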

5.2 Evaluation of computational efficiency

Figure 1.

Computational efficiency of our proposed method.

We report the computational efficiency of our DM and VM methods. Figure 1 shows the results of processing time, where Fig. 1(a)–(c) correspond to the real, synthetic unimodal, and synthetic bimodal datasets, respectively. Our programs are written in C and run on a computer with Xeon X5690 3.47 GHz CPUs and 192 GB of main memory, using a single thread.

From the experimental results, we see that the processing times increase almost linearly with respect to $K$. Recall that both methods work with the computational complexity of $O(NKT)$, where $N$ is the number of iterations for local improvement. These results suggest that the number $N$ was almost constant within this range of $K$. However, we note that DM method works substantially faster (about two or three times) than VM method, by virtue of the effective update formula shown in Eq. (11). We also note that the HD dataset needed more time than the CO2 dataset although the number of samples of the former is smaller than that of the latter. This indicates that the number $N$ of local improvement iterations naturally depends on the data distribution in the datasets.

Further, we observe, as shown in Fig. 1(b) and 1(c), that the processing times are quite comparable regardless of the types of basic distributions, and that the respective processing times for the synthetic unimodal datasets and the synthetic bimodal datasets are almost the same even though the number of samples of the latter is twice as large as that of the former. The second observation could be explained by the numbers of local improvement iterations for the latter being smaller than those for the former, because the change points between the well-separated distributions can be easily detected in the case of the synthetic bimodal datasets. In summary, we can say that the change point detection algorithm works well for different kinds of data distributions and that the proposed method is scalable, as is predicted by the complexity analysis.

5.3 Evaluation of equal width and equal area properties

Figure 2.

Evaluation in terms of equal width property.

Figure 3.

Evaluation in terms of equal sample size property.

We show to what extent our histograms have the equal width and the equal area properties for various data distributions. We compute the entropy for the width of each bin for the constructed histogram $h_{K}(\cdot)$ to quantitatively evaluate the equal width property, which means that the data range is divided into a fixed number of bins without having too narrow or too wide bins:

$\displaystyle E_{\textit{width}}(h_{K}(\cdot))=-\sum_{k=1}^{K}\frac{w_{k}}{W}\log\frac{w_{k}}{W},$ (18)

where $w_{k}$ is the width of the $k$-th bin, and $W=\sum_{k=1}^{K}w_{k}$. Equation (18) takes the maximum value $\log K$ when $w_{k}=W/K$. Thus, we use the relative entropy (RE) instead of the direct entropy to see how far each method deviates from the completely equal width property, in which case RE takes the value 1.0:

$\displaystyle\textit{RE}_{\textit{width}}(h_{K}(\cdot))=\frac{E_{\textit{width}}(h_{K}(\cdot))}{\log K}.$ (19)

The result of quantitative evaluation based on the relative entropy of Eq. (19) is shown in Fig. 2, where the horizontal and the vertical axes stand for the number of bins and the relative entropy $\textit{RE}_{\textit{width}}(h(\cdot))$, respectively, on the log-linear scale. We observe that the histograms constructed by both DM and VM methods have the desirable property because their scores are much larger than that of EA method. We note that VM method has larger entropy than DM method for all the $K$ ranges and for all the datasets except Bi-Unif, for which both are nearly the same in $K\leqslant 15$ with the score close to 1.0. The results imply that VM method is closer to EW method than DM method, meaning that the constructed histograms are closer to the equal bin-width histogram. Note that in the case of Uni-Unif all the four methods attain the highest score, as is ensured by the error minimization, i.e., all the histograms converge to the EW histogram when the data distribution is completely uniform (see footnote 3).

Footnote 3: Constructed histograms are not exactly the same because of the limited sample size (10,000 samples).

Similarly, in order to quantitatively confirm the equal area property, which means that the samples are not assigned to particular bins very unevenly, we compute the relative entropy for the number of samples in each bin defined by the following equation for the constructed histogram $h_{K}(\cdot)$ :

$\displaystyle\textit{RE}_{\textit{area}}(h_{K}(\cdot))=-\frac{\sum_{k=1}^{K}\frac{|{\cal X}_{k}|}{T}\log\frac{|{\cal X}_{k}|}{T}}{\log K}.$ (20)

The result of quantitative evaluation based on the relative entropy of Eq. (20) is shown in Fig. 3, where the horizontal and the vertical axes stand for the number of bins and the relative entropy $\textit{RE}_{\textit{area}}(h(\cdot))$, respectively, on the log-linear scale. We confirm that the histograms constructed by both DM and VM methods have the desirable property because their scores are much larger than that of EW method. Here we note that, differently from $\textit{RE}_{\textit{width}}(h(\cdot))$, DM method has larger entropy than VM method for all the $K$ ranges and for all datasets. This implies that DM method is closer to EA method than VM method, meaning that the constructed histograms are closer to the equal sample size histogram. For the same reason as before, all the methods attain the highest score 1.0 in the case of Uni-Unif.
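Both relative entropies share the same form and differ only in what is measured per bin: the widths $w_{k}$ for Eq. (19) and the sample counts $|{\cal X}_{k}|$ for Eq. (20). A minimal sketch follows; the function name and the toy bins are hypothetical, and empty parts are treated as contributing zero.

```python
import math


def relative_entropy(parts):
    """Entropy of the normalized parts divided by log K (requires K >= 2).

    Pass bin widths for RE_width (Eq. (19)) or bin counts for RE_area (Eq. (20)).
    """
    total = sum(parts)
    K = len(parts)
    entropy = -sum((p / total) * math.log(p / total) for p in parts if p > 0)
    return entropy / math.log(K)


# Hypothetical 4-bin histogram: equal widths but very uneven sample counts.
widths = [2.5, 2.5, 2.5, 2.5]
counts = [97, 1, 1, 1]
print("RE_width:", relative_entropy(widths))
print("RE_area :", relative_entropy(counts))
```

This toy histogram behaves like an EW histogram on skewed data: its width score is at the maximum while its area score is far below 1.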

Each proposed entropy is a good measure to assess the intended property. EW histogram has the lowest entropy for size (Eq. (20)) and EA histogram has the lowest entropy for width (Eq. (19)), while DM and VM histograms lie in between, with DM closer to EA and VM closer to EW. This holds for both the real and the synthetic data. In summary, DM and VM methods have both the equal width property and the equal-area property. DM method is closer to EA method and VM method is closer to EW method.

5.4 Evaluation of homogeneity property

Figure 4.

Evaluation of histogram based on mean absolute error.

Figure 5.

Evaluation of histogram based on mean squared error.

We want samples with similar values to be allocated into the same bin and report to what extent this is true. For this purpose we evaluate both the mean absolute error MAE from its median $\nu({\cal X}_{k})$ and the mean squared error MSE from its mean $\mu({\cal X}_{k})$ for samples in each bin ${\cal X}_{k}$ defined by the following equation for the constructed histogram $h_{K}(\cdot)$ :

$\displaystyle\textit{MAE}(h_{K}(\cdot))=\sum_{k=1}^{K}\frac{|{\cal X}_{k}|}{T}\sum_{x\in{\cal X}_{k}}\frac{1}{|{\cal X}_{k}|}|x-\nu({\cal X}_{k})|=\frac{1}{T}\sum_{k=1}^{K}\sum_{x\in{\cal X}_{k}}|x-\nu({\cal X}_{k})|,$ (21)

$\displaystyle\textit{MSE}(h_{K}(\cdot))=\sum_{k=1}^{K}\frac{|{\cal X}_{k}|}{T}\sum_{x\in{\cal X}_{k}}\frac{1}{|{\cal X}_{k}|}(x-\mu({\cal X}_{k}))^{2}=\frac{1}{T}\sum_{k=1}^{K}\sum_{x\in{\cal X}_{k}}(x-\mu({\cal X}_{k}))^{2}.$ (22)

Clearly, the errors $\textit{MAE}(h_{K}(\cdot))$ and $\textit{MSE}(h_{K}(\cdot))$ in Eqs (21) and (22) coincide with the objective functions of DM and VM methods, respectively, and the smaller these errors are, the more homogeneous each bin is. Figures 4 and 5 show the results of quantitative evaluation based on the normalized errors, $\textit{MAE}(h_{K}(\cdot))/\textit{MAE}(h_{1}(\cdot))$ and $\textit{MSE}(h_{K}(\cdot))/\textit{MSE}(h_{1}(\cdot))$, for DM, VM, EW, and EA methods, where $\textit{MAE}(h_{1}(\cdot))$ and $\textit{MSE}(h_{1}(\cdot))$ are the errors for the entire dataset when only one bin is used. By employing the normalized error, we can see how the error decreases on the same scale for all the datasets as the number of bins increases. In each figure, the horizontal and the vertical axes respectively show the number of bins and the error on the log-log scale.
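The normalized errors can be computed from any partition of the samples into bins, as in the following sketch. The function names and the toy partition are ours; `statistics.median` averages the two middle values for an even number of samples, which matches the definition of $\nu$ used in this paper.

```python
from statistics import mean, median


def mae(bins):
    """Eq. (21): mean absolute deviation from each bin's median."""
    T = sum(len(b) for b in bins)
    total = 0.0
    for b in bins:
        m = median(b)
        total += sum(abs(x - m) for x in b)
    return total / T


def mse(bins):
    """Eq. (22): mean squared deviation from each bin's mean."""
    T = sum(len(b) for b in bins)
    total = 0.0
    for b in bins:
        m = mean(b)
        total += sum((x - m) ** 2 for x in b)
    return total / T


# Hypothetical partition into K = 2 bins, and the single-bin baseline h_1.
bins = [[1.0, 2.0, 3.0], [10.0, 12.0]]
single = [[x for b in bins for x in b]]
print("normalized MAE:", mae(bins) / mae(single))
print("normalized MSE:", mse(bins) / mse(single))
```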

From these figures, we see that the error decreases as the number of bins increases, which we can expect because a larger number of bins implies more homogeneous samples in each bin for all the methods. Naturally, the best result for MAE is attained by DM method and for MSE by VM method, but the difference between DM and VM is small for both MAE and MSE. Overall, the errors in EW and EA are much larger than the errors in DM and VM for both MAE and MSE. EW is the worst for MAE and EA is the worst for MSE. This supports the observation that DM is closer to EA and VM is closer to EW. We further observe that the errors in EA are less sensitive to the number of bins than the others. Here again, we confirm that all four methods give the same result in the case of Uni-Unif. The most important observation is that, very roughly speaking, the number of bins that DM and VM methods need to attain the same error as EW and EA methods is about half in the range of $K$ we used in the experiments, except for the uniform distribution.

From this experiment, we confirm that the proposed method can construct variable bin-width histograms that represent the sample distribution much better than the standard equal width histogram and equal sample size histogram with the same number of bins when the sample distribution is not uniform. Said differently, the proposed method needs about half the number of bins of the conventional EW and EA methods to obtain histograms of the same performance. We will see this in the visualization results.

In the above discussion the error we considered is the total error summed over all bins. We did not assume a target homogeneity. If we do, we can evaluate binwise error and if there are bins that do not satisfy the target, we can increase the number of bins and reconstruct a histogram until all bins satisfy the requirement.

5.5 Appropriateness of visualized histograms and outlier detection

Here we report what the histograms constructed by the four methods look like, how the results in Sections 5.3 and 5.4 are realized in histograms, and how the embedded outlier detection works. To save space and sharpen our discussion, we focus only on the informative results from among all the combinations of the variables (real, synthetic unimodal, synthetic bimodal), the different numbers of bins, and the cases with and without outlier detection. We choose the results of $K=8$ and $K=16$, the latter being close to what Sturges’ formula suggests. We show the results both with and without outlier detection for the two variables HD and CO2 of the real datasets, which characterize most of the encountered situations, and the results of the bimodal synthetic datasets with outlier detection, which supplement the other details. The results of other combinations are mostly covered by the explanation of the results shown here. In the case of histograms with outlier detection, the detected outliers including the anomalous samples are removed from the dataset and drawn as circles, and the histograms without outliers are redrawn by adjusting the relevant boundaries as explained in Section 4.2.

Figure 6.

Histograms without outlier detection for HD.

Figure 7.

Histograms with outlier detection for HD.

Figure 8.

Histograms without outlier detection for CO2.

Figure 9.

Histograms with outlier detection for CO2.

We show the visualization results of histograms for the real datasets in Fig. 6 to Fig. 9. Each variable comes with two sets of results, one without outlier detection and the other with outlier detection, and they are arranged in pairs. HD has a dense region around HD $=5$ and a sparse region HD $>25$ (Fig. 6). Both DM and VM methods construct histograms that have more bins with a smaller width than EW method in the dense region and do the reverse for the sparse region. We see that for $K=8$ almost half of the bins are assigned in the region HD $<10$, and the number of bins in this range increases as $K$ gets larger ($K=16$). On the other hand, bins in the other range are nearly equally spaced except the rightmost bin where the samples are very sparse. EA method does the same but does so more radically. The resolution of EW method is a bit coarser in the dense range for $K=8$, but looks right for $K=16$, indicating the appropriateness of Sturges’ formula. As mentioned before, by looking at Figs 4 and 5, roughly speaking, the errors of the histograms constructed by DM and VM methods for $K=8$ are about the same as those constructed by EW and EA methods for $K=16$. This is true for the other variables, which implies that DM and VM methods need fewer bins than EW and EA methods to construct a histogram with the same performance (error). A histogram with fewer bins, each containing a sufficient number of samples, such as the DM histogram for $K=8$, is the result of both large entropy (Section 5.3) and small error (Section 5.4). We also note that samples close to HD $=30$ are sparse and result in a wide bin for DM, VM, and EA methods. Some of them should be detected as outliers, and Fig. 7 confirms that this is the case. The adjusted histograms look more reasonable. Interestingly, no outliers are detected by EW method, and thus there is no change in its histogram.

CO2 has a dense region around CO2 $=250$ and sparse regions for CO2 $<200$ and CO2 $>500$ (Fig. 8). We observe the same tendency as for HD. Resolution becomes higher for the dense region as $K$ becomes larger. There are wide bins on both ends where the data are sparse for DM, VM, and EA methods, except for VM method with $K=16$. But again, as Fig. 9 indicates, the majority of the samples in these sparse regions are detected as outliers and removed from the data. After the boundaries are adjusted accordingly, the resulting histograms no longer have such misleading bins and look very reasonable. As is the case for HD, outlier detection does not work well for EW method. We note that VM method with $K=16$ failed to detect outliers in the rightmost tail because the constructed bin happened not to be wide enough in this region.

In summary, the results obtained by the real data confirm that the proposed DM and VM methods work as intended. The drawback of EW method is clear. Using fixed-size bins, i.e., same resolution throughout the whole bins, has problems. Increasing $K$ is not the answer. Resolution in densely/sparsely distributed regions must be increased/reduced. EA method constructs histograms that have this property but does so more aggressively than desired, e.g., the variation of the height and the bin width is very large and the tail tends to be very long. It has an intrinsic problem of being unable to generate the correct bins. DM and VM methods are the answer. In all the results, as the number of bins is increased, the dense region tends to be divided into bins with smaller widths preferentially, and the sparse region tends to remain undivided. These experimental results indicate that we can successfully visualize the distribution of numeric values as a histogram consisting of high and coarse resolutions with a smaller number of bins by virtue of individual variable bin-widths. Outlier detection works as intended for both, but DM method performs slightly better than VM method.

Figure 10.

Histograms with outlier detection for Bi-Unif.

Figure 11.

Histograms with outlier detection for Bi-Expo.

Figure 12.

Histograms with outlier detection for Bi-Norm.

Next, we report the results of the synthetic datasets. Unimodal datasets are easy to analyze and interpret. What we observed in the histograms of the real datasets appears in a more understandable way, and the histograms of the bimodal datasets have features similar to those of the unimodal datasets. Thus, we focus on the bimodal datasets. The results are shown in Fig. 10 to Fig. 12. The bimodal datasets have an anomalous data sample in the middle. We expect that DM and VM methods can successfully detect this anomaly in the middle as well as the outliers in the tails.

Table 1

Annotation terms of each bin ( $K=$ 8)

Rank	DM method			VM method
	Bin	Term	$z$ -score	Bin	Term	$z$ -score
1	1	Humi80-90	72.68	1	Humi80-90	81.56
2	5	Humi50-60	66.80	2	Humi70-80	67.06
3	8	Humi20-30	59.68	4	Humi50-60	58.22
4	8	Temp30-35	59.14	7	Temp30-35	53.66
5	4	Humi60-70	52.80	3	Humi60-70	49.61
6	6	Humi40-50	52.23	6	Humi40-50	48.14
7	3	Humi70-80	50.33	8	Humi20-30	46.71
8	8	Humi30-40	45.95	7	Humi30-40	46.55
9	5	Temp25-30	42.78	2	Temp20-25	43.21
10	7	Humi40-50	40.68	5	Humi40-50	42.46
11	1	Hour18-21	40.51	4	Temp25-30	41.82
12	7	Temp30-35	39.05	7	Humi20-30	39.23
13	6	Temp25-30	38.19	5	Temp25-30	38.42
14	7	Humi30-40	37.95	8	Temp35-40	37.72
15	2	Temp20-25	36.37	1	Hour18-21	37.32
16	2	Humi70-80	33.70	8	Temp30-35	36.07
17	4	Temp25-30	31.54	5	Humi50-60	35.56
18	1	Humi90-100	30.19	6	Humi30-40	31.81
19	7	Hour9-12	29.42	6	Temp30-35	29.63
20	3	Temp20-25	29.30	3	Temp25-30	28.02
Rank	EW method			EA method
	Bin	Term	$z$ -score	Bin	Term	$z$ -score
1	8	Temp35-40	70.99	8	Temp30-35	66.83
2	6	Temp30-35	53.44	7	Humi40-50	64.84
3	3	Humi50-60	50.34	1	Humi80-90	62.19
4	5	Humi40-50	49.65	8	Humi20-30	55.38
5	1	Temp20-25	47.53	8	Humi30-40	54.60
6	1	Humi80-90	46.74	6	Temp25-30	53.48
7	6	Humi20-30	46.08	6	Humi50-60	52.09
8	6	Humi30-40	46.00	2	Humi80-90	48.57
9	3	Temp25-30	43.90	4	Humi70-80	46.97
10	4	Humi40-50	42.02	3	Humi70-80	45.89
11	4	Temp25-30	39.79	6	Humi60-70	45.75
12	2	Humi60-70	39.37	1	Hour18-21	39.57
13	4	Humi50-60	38.67	7	Temp25-30	37.97
14	7	Humi20-30	37.67	1	Humi90-100	35.41
15	5	Temp30-35	35.08	5	Humi60-70	31.62
16	3	Humi60-70	34.84	8	Hour9-12	31.42
17	5	Humi30-40	34.62	2	Temp20-25	28.46
18	7	Temp30-35	30.77	3	Temp20-25	28.07
19	5	Hour9-12	29.19	4	Temp20-25	27.96
20	2	Humi70-80	29.08	8	Hour12-15	27.84

Figure 10 shows the histograms for the Bi-Unif dataset. As predicted, all the methods construct histograms that contain almost the same number of samples in each bin for the two regions where we have data, regardless of the number of bins $K$. The histograms are not completely flat because each region is a finite sample (10,000 samples per region). All the methods result in histograms that represent the data distribution well. DM, VM, and EA methods detect the anomaly, but EW method fails to detect it. This is simply because outlier detection by the interquartile range method does not work for a bin with only one data sample. No method detects any outliers for the bimodal uniform distributions, which is correct because there are no tails.

Figure 11 shows the histograms for the Bi-Expo dataset. DM, VM, and EA methods successfully detect both the outliers and the anomaly, except for VM method with $K=16$. The same failure occurred with the real CO2 data. We suspect that VM method is more likely than DM method to miss outliers because the mean is more sensitive to outliers than the median, which affects the bin width in the tail, and that it happened to miss the outliers in this particular condition of $K=16$. Once the detected outliers and the anomaly are removed from the data, the resulting histograms better represent the bimodal data distribution. EW method fails to detect both the outliers and the anomaly for the same reason as before.

Figure 12 shows the histograms for the Bi-Norm dataset. The resolution around the center of each region is high because more, narrower bins are placed there. DM, VM, and EA methods necessarily have a wide bin at both ends, as in the case of CO2 (Fig. 8), but the samples in these tails are easily detected as outliers and removed from the data by all three methods. This time, DM method for $K=16$ and EW method for $K=8$ detect both the outliers and the anomaly successfully.
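The per-bin interquartile range test discussed above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function name `iqr_outliers_per_bin`, the conventional multiplier of 1.5, and the rule of skipping bins with fewer than two samples are assumptions made for illustration.

```python
import numpy as np

def iqr_outliers_per_bin(x, edges, whisker=1.5):
    """Flag samples outside [Q1 - w*IQR, Q3 + w*IQR] of their own bin.

    x      : 1-D array of sample values
    edges  : sorted bin boundaries, length K + 1 (assumed input format)
    whisker: fence multiplier (1.5 is the usual convention, assumed here)
    """
    x = np.asarray(x, dtype=float)
    # Use interior boundaries only, so that bin indices run from 0 to K - 1.
    idx = np.digitize(x, edges[1:-1])
    flags = np.zeros(x.shape, dtype=bool)
    for k in np.unique(idx):
        members = np.flatnonzero(idx == k)
        if members.size < 2:
            # A bin holding a single sample has no spread to compare against,
            # so nothing in it can be flagged.
            continue
        q1, q3 = np.percentile(x[members], [25, 75])
        iqr = q3 - q1
        lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
        flags[members] = (x[members] < lower) | (x[members] > upper)
    return flags
```

In this sketch, an isolated sample is flagged only when it shares a bin with enough other samples to define the quartiles. When the bin boundaries isolate it in a bin of its own, as happens with equal-width bins on the bimodal data, the test has nothing to compare it with.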

In summary, the results obtained with the synthetic datasets reconfirm the results obtained with the real datasets. We conclude that both DM and VM methods retain the good properties of the existing EW and EA methods and can visualize the data distribution by a variable bin-width histogram using fewer bins. DM method is slightly better suited for outlier detection than VM method because the median is less sensitive to outliers than the mean.
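The difference in sensitivity between the mean and the median can be seen in a small numerical example. The values below are made up for illustration and are not taken from any dataset in this paper.

```python
import numpy as np

# Hypothetical samples in one tail bin, before and after one extreme value is added.
bin_samples = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(bin_samples, 100.0)

# Shift of each centroid caused by the single extreme value.
mean_shift = abs(with_outlier.mean() - bin_samples.mean())
median_shift = abs(np.median(with_outlier) - np.median(bin_samples))

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

The mean moves far toward the extreme value while the median barely changes, which is consistent with the suspicion that a mean-based centroid lets tail samples influence the bin boundaries more strongly.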

5.6 Evaluation of bin annotation

In the final experiment, we characterize the constituent samples of each bin in the histogram and evaluate whether the selected terms are appropriate for this purpose. We show this only for the HD dataset, for which we know the respective categorical data. Each sample comes with $I$ nominal variables, the $i$-th of which takes one of $J^{(i)}$ categories, and is placed in one of $K$ bins. We calculate $z_{k}^{(i,j)}$ for all the 3-tuples consisting of the $k$-th bin, the $i$-th variable, and its category $j$, and show the top 20 tuples in Table 1 for the case of $K=8$. The value in the Bin column of this table is the index of the bin, numbered from the left (lowest) of the histogram.

A $z$-score greater than 2 means that the number of samples with the corresponding term in the bin is more than 2 standard deviations above its mean for that bin. We see from Table 1 that even the $z$-score of the 20th ranked annotation term is larger than 25, far greater than 2. This means that each bin of the histogram contains a statistically significant number of samples with the characteristic nominal values. This is true for all the methods. Since there are 8 bins, annotating them with all 20 terms gives 2.5 terms per bin on average. Almost all bins have more than one annotation term from the top 20 candidates, mostly 2 or 3 per bin as expected. In each method, at most 1 bin has a single annotation term and at most 1 bin has 4 or 5 annotation terms. There is no bin without any term assigned from the top 20 candidates. Thus, at first glance, there appear to be no differences among the methods, but a more detailed analysis reveals the following.
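The ranking in Table 1 can be illustrated with a minimal sketch of a per-bin $z$-score. The exact definition of $z_{k}^{(i,j)}$ is given earlier in the paper. The code below assumes a binomial null model in which a category that accounts for a fraction $p$ of all samples is expected to appear $n_k p$ times in a bin of $n_k$ samples, with standard deviation $\sqrt{n_k p (1-p)}$. The function names and this null model are assumptions, not the authors' code.

```python
import math
from collections import Counter

def annotation_z_scores(bin_ids, categories):
    """z-score of each (bin, category) count under an assumed binomial null model.

    bin_ids   : bin index of every sample
    categories: category label of every sample for ONE nominal variable
    """
    total = len(bin_ids)
    bin_sizes = Counter(bin_ids)
    cat_totals = Counter(categories)
    joint = Counter(zip(bin_ids, categories))
    scores = {}
    for (k, c), n_kc in joint.items():
        p = cat_totals[c] / total            # overall share of category c
        expected = bin_sizes[k] * p          # expected count in bin k
        sd = math.sqrt(bin_sizes[k] * p * (1.0 - p))
        if sd > 0:
            scores[(k, c)] = (n_kc - expected) / sd
    return scores

def top_terms(scores, n=20):
    """Return the n (bin, category) pairs with the largest z-scores."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

With several nominal variables, as in the HD dataset, the function would be called once per variable and the resulting scores pooled before ranking. Only pairs that actually co-occur are scored, which is sufficient when only the highest positive scores are of interest.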

We note that, from the definition of HD in Eq. (17), HD is negatively correlated with Humi, is more sensitive to Humi than to Temp, and is independent of Hour. It is thus natural that the majority of the terms are Humi terms. If we pick the Humi terms and see how they are distributed with respect to the bin ordering, i.e., from the 1st bin to the 8th bin, we notice that they are arranged in descending order almost perfectly for all the methods. The exceptions are the 8th bin in EW method, where no Humi term is annotated, and the 1st bin in VM and EW methods, where Humi90-100 (the highest humidity) is not annotated. EW method misses the annotation in the 8th bin because no meaningful statistical test can be made: the number of samples in this bin is very small, as is evident in Fig. 6(a). Similarly, EW method misses Humi90-100 in the 1st bin because this bin contains so many samples that the Humi90-100 samples are buried among them, and the statistical test did not place the term in the top 20 list. All the terms from Humi20-30 to Humi90-100 are annotated appropriately for bins in DM and EA methods. As we mentioned before, EA method has the inherent problem of possible undrawn bins. Considering all these, we conclude that the terms selected for annotation by both DM and VM methods are interpretable and reasonable. VM method is slightly worse than DM method. This can be explained by the observation in Section 5.3, i.e., VM method is closer to EW method.

6. Conclusion

We addressed the problem of constructing histograms with variable bin widths by viewing histogram construction as a kind of one-dimensional clustering problem and solving it as an error minimization problem. The total error is defined as the double sum, over all the bins and over all the samples in each bin, of the difference between each sample value and the centroid value of its bin. We used two error criteria: the absolute error and the squared error. The absolute error criterion forces the centroid to be the sample median, and the squared error criterion forces it to be the sample mean. Optimal bin boundaries are obtained by an efficient greedy local improvement search that was developed to solve the change point detection problem. This naturally results in a set of bins, each containing as many homogeneous samples as possible. This formulation ensures that the optimal solution converges to equal-width bins and equal-sample-size bins when the sample distribution is completely uniform and homogeneous. Thus, the constructed variable bin-width histogram retains the good properties of the standard equal-width and equal-area histograms when the data distribution is smooth, and it constructs narrow bins for dense regions and wide bins for sparse regions. We used the constructed histogram to detect and visualize anomalies and outliers hidden in the dataset. For this, we applied the interquartile range method, which does not assume the type of data distribution, to the data points within each bin of the histogram. In this paper, we dealt with univariate data. When the data are multivariate, we first compute a measure (score) of anomality and outlierness for each data point and construct a histogram for this measure, to which the same method can be applied. We further proposed a method to annotate the constructed bins for samples that come with annotation data in the form of a set of nominal variables. For this, we used the $z$-score with respect to the distribution of the nominal data within each bin and assigned the top few terms to characterize each bin.
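The two error criteria summarized above can be made concrete with a short sketch that evaluates the total error of a given set of bin boundaries. It only scores a candidate partition and does not perform the boundary search itself. The function name `total_error` and the representation of the boundaries as a sorted list of $K+1$ edges are assumptions for illustration.

```python
import numpy as np

def total_error(x, edges, criterion="absolute"):
    """Sum of within-bin errors for a given set of bin boundaries.

    criterion="absolute": deviation from the bin median (DM-style error)
    criterion="squared" : squared deviation from the bin mean (VM-style error)
    """
    if criterion not in ("absolute", "squared"):
        raise ValueError("criterion must be 'absolute' or 'squared'")
    x = np.asarray(x, dtype=float)
    # Interior boundaries only, so that bin indices run from 0 to K - 1.
    idx = np.digitize(x, edges[1:-1])
    err = 0.0
    for k in np.unique(idx):
        members = x[idx == k]
        if criterion == "absolute":
            err += np.abs(members - np.median(members)).sum()
        else:
            err += ((members - members.mean()) ** 2).sum()
    return err
```

For two well-separated groups of samples, a boundary placed in the gap between them yields a smaller total error under either criterion than a boundary that cuts through one group, which is the behavior that the minimization exploits.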

We applied our method to real datasets of humidity deficit and carbon dioxide concentration, collected from a vinyl greenhouse in operation, as well as to two different sets of three synthetic datasets, with each dataset generated from one of three distributions (uniform, exponential, and normal). Each dataset in the first set uses a single distribution without anomalous data, and each dataset in the second set uses two well-separated distributions of the same kind with an anomalous data point in the middle. We experimentally confirmed the following. 1) Both DM and VM methods can construct histograms efficiently with a computational complexity of $O(NKT)$, where $K$ is the number of bins, $N$ is the number of iterations to search for each bin boundary, and $T$ is the number of data points in the dataset; in practice, DM method runs about three times faster than VM method. 2) Both DM and VM methods can construct histograms with appropriate variable bin widths, consisting of a large number of narrow bins where the samples are densely distributed and change drastically and a small number of wide bins where the samples are sparsely distributed, while avoiding bins with too small a width or too small a sample size. 3) Both DM and VM methods construct histograms in which each bin consists of samples with similar values, i.e., both methods produce more homogeneous bins than either EW or EA method alone, while behaving similarly to the histograms constructed by EW and EA methods where the changes in sample distribution are moderate. 4) Both DM and VM methods require fewer bins than EW and EA methods (roughly half for the same error) and capture the sample distributions appropriately by virtue of the variable bin width. 5) Both DM and VM methods can detect anomalies and outliers more reliably and accurately than EW and EA methods. 6) Both DM and VM methods construct histograms that are appropriately annotated, i.e., the selected annotation terms are interpretable and reasonable. 7) There are differences between the histograms constructed by DM and VM methods: DM method is closer to EA method and VM method is closer to EW method, and DM method is slightly better suited than VM method for outlier detection and bin annotation.

As future work, we plan to conduct more experiments to examine whether the proposed DM method performs well for various types of datasets in other domains, e.g., the educational field, including those with multivariate data, to compare its performance with EW, EA, and VM methods, and to investigate in which situations DM method works better or worse than VM method. There are many factors to consider. These include whether domain knowledge helps improve the outlier/anomaly detection capability and how noise and missing values in the data samples affect the final histogram. We also plan to extend the method so that it can be applied to continuously arriving streaming data. Further theoretical study to find the optimal number of bins for the variable bin-width histogram is another important direction for future work.

Footnotes

Acknowledgments

This work was partly supported by JSPS Grant-in-Aid for Scientific Research (C) (No. 18K11441). We thank Mr. Kiyoto Iwasaki of the Industrial Research Institute of Shizuoka Prefecture, and Prof. Seiya Okubo of the University of Shizuoka for providing and consolidating the agricultural environmental datasets.


¹ We occasionally omit anomaly to indicate both for the sake of simplicity, e.g., outlier-free instead of outlier/anomaly-free.

² Microsoft EXCEL uses this strategy.