Analyzing concept drift: A case study in the financial sector

Abstract

In this paper, we present a method for exploratory data analysis of streaming data based on probabilistic graphical models (latent variable models). This method is illustrated by concept drift tracking, using financial client data from a European regional bank. For this particular setting, the analyzed data spans the period from April 2007 to March 2014 and therefore starts before the beginning of the financial crisis of 2008. The implied changes in the economic climate during this period manifest themselves as concept drift in the underlying data generating distribution. We explore and analyze this financial client data using a probabilistic graphical modeling framework that provides an explicit representation of concept drift as an integral part of the model. We show how learning these types of models from data provides additional insight into the hidden mechanisms governing the drift in the domain. We present an iterative approach for identifying disparate factors that jointly account for the drift in the domain. This includes a semantic characterization of one of the main influencing drift factors. Based on the experiences and results obtained from analyzing the financial data, we discuss the applicability of the framework within a more general context.

Keywords

Concept drift, latent variable models, financial data

1. Introduction

Performing data analysis in a streaming context raises several important issues that are often less pronounced when conducting batch data analysis. In particular, the instances in a data stream can often not be assumed independent, and when the data stream exhibits concept drift the underlying data generating distribution may change over time [10]. If the concept drift inherent in the domain is not carefully taken into account, the result can be a deterioration of accuracy when doing classification or, more generally, failure to capture and interpret intrinsic properties of the data during data exploration.

In collaboration with a European regional bank (Banco de Crédito Cooperativo, BCC), we have been conducting data analysis over a subset of their clients based on client-specific financial information captured during the period from April 2007 to March 2014. Specifically, the focus has been on real-time analysis, detection, and interpretation of financial changes during this period. Particular attention has been given to two groups of clients, defined by whether or not they will default on their financial obligations within the following 12 months.

The period during which data has been collected starts before the beginning of the financial crisis, hence the general economic climate exhibits changes during the collection period. This is also directly reflected in the client data, where we, for instance, see the drifts in the average monthly account balances as illustrated in Fig. 1d. Note that the drift is more pronounced for the defaulting clients than for the non-defaulting clients and that we also see a slightly inverse trend for the two client groups. This is the reason why we included defaulting information in the analysis. Another example of drift can be seen for the unpaid amount in mortgages for the two groups as shown in Fig. 1e.

Figure 1.

Mean evolution of all predictive variables for defaulting and non-defaulting clients (monthly aggregated). The ranges on the y-axes, both here and in successive figures, have been deliberately removed for confidentiality reasons.

Generally, when comparing these two financial indicators, we see that they exhibit different types of concept drift, and that a common/global contextual cause is not immediately apparent from the data. We would like, therefore, to go beyond this immediate analysis and instead consider ‘broader’ types of concept drift that are less variable-specific and which influence and govern several key financial indicators simultaneously and across client groups. We thus adopt the general definition of concept drift from [10], where concept drift is defined as the existence of two consecutive time points for which the joint distribution over the domain variables differs. Since this definition does not rely on a designated target variable, we can position the problem as unsupervised concept drift detection and analysis [19].

In this paper, we further explore and extend a recently proposed model for capturing concept drift [3]. The model proposed by [3] is based on probabilistic graphical models and provides a principled approach for capturing concept drift by letting the drift be encoded explicitly within the model class. There are several advantages to this approach. First of all, the model does not rely on any of the standard techniques to deal with concept drift, see [10] and the references therein. Such techniques, including external concept drift detectors that, e.g., would use changes in classification accuracy to imply a form of supervised concept drift, often require information that only becomes available after a certain time delay [9, 24] (such as the true class labels). Secondly, by letting concept drift be an explicit and integral part of the modeling framework, we have added support for semantic interpretations of potential drifts. In particular, concept drift can immediately be linked to selected model components enabling an analysis of how concept drift affects different parts of the model. The proposed method therefore not only provides an unsupervised way of detecting concept drift, but it may also enable a more systematic analysis of the (local) domain-specific factors that drive concept drift; this type of insight is not immediately provided when, e.g., involving external concept drift detectors.

In comparison with the analysis conducted in [3], the contributions of the present paper include an extension of the model class of [3] that enables a more fine-grained concept drift analysis targeting individual variables. Furthermore, we propose an iterative approach for identifying disparate factors that jointly account for the drifts in the domain. We demonstrate the use of the proposed methods based on the financial data set supplied by BCC. The results of the analysis include the identification and semantic characterization of one of the key factors governing concept drift for this particular domain. Lastly, we discuss the proposed modeling framework in a more general setting, linking model validation and concept drift analysis. The proposed methods are released as part of an open-source toolbox for scalable probabilistic machine learning (http://www.amidsttoolbox.com) [17, 18, 5].

The remainder of the paper is organized as follows. In Section 2 we give an overview of other related works. In Section 3 we provide a detailed description and analysis of the data set that is used in the study. Section 4 discusses and analyzes the modeling framework introduced by [3], on which the concept drift analysis presented in Section 5 is based. In Section 6 we position our analysis within a more general context, providing a critical discussion of the limitations of our procedure as well as possible extensions to the framework and open research questions. Lastly, we give some concluding remarks in Section 7.

2. Related works

The topic of concept drift is referred to in the literature under different terms (e.g., population drift, concept shift), which sometimes denote similar problems with different nuances. It can refer to a change in the probability distributions of some observed variables, or of some latent parameter or hyperparameter, or even to a change in the problem itself (for instance, when there is a redefinition of the class labels in a classification problem). Several different approaches have been studied in recent years. We briefly review some of them.

In [14], the authors analyze the problem of making predictions in the future when there is a change in the population distributions, from the point of view of the classification performance. To do so, a time series model (linear or generalized linear model) is proposed. In [7], the authors tackle a similar problem. They try to detect changes in the probability distribution over the test set, as compared to the validation set, finding what they call fracture points. Their technique is based on detecting bias using statistical tests, and they evaluate it using various datasets. Another related study is [11], in which the problem of detecting the deterioration of a classifier’s performance is addressed. This method assumes that the class distribution changes from one time point to another, but that the conditional distributions remain the same. The approach is tested on a real-world credit scoring dataset.

The work in [6] focuses on the financial domain, in particular on the probability of default. The authors use hypothesis tests to determine if there has been a drift in the predictive accuracy. The final decision is made by comparing the $p$-values obtained from the tests against fixed thresholds. A similar approach is taken in [4], in which the shift detection is based on whether or not the covariates of the model belong to a certain confidence zone.

In [16], the authors propose a technique for detecting shifts in churn models using an information-theoretic metric called the stability index, closely related to the Kullback-Leibler divergence. The decision on the presence of shifts is based on this metric and a threshold. Finally, [20] is a thorough review of the literature on concept drift and related problems. In it, the authors try to provide unifying definitions and concepts, and to clarify the different approaches that have been proposed.

Several of these attempts [6, 7] rely on statistical tests to decide whether there are significant differences between the expected and the actual performance of the model when making predictions. This can be problematic for large datasets, in a streaming context or if any underlying assumption fails. In some other approaches [6, 14, 16], the proposed methodology is able to give some alert when there are significant drifts in the model (often called backtesting), but the model is not updated accordingly.

The approach that we propose in this paper does not rely on the presence of class labels, as all the previous methods do, and it is able to detect different kinds of changes in the data distribution, or concept drifts. Moreover, it deals naturally with streaming data and allows the model to be updated immediately when new data become available.

3. Description of the data

The data set, provided by BCC, contains monthly aggregated information for a set of BCC clients for the period from April 2007 to March 2014. Only “active” clients are considered, meaning that we restrict our attention to individuals between 18 and 65 years of age, who have at least one automatic bill payment or direct debit in their accounts. To make the data set as homogeneous as possible, we only retain clients residing in the region of Almería (a mainly agricultural area in the south of Spain), and exclude BCC employees, since they have special banking conditions. We reduced the resulting data set so that it only includes 50 000 clients each month.

Up until December 2010 there are some clients that only become active every six months (due to periodic fees). From December 2010 and until the end of the period this pattern appears every three months [3]. The particular clients involved vary, and removing them from the overall data set is therefore not feasible. Instead, and in order to avoid the seasonal peaks produced by these known patterns, we remove the affected 21 months.¹

¹ The results of the experiments in this paper are practically the same if we consider the peak months, except that the results are noisier around these months [3].

Assisted by BCC’s experts, we extracted six variables from the resulting data set that encode monthly aggregated information, and which collectively describe the financial status of a client. Figure 1 shows the evolution of these variables for both defaulting and non-defaulting clients throughout the period. We note that some clients may have missing values for some of the variables for a given month (e.g., because a client was not active during that particular month). However, the generative nature of the models we employ (detailed in the following section) ensures that these missing values are naturally handled within the model and do not need to be treated separately. Finally, each client also has an associated class variable, which indicates if that particular client will default during the following 12 months.

If we take a closer look at the attributes, we observe several characteristics that could further challenge the modeling process. Figure 2 shows the histograms for two of the variables. The first thing we notice is the high density of zeros, but also the long tails of the distributions. The latter result in large variances for most of the attributes.

Figure 2.

Histograms for two of the predictive attributes. Ranges in the x-axes have been deliberately removed for confidentiality reasons.

When also considering the evolution of the means in Fig. 1 we can see at least two tendencies in the data: a general trend of gradual and monotonic movement, and seasonal changes that usually peak at the end of the year. Thus, the data set appears to exhibit different types of concept drift.

The drift in the data indicates the need for a model that takes these changes into account. More specifically, we are interested in a simple density estimator that is able to detect the different tendencies over multiple attributes simultaneously while also maintaining the defaulting/non-defaulting distinction of the clients. The following section introduces and discusses the particular model type we have used for the analysis.

4. Modeling concept drift

Concept drift detection and adaptation have typically been considered within a supervised learning context, where changes in model accuracy are seen as an indication of concept drift [10]. This means that concept drift detection is closely linked to a specific prediction task, which may be too restrictive for an exploratory data analysis setting. For example, labeled streaming data is needed in order to evaluate changes in classification accuracy, but these labels often come with a (significant) time delay. In the financial setting described in Section 3, if the future defaulting status of a client is considered as the class variable, then the true class label will only be revealed after a delay of twelve months.

Instead we consider the framework for detecting and analyzing concept drift presented by [3], in which the key idea is to explicitly represent concept drift as an integral part of the model definition without relying on a designated target variable. Thus, the framework considers concept drift as the existence of an instance $\bm{x}$ and two consecutive time points $t_{0}$ and $t_{1}$ such that $p_{t_{0}}(\bm{x})\neq p_{t_{1}}(\bm{x})$ , where $p_{t}(\cdot)$ denotes the density of $\bm{x}$ at time $t$ . Note that this type of concept drift modeling can be considered unsupervised. The concept in this context refers to a joint distribution over the class and predictive variables, but with the class being treated as a normal random variable. Incidentally, this is also what we will refer to as global concept drift, as opposed to local concept drift, that captures concept drift happening at the level of a single variable in the model. A more thorough discussion about the type of concept drift that this framework is able to detect is given in Section 6.

The modeling framework proposed by [3] is illustrated using plate notation in Fig. 3 and can be seen as a special type of probabilistic graphical model [15]. In the figure, $(Y_{i,t},\mathbf{X}_{i,t})$ describes the behavior of client $i$ at time $t$, where $Y_{i,t}$ represents the defaulting status of client $i$ at time $t$ and $\mathbf{X}_{i,t}$ are the financial indicator variables describing the client.²

² We explicitly represent the class variable in this context as it is a requirement of the BCC experts to have a clear distinction between defaulter and non-defaulter clients. Since the percentage of defaulter clients is small, this ensures that this group of clients is modeled separately. We stress again, however, that our general concept drift model does not rely on classification accuracy to detect concept drift.

The distributions of $Y_{i,t}$ and $\bm{X}_{i,t}$ are parameterized using the parameters $\theta_{y}$ and $\theta_{x}$, respectively. Concept drift is captured in the model through the latent variables $H_{1},H_{2},\ldots,H_{t},\ldots$, which are “shared” across clients and indicator variables. Intuitively, when learning from data subject to concept drift, the model responds by “tweaking” $H_{t}$ over time, thereby using this sequence of latent variables to aggregate the concept drift of each variable to a “model-global” level. To enforce a smooth drift-model, the conditional distribution $H_{t}|\{H_{t-1}=h_{t-1}\}$ is defined as a random walk with variance $\theta_{h}$, which has a priori been given preference to smaller movements over time. Note again that the emphasis here is on the concept drift component and not on the specification of an accurate model; hence the simplistic model structure for $(Y_{i,t},\mathbf{X}_{i,t})$. The specific parametric families employed by the modeling framework for analyzing this data set are presented in Section 5, where the general model class will be instantiated to the financial data set.

Figure 3.

Model of concept drift [3]. In this model structure it is assumed that $X_{i,j,t}\bot X_{i,k,t}\,|\,Y_{i,t}$ ; $\forall t\in\{1,\ldots,T\}$ , $\forall j,k\in\{1,\ldots,n\}$ , where $n$ is the number of attributes.

The overall framework is positioned in the Bayesian paradigm, where both parameters and unobserved variables ( $H_{t}$ , $\theta_{x}$ , $\theta_{y}$ , $\theta_{h}$ as well as missing data observations) are treated as random variables in the model. For the data up to and including time $T$ , ${\cal D}_{1:T}$ , inference amounts to calculating the distribution over the variables of interest given ${\cal D}_{1:T}$ , most notably $p(H_{T}|{\cal D}_{1:T})$ .

Bayesian inference is in general NP-hard [8] and for the type of hybrid dynamic models considered in this paper (detailed in the following section) exact inference is intractable. Thus, we resort to approximate inference/learning based on a variational Bayes inference engine [13, 1]. Variational Bayes can be seen as a gradient ascent algorithm, and when constraining the (conditional) distributions to be members of the conjugate exponential family, it can be implemented through variational message passing (VMP) [23].

There are several benefits of this approach, including i) having concept drift as an integral part of a holistic model; ii) concept drift is explicitly represented and therefore open for investigation; iii) immediate model validation; iv) good fit in terms of marginal log-likelihood to complex data, even for this rather parsimonious model.

A proof of concept of this modeling framework can be found in Appendix B, where we analyze two synthetic data sets widely employed as benchmarks in the concept drift literature. This analysis verifies the applicability of this framework for modeling concept drift beyond the financial domain, which is the focus of this work.

5. Analyzing concept drift with hidden variables

In this section, we detail the instantiation of the general methodology presented in Section 4 with respect to our financial data set in order to analyze the trend in the evolution of the financial profile of the clients. For this, we use a publicly available toolbox,³ called the AMIDST Toolbox.

³ The code and models used in this paper can be downloaded from the AMIDST Toolbox webpage (through its GitHub repository): www.amidsttoolbox.com.

This toolbox is open source and gives access to a modeling language, where models can be described and combined with inference procedures that support Bayesian learning of the model parameters. Moreover, since the data setup is of a streaming nature, scalability is an important feature of the toolbox. A streaming data set is potentially unbounded, thus inference amounts to doing filtering (also known as the forward pass in dynamic model inference). This means that for any $t$, only the data ${\cal D}_{1:t}$ will influence the posterior estimate of $H_{t}$; observations at, e.g., $t+1$, will not be taken into account.
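The filtering idea can be illustrated for a single attribute and a single client group. The following is a minimal sketch under strong simplifying assumptions that are not the setup of this paper: the intercept, slope, and both variances are treated as known constants, so the posterior over the hidden variable is exactly Gaussian and can be computed with a Kalman-style recursion. The paper instead learns all quantities jointly with variational message passing in the AMIDST Toolbox. The function name `filter_hidden` and all of its arguments are hypothetical.

```python
def filter_hidden(data, alpha, beta, noise_var, walk_var, m0=0.0, v0=1e6):
    """Forward (filtering) pass for a scalar random-walk hidden variable.

    Assumed model (illustrative only): x_{i,t} = alpha + beta * H_t + eps,
    eps ~ N(0, noise_var), H_t ~ N(H_{t-1}, walk_var), with alpha, beta and
    both variances held fixed. data[t] lists the values observed at time t;
    missing values are encoded as None.
    Returns the means and variances of p(H_t | D_{1:t}) for every t.
    """
    m, v = m0, v0
    means, variances = [], []
    for row in data:
        # Predict: the random-walk transition keeps the mean and inflates the variance.
        v_pred = v + walk_var
        # Missing entries contribute no likelihood term, so they are simply skipped.
        obs = [x for x in row if x is not None]
        # Update: Gaussian prior combined with len(obs) Gaussian likelihood terms.
        precision = 1.0 / v_pred + len(obs) * beta ** 2 / noise_var
        v = 1.0 / precision
        m = v * (m / v_pred + beta * sum(x - alpha for x in obs) / noise_var)
        means.append(m)
        variances.append(v)
    return means, variances


# Toy usage: noise-free observations generated with H = 0.0, 0.5, 1.0.
toy = [[1.0 + 2.0 * h] * 5 for h in (0.0, 0.5, 1.0)]
toy_means, toy_vars = filter_hidden(toy, alpha=1.0, beta=2.0, noise_var=0.01, walk_var=1.0)
```

Because the loop only ever reads `data[0]` to `data[t]` before emitting the estimate for time `t`, running it on a prefix of the stream yields the same estimates for that prefix, which is the filtering property stated above.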

We consider two general types of models. First, we explore a model containing a hidden variable $H_{j,t}$ for each attribute $X_{j,t}$ with the purpose of analyzing the drift behavior of the different features independently. Secondly, we use a single hidden variable for all features $\bm{X}_{t}$ to capture more global types of concept drift. We subsequently analyze the residuals produced by the global model to identify other factors that jointly account for the drift.

5.1 Local hidden variables

As a first step we employ a variant of the models presented in Section 4 to track concept drift of individual attributes. In this way, we examine whether or not the six attributes in our data set exhibit different drift behavior. This simpler setting also allows us to better illustrate how our approach captures the general trend in the data over time.

We make a simple instantiation of the general framework, where each attribute is linearly dependent on a local hidden variable, which enables the use of efficient learning algorithms. More complex (non-linear) dependencies could eventually be used by employing alternative parametrizations of the conditional probability distributions, at the cost of having to use more complex learning algorithms.

More precisely, we use a concept drift model with a hidden variable $H_{j,t}$ for each attribute. This model can be expressed as follows:

$\displaystyle x^{+}_{i,j,t}=\alpha^{+}_{j}+\beta^{+}_{j}\cdot H_{j,t}+\epsilon^{+}_{i,j,t},$

$\displaystyle x^{-}_{i,j,t}=\alpha^{-}_{j}+\beta^{-}_{j}\cdot H_{j,t}+\epsilon^{-}_{i,j,t},$ (1)

where $x_{i,j,t}$ denotes the value of the $j$-th attribute of the $i$-th client at time $t$. The superscripts $+$ and $-$ refer to the group of defaulter and non-defaulter clients, respectively. The rest of the parameters are defined as random variables following a Bayesian framework (where we have suppressed the $+$ and $-$ to indicate that the same a priori model is assumed for both groups of customers):

$\displaystyle\alpha_{j},\beta_{j},H_{j,0}\sim\mathcal{N}(\mu,\sigma^{2}),$

$\displaystyle\epsilon_{i,j,t}\sim\mathcal{N}(0,\sigma^{2}_{j}),$

$\displaystyle\sigma^{2}_{j}\sim\text{InvGamma}(\alpha,\beta),$

$\displaystyle H_{j,t}\sim\mathcal{N}(H_{j,t-1},\sigma^{2}).$

Using standard properties of the Gaussian distribution, we then have that $X_{i,j,t}|\{\alpha_{j},\beta_{j},\sigma^{2}_{j},h_{j,t}\}\sim\mathcal{N}(\alpha_{j}+\beta_{j}\cdot h_{j,t},\sigma_{j}^{2})$. Note that in this model, we have a single hidden variable $H_{j,t}$ that jointly tracks the drift of the profile of the defaulter and non-defaulter clients for the $j$-th attribute. Furthermore, the attribute specific $\beta_{(\cdot)}$ coefficients can account for potential scale differences among the features.
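To make the generative reading of this model concrete, the following minimal sketch samples one attribute for one client group from the linear-Gaussian structure with a random-walk hidden variable. It is an illustration only: the function name `simulate_local_model`, the parameter values, and the choice to pass standard deviations rather than variances are assumptions made here, and the parameters are fixed by hand instead of being drawn from the priors.

```python
import random


def simulate_local_model(T, n_clients, alpha, beta, noise_sd, walk_sd, h0=0.0, seed=0):
    """Sample x_{i,t} = alpha + beta * H_t + eps_{i,t} for one attribute and one group.

    Illustrative assumptions: alpha, beta and both standard deviations are fixed
    constants, and H_t follows a Gaussian random walk started at h0.
    Returns the hidden path (length T) and a T x n_clients list of observations.
    """
    rng = random.Random(seed)
    h = h0
    hidden, data = [], []
    for _ in range(T):
        # Random-walk transition: H_t ~ N(H_{t-1}, walk_sd^2).
        h = rng.gauss(h, walk_sd)
        hidden.append(h)
        # Emission for every client: x_{i,t} ~ N(alpha + beta * H_t, noise_sd^2).
        data.append([rng.gauss(alpha + beta * h, noise_sd) for _ in range(n_clients)])
    return hidden, data


hidden_path, samples = simulate_local_model(
    T=63, n_clients=200, alpha=1.0, beta=2.0, noise_sd=0.5, walk_sd=0.3
)
# The monthly empirical means follow alpha + beta * H_t up to sampling noise.
monthly_means = [sum(row) / len(row) for row in samples]
```

In this sketch the drift of the monthly means is driven entirely by the hidden path, scaled by the slope and shifted by the intercept, which is the structure that the learned hidden variable is meant to recover from real data.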

In Fig. 4 we plot a detailed result of this analysis for two attributes: Account Balance (AB) and Unpaid Amount in Mortgages (UM), respectively. All means in the normal distributions have been arbitrarily initialized to zero: $\alpha_{j},\beta_{j},H_{j,0}\sim\mathcal{N}(0,\infty)$, where the variance has been initialized with a sufficiently large number to allow for adaptation; $\sigma^{2}_{j}\sim\text{InvGamma}(0,1)$ and $H_{j,t}\sim\mathcal{N}(H_{j,t-1},0.1)$. Each figure displays the following series:

• $\{x^{+}_{j,t}\}$ and $\{x^{-}_{j,t}\}$ show the empirical mean of the attribute (for defaulter/non-defaulter clients) at every month, i.e. $x^{+}_{j,t}=1/N^{+}\cdot\sum_{i}x^{+}_{i,j,t}$, where $N^{+}$ is the number of defaulter clients at month $t$ ($x^{-}_{j,t}$ is defined analogously). With this series we see how the empirical mean changes over time.

• $\{\mathbb{E}[H_{j,t}]\}$ shows the expected value of the hidden variable $H_{j,t}$, which aims at tracking the drift in the empirical means at each month for attribute $j$.

• Two series defined by $\{a_{j,t}^{+}\}\doteq\{\mathbb{E}[\alpha_{j}^{+}]+\mathbb{E}[\beta_{j}^{+}]\cdot\mathbb{E}[H_{j,t}]\}$ and $\{a_{j,t}^{-}\}\doteq\{\mathbb{E}[\alpha_{j}^{-}]+\mathbb{E}[\beta_{j}^{-}]\cdot\mathbb{E}[H_{j,t}]\}$, i.e., the linear combination of the expected values of the variables $\alpha_{j}^{+}$, $\beta_{j}^{+}$, $\alpha_{j}^{-}$, $\beta_{j}^{-}$ and $H_{j,t}$ at every month. These last series should approximate the series describing the empirical mean of the attribute (cf. Eq. (1)).
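The series listed above are simple to compute once posterior expectations are available. The sketch below is illustrative only: the helper names `empirical_means` and `model_means` are hypothetical, missing values are assumed to be encoded as `None`, and the posterior expectations are passed in as plain numbers rather than obtained from an inference engine.

```python
def empirical_means(values_by_month):
    """Monthly empirical mean of one attribute for one client group.

    values_by_month[t] lists the client values at month t; None marks a
    missing value and is ignored. A month without observations yields None.
    """
    means = []
    for row in values_by_month:
        obs = [x for x in row if x is not None]
        means.append(sum(obs) / len(obs) if obs else None)
    return means


def model_means(e_alpha, e_beta, e_hidden):
    """Model mean series a_t = E[alpha] + E[beta] * E[H_t] for every month t."""
    return [e_alpha + e_beta * h for h in e_hidden]


# Toy usage with made-up numbers (not taken from the BCC data).
x_bar = empirical_means([[1.0, 3.0], [None, 2.0]])
a_fit = model_means(1.0, 2.0, [0.0, 0.5])
```

Plotting `x_bar` against `a_fit` for each client group corresponds to the comparison between empirical and model means described in the list.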

Considering Fig. 4 we can make the following tentative conclusions:

• The series $\{a_{j,t}\}$ try to approximate the empirical mean series. The fit is not perfect because we are using a model with a small number of parameters, that is, we aim to fit two series with 126 values (the empirical monthly means of defaulters and non-defaulters) with a model which contains only 67 parameters (the 63 expected values of the variable $H_{j,t}$ plus the $\alpha_{j}$ and $\beta_{j}$ parameters of both client groups). Still, $\{a_{j,t}\}$ is able to capture the general trend of the empirical means series.

• The $\{\mathbb{E}[H_{j,t}]\}$ series aim to capture the drift in both empirical mean series $\{x^{+}_{j,t}\}$ and $\{x^{-}_{j,t}\}$. We note that the drifts in the $\{x^{+}_{j,t}\}$ series are different from the drifts in the $\{x^{-}_{j,t}\}$ series, as noted in Section 1. The $\{\mathbb{E}[H_{j,t}]\}$ series try to make a compromise between the two different drift trends. This is especially visible at the final stages of both time series (defaulters/non-defaulters) in Fig. 4.

• The movements of the time series are scaled by the values of their $\alpha_{j}$ and $\beta_{j}$ parameters for the same $\{\mathbb{E}[H_{j,t}]\}$.⁴

⁴ Due to confidentiality reasons, we are unfortunately not able to disclose the $\alpha$ and $\beta$ values.
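The counts quoted in the first conclusion can be checked with a short calculation. It assumes that the count is made per attribute, that the observation period from April 2007 to March 2014 contains 84 calendar months of which 21 are removed, and that the noise variances are not counted as parameters:

$\displaystyle 84-21=63,\qquad 2\cdot 63=126,\qquad 63+\underbrace{2}_{\alpha_{j}^{+},\,\beta_{j}^{+}}+\underbrace{2}_{\alpha_{j}^{-},\,\beta_{j}^{-}}=67.$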

If we take a closer look at the $\alpha_{j}$ and $\beta_{j}$ values of both defaulters and non-defaulters, we can understand why the same change in $\{\mathbb{E}[H_{j,t}]\}$ affects the estimated means differently. Intuitively speaking, the value of $\alpha_{j}$ determines the expected mean value of the variables when $\beta_{j}$ is zero, whereas $\beta_{j}$ determines the change with respect to $\{\mathbb{E}[H_{j,t}]\}$. If we consider the Unpaid Amount in Mortgages in Fig. 4b, the ratio $\frac{\alpha_{j}}{\beta_{j}}$ for non-defaulters is much higher than for defaulters, which means that the former will be less sensitive to changes in $\{\mathbb{E}[H_{j,t}]\}$.
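One way to see why this ratio matters is to look at relative rather than absolute changes of the model mean. The following short derivation is an interpretation offered here, not a statement from the analysis itself; it assumes $\beta_{j}\neq 0$, writes $h_{t}$ for $\mathbb{E}[H_{j,t}]$, and writes $a_{j,t}=\alpha_{j}+\beta_{j}h_{t}$ for the model mean of either client group. A change $\Delta h$ in the hidden variable changes the model mean by $\Delta a_{j,t}=\beta_{j}\Delta h$, so that

$\displaystyle\frac{\Delta a_{j,t}}{a_{j,t}}=\frac{\beta_{j}\,\Delta h}{\alpha_{j}+\beta_{j}h_{t}}=\frac{\Delta h}{\frac{\alpha_{j}}{\beta_{j}}+h_{t}}.$

For the same $\Delta h$ and $h_{t}$, a larger positive ratio $\frac{\alpha_{j}}{\beta_{j}}$ therefore gives a smaller relative change of the model mean.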

Figure 4.

Empirical means, model means, and the expectation of the local hidden variables for the two feature variables Att4 and Att5. More specifically, $\{x^{-}_{j,t}\}$ and $\{x^{+}_{j,t}\}$ are the empirical mean series for non-defaulter and defaulter clients respectively, $\{a_{j,t}^{-}\}$ and $\{a_{j,t}^{+}\}$ the learned expected means, and $\{\mathbb{E}[H_{j,t}]\}$ are the expected values of the learned hidden variable $H_{j,t}$ for Attribute Att $j$.

Finally, in Fig. 10 we plot the set of $\{\mathbb{E}[H_{j,t}]\}$ series for all six attributes analyzed in our financial data set. We note again that each $\{\mathbb{E}[H_{j,t}]\}$ series tries to reflect the joint evolution of the profile of the defaulter and non-defaulter clients with respect to the $j$-th attribute.

It is interesting to see in Fig. 10 how we can clearly identify two groups of attributes with different evolution trends. On the one hand, the attributes “Total Credit Amount”, “Unpaid Amount in Mortgages” and “Unpaid Amount in Personal Loans” (Att1, Att5 and Att6 respectively) exhibit a kind of monotonically increasing trend over time, with no seasonality. According to BCC’s experts, they mainly show the financial deterioration of the defaulting clients (cf. Fig. 1): higher unpaid amount in mortgages and higher total credit loans, although for “Unpaid Amount in Personal Loans” we can see a slow reduction across the period. The latter is because personal loans are typically small short-term loans with high interest rates, which clients prefer to pay back on time. Another effect comes into play here: during the observation period, many weak non-defaulter clients changed to the group of defaulting clients, leaving in the former group those clients that were more robust to changes in the economic climate. This translates into an improvement of the financial profile of the group of non-defaulter clients.

The other group of attributes, “Income”, “Expenses” and “Account Balance” (Att2, Att3 and Att4 respectively), identified in Fig. 10, presents a yearly seasonal pattern down-peaking at the end of the year, which characterizes the particular financial profile of BCC’s clients. The “Account Balance” attribute seems to have a more complex evolution, which will be discussed in more detail in the next section.

During the analysis above we have deliberately neglected that the estimators $\{\mathbb{E}[\alpha^{+}_{j}]\}$, $\{\mathbb{E}[\alpha^{-}_{j}]\}$, $\{\mathbb{E}[\beta^{+}_{j}]\}$ and $\{\mathbb{E}[\beta^{-}_{j}]\}$ can also evolve over time. The time-dependency is a consequence of the definition of the estimators; recall that they are calculated as streaming Bayesian posterior mean values, which in turn are based on the data seen so far. In consequence, the analysis of the $\{\mathbb{E}[H_{j,t}]\}$ series could in principle be hiding other types of concept drift: a constant $\{\mathbb{E}[H_{j,t}]\}$ series would, for example, be interpreted as if there was no concept drift, even though a drift could actually be absorbed by the $\bm{\alpha}_{j}$ and $\bm{\beta}_{j}$ series. We examine this potential issue further in Appendix A. We do so by conducting an off-line analysis to evaluate what happens when the parameters are kept fixed (i.e., we prevent the $\bm{\alpha}$ and $\bm{\beta}$ series from evolving over time), thereby ensuring that the $\{\mathbb{E}[H_{j,t}]\}$ series are the only means for the model to absorb the inherent dynamics. We show that the results in this setting are comparable to those of the procedure outlined above, and therefore conclude that this issue does not invalidate the present analysis.

5.2 Global hidden variables

In the previous section, we looked at the individual trends of each of the attributes. In this section, we are interested in capturing the joint global trend of all of them. For simplicity, let us start by disregarding the defaulter status of the clients, i.e.,

$\displaystyle x_{i,j,t}=\alpha_{j}+\beta_{j}\cdot H_{t}+\epsilon_{i,j,t}.$ (2)

We are now employing a single scalar variable to model the drift of the full set of variables defining the profile of the client (as before $\alpha_{j}$ and $\beta_{j}$ do not evolve over time). Despite this simple structure, the model is flexible enough to capture different interesting types of concept drift as exemplified below:

• Let us assume we have two series: $\{x_{1,t}\}$ does not change over time (beyond random white noise) while $\{x_{2,t}\}$ linearly increases over time (beyond random white noise). This can be captured by setting $\beta_{1}$ to $0$ and choosing a proper positive value for $\beta_{2}$ (both $\alpha_{1}$ and $\alpha_{2}$ need to be properly fixed to fit the data). $\{H_{t}\}$ will then linearly increase, reflecting the change of $\{x_{2,t}\}$.

• Assume now that $\{x_{1,t}\}$ increases linearly, and that $\{x_{2,t}\}$ decreases linearly at a higher pace. This can be captured by a positive $\beta_{1}$ value and a comparatively larger negative $\beta_{2}$ value. $\{H_{t}\}$ will then increase linearly, reflecting the change of $\{x_{1,t}\}$ and $\{x_{2,t}\}$. $\{H_{t}\}$ could also decrease linearly if we flip the signs of $\beta_{1}$ and $\beta_{2}$.
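The first example can be reproduced with a small simulation. The following is a minimal sketch and not the inference procedure used in the paper: it assumes that $\alpha_{j}$ and $\beta_{j}$ are known and recovers $\{H_{t}\}$ by ordinary least squares from the monthly means. The function names, the noise level and the number of simulated clients are illustrative assumptions.

```python
import numpy as np

def simulate(alpha, beta, h, noise_sd, n_clients, rng):
    """Draw x[i, j, t] = alpha[j] + beta[j] * h[t] + eps[i, j, t]."""
    eps = rng.normal(0.0, noise_sd, size=(n_clients, len(alpha), len(h)))
    return alpha[None, :, None] + beta[None, :, None] * h[None, None, :] + eps

def estimate_h(x, alpha, beta):
    """Least-squares estimate of H_t from the monthly means, given alpha and beta."""
    monthly_mean = x.mean(axis=0)              # shape (J, T)
    centred = monthly_mean - alpha[:, None]    # remove the attribute offsets
    return beta @ centred / (beta @ beta)      # shape (T,)

rng = np.random.default_rng(0)
h_true = np.linspace(0.0, 10.0, 84)            # linearly increasing hidden trend (assumed)
alpha = np.array([5.0, 1.0])
beta = np.array([0.0, 2.0])                    # x_1 stays flat, x_2 increases
x = simulate(alpha, beta, h_true, noise_sd=1.0, n_clients=500, rng=rng)
h_hat = estimate_h(x, alpha, beta)
```

Replacing `beta` with, for instance, `np.array([1.0, -3.0])` gives the second example, where one series increases and the other decreases at a higher pace while a single increasing $\{H_{t}\}$ still explains both.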

By extending the model to also include the defaulter status of the clients, we get

$\displaystyle x^{+}_{i,j,t}=\alpha^{+}_{j}+\beta^{+}_{j}\cdot H_{t}+\epsilon^{+}_{i,j,t};$

$\displaystyle x^{-}_{i,j,t}=\alpha^{-}_{j}+\beta^{-}_{j}\cdot H_{t}+\epsilon^{-}_{i,j,t}.$ (3)

Again, a single hidden variable $H_{t}$ will be used to jointly track the drift over time in the profiles of the two client groups. This extended version corresponds to the model described in Section 4, where variable $\bm{X}$ is conditioned on variable $Y$ .

Figure 11 shows the result of this analysis by plotting the $\{\mathbb{E}[H_{t}]\}$ series. It is interesting to see how this $\{\mathbb{E}[H_{t}]\}$ series displays a combination of a monotonically increasing trend and a seasonal change, so it seems to aggregate the different individual trends of each of the attributes. Even more interesting is to compare this series with the unemployment rate in the region of the financial institution when the latter is shifted three months to the past. As can be seen, both series are highly correlated. For example, during most of the first two years, there is hardly any seasonality in either series. However, after this period, starting from February 2009, both the unemployment rate and the $\{\mathbb{E}[H_{t}]\}$ series show a clear overlapping seasonality pattern.

Table 1

Pearson’s correlation coefficient between the unemployment rate (3 months shifted) and the $\{\mathbb{E}[H_{j,t}]\}$ and $\{\mathbb{E}[H_{t}]\}$ series

$\{\mathbb{E}[H_{1,t}]\}$	$\{\mathbb{E}[H_{2,t}]\}$	$\{\mathbb{E}[H_{3,t}]\}$	$\{\mathbb{E}[H_{4,t}]\}$	$\{\mathbb{E}[H_{5,t}]\}$	$\{\mathbb{E}[H_{6,t}]\}$	$\{\sum\mathbb{E}[H_{j,t}]\}$	$\{\mathbb{E}[H_{t}]\}$
0.926	0.672	0.818	$-$ 0.131	0.872	0.857	0.935	0.961

In Table 1 we show the Pearson correlation coefficient between the unemployment rate (three months shifted) and the $\{\mathbb{E}[H_{j,t}]\}$ and $\{\mathbb{E}[H_{t}]\}$ series. We also compute the Pearson correlation coefficient with respect to the series $\{\sum_{j}\mathbb{E}[H_{j,t}]\}$ , defined by the sum of all the local hidden variables. As can be seen, the correlation achieved by $\{\mathbb{E}[H_{t}]\}$ is higher than the correlation obtained by the rest of the series. This indicates that by using the global model defined in Eq. (2) we are able to better capture the global trend present in our data, which turned out to be largely driven by the unemployment rate.
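A lagged correlation of this kind can be computed as in the sketch below. It assumes that "three months shifted" means pairing $\mathbb{E}[H_{t}]$ with the unemployment rate at month $t+3$; the function name and the synthetic series are illustrative assumptions and are not the bank's data.

```python
import numpy as np

def lagged_pearson(h, ur, lag):
    """Pearson correlation between h[t] and ur[t + lag], with lag >= 0 in months."""
    h = np.asarray(h, dtype=float)
    ur = np.asarray(ur, dtype=float)
    if lag > 0:
        h, ur = h[:-lag], ur[lag:]    # drop the months without a partner
    return float(np.corrcoef(h, ur)[0, 1])

# Synthetic example: a trend plus a yearly seasonal pattern (assumed shape).
t = np.arange(87)
full = 12.0 + 0.2 * t + 2.0 * np.sin(2.0 * np.pi * t / 12.0)
ur = full[:84]          # stand-in for the unemployment rate
h = full[3:]            # stand-in for a hidden series that leads it by three months

r0 = lagged_pearson(h, ur, lag=0)
r3 = lagged_pearson(h, ur, lag=3)
```

In this toy setting the correlation at the correct lag is essentially one, while the unshifted correlation is lower because the seasonal components are out of phase.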

Correlation does not imply causation, but common sense tells us that when the unemployment rate in a small region moves from 12% to 30% in less than two years, it is difficult to imagine another factor that could have more impact on the financial situation of the inhabitants of this region. Thus, from this analysis, it seems reasonable to postulate that the enormous change in the economic profile of the clients was mainly driven by the changes in the unemployment rate during this period.

Figure 5.

Empirical $\{x^{+}_{j,t}\}$ and model mean $\{a^{+}_{j,t}\}$ (monthly aggregated) for two predictive variables whose trend seems to be captured by the global hidden variable.

We are now interested in exploring if the changes in the tendency for all the different variables are entirely explained by the $\{\mathbb{E}[H_{t}]\}$ series in this time period, and, consequently, fully determined by the unemployment rate. For that, we plot in Figs 5 and 6 the expected monthly values of the predictive variables as a linear combination of the parameters. That is, we plot the series $\{a_{j,t}^{+}\}$ and $\{a_{j,t}^{-}\}$ together with the empirical means $\{x_{j,t}^{+}\}$ and $\{x_{j,t}^{-}\}$ and analyze the goodness of the fit (in Section 5.4 we discuss a formal way to look at this issue). If all variables were perfectly learned, that would mean that a single global variable would be able to capture all the changes. There are some variables, like Income (Fig. 5) whose trend is very well captured by the global variable, despite the noise. However, if we look at other variables like Account Balance and Unpaid amount in mortgages (Fig. 6), we see that, especially towards the end of the time series, the fit starts to degrade.

Figure 6.

Empirical $\{x^{+}_{j,t}\}$ and learned mean $\{a^{+}_{j,t}\}$ (monthly aggregated) for two predictive variables whose trend is not entirely captured by the global hidden variable.

In the next subsection, we show how we can extend our approach to determine if there are unexplained trends which have not been captured by our single $\{\mathbb{E}[H_{t}]\}$ series and how we could capture them in a meaningful manner.

As for the local model, we also evaluate the robustness of the $\{\mathbb{E}[H_{t}]\}$ estimates with respect to changes in the series of $\bm{\alpha}$ and $\bm{\beta}$ estimators in Appendix A. Once again we find that the conclusions drawn above are not significantly affected by the potential drift in the parameter estimators.

5.3 Residual analysis

In order to possibly identify other unexplained trends, we look at the residuals defined as the difference between the observed value and the estimated value according to the model specified in Eq. (3),

$\displaystyle r_{i,j,t}=x_{i,j,t}-\mathbb{E}[\alpha_{j}]-\mathbb{E}[\beta_{j}]\cdot\mathbb{E}[H_{t}],$ (4)

where $\mathbb{E}[\alpha_{j}]$ , $\mathbb{E}[\beta_{j}]$ and $\mathbb{E}[H_{t}]$ denote the expected value of the random variables at month $t$ .

We then employ the same modeling approach we used in the previous sections, but now focusing on the calculated residuals. Firstly, we generate sequences of hidden variables $H^{r}_{j,t}$ to track the drift over time of the residuals for each attribute,

$\displaystyle r_{i,j,t}=\alpha^{r}_{j}+\beta^{r}_{j}\cdot H^{r}_{j,t}+\epsilon^{r}_{i,j,t}.$ (5)

Secondly, we generate another sequence of hidden variables $H^{r}_{t}$ to track the drift over time of the residuals for all the attributes jointly,

$\displaystyle r_{i,j,t}=\alpha^{r}_{j}+\beta^{r}_{j}\cdot H^{r}_{t}+\epsilon^{r}_{i,j,t}.$ (6)

We want to point out that this residual analysis has a straightforward interpretation in terms of multiple hidden variables: it corresponds to an extension of the models given in Eq. (3) or Eq. (1) to include two hidden variables, where the additional one is $H^{r}_{t}$ or $H^{r}_{j,t}$, respectively. This can be seen if we take expectations in Eq. (5) or Eq. (6) and use the equality of Eq. (4),

$\displaystyle\mathbb{E}[x_{j,t}]=\mathbb{E}[\alpha_{j}]+\mathbb{E}[\alpha^{r}_{j}]+\mathbb{E}[\beta_{j}]\cdot\mathbb{E}[H_{j,t}]+\mathbb{E}[\beta^{r}_{j}]\cdot\mathbb{E}[H^{r}_{j,t}]$

for the local model and

$\displaystyle\mathbb{E}[x_{j,t}]=\mathbb{E}[\alpha_{j}]+\mathbb{E}[\alpha^{r}_{j}]+\mathbb{E}[\beta_{j}]\cdot\mathbb{E}[H_{t}]+\mathbb{E}[\beta^{r}_{j}]\cdot\mathbb{E}[H^{r}_{t}],$

for the global model. In both cases, $\mathbb{E}[x_{j,t}]$ denotes the expected value of the $j$ -th attribute at time $t$ . Consequently, the following residual analysis results can also be interpreted as trying to capture an additional hidden variable modeling the drift behavior of the profile of the clients over time.
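The intermediate step for the global model can be spelled out as follows. This is a sketch under two assumptions that are not stated explicitly in the text: the noise terms have zero mean, and the expectation of the product $\beta^{r}_{j}\cdot H^{r}_{t}$ factorizes into the product of the expectations (as it would under a fully factorized posterior approximation). Taking expectations in Eq. (6) then gives

$\displaystyle\mathbb{E}[r_{j,t}]=\mathbb{E}[\alpha^{r}_{j}]+\mathbb{E}[\beta^{r}_{j}]\cdot\mathbb{E}[H^{r}_{t}],$

while taking expectations in Eq. (4) gives

$\displaystyle\mathbb{E}[r_{j,t}]=\mathbb{E}[x_{j,t}]-\mathbb{E}[\alpha_{j}]-\mathbb{E}[\beta_{j}]\cdot\mathbb{E}[H_{t}].$

Equating the two right-hand sides and solving for $\mathbb{E}[x_{j,t}]$ yields the expression for the global model. The local case follows in the same way from Eq. (5), with the residuals defined analogously and with $H_{j,t}$ and $H^{r}_{j,t}$ in place of $H_{t}$ and $H^{r}_{t}$.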

In Fig. 7 we show the $\{\mathbb{E}[H^{r}_{j,t}]\}$ series according to Eq. (5) for the residuals of all the attributes. When we compare these results with the ones displayed in Fig. 10, we can see that the $\{\mathbb{E}[H^{r}_{j,t}]\}$ series displays, for most of the attributes, a much more constant profile than the $\{\mathbb{E}[H_{j,t}]\}$ series. A quantitative evaluation of this fact is given in Table 2 (Rows 1 and 2), where we compare the variance of the $\{\mathbb{E}[H_{j,t}]\}$ and $\{\mathbb{E}[H^{r}_{j,t}]\}$ series, defined as $\text{Var}(\{\mathbb{E}[H_{j,t}]\})=\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}[H_{j,t}]-\bar{H}_{j}\right)^{2}$, where $\bar{H}_{j}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[H_{j,t}]$; similarly for $\text{Var}(\{\mathbb{E}[H^{r}_{j,t}]\})$. As can be seen, the variance is reduced for all the attributes, except for Att4 (Account Balance). It is interesting to see how the Account Balance attribute clearly diverges, showing that the trend of this attribute could not be captured by the previous hidden variable. Something similar happens, but only at the end of the series, for Att1 and Att5.
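For reference, the variance used here is normalised by $T$ rather than by $T-1$. A minimal sketch of the computation is given below; the function name is an illustrative assumption.

```python
import numpy as np

def series_variance(h):
    """Variance of a series with the 1/T normalisation (not 1/(T-1))."""
    h = np.asarray(h, dtype=float)
    return float(np.mean((h - h.mean()) ** 2))

example = series_variance([1.0, 2.0, 3.0, 4.0])   # mean 2.5, squared deviations sum to 5
```

This coincides with `np.var` at its default setting `ddof=0`, so a perfectly constant series has variance zero.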

Table 2

Variances of the $\{\mathbb{E}[H_{j,t}]\}$ , $\{\mathbb{E}[H^{r}_{j,t}]\}$ and $\{\mathbb{E}[H^{r2}_{j,t}]\}$ series; expected mean of the first global hidden variables, the global hidden variable for the first set of residuals and the global hidden variable for the second set of residuals respectively. The index $j$ corresponds to Attribute Att $j$

Series	Att1	Att2	Att3	Att4	Att5	Att6
Var $(\{\mathbb{E}[H_{j,t}]\})$	25.50	8.75	15.50	29.34	41.23	96.93
Var $(\{\mathbb{E}[H^{r}_{j,t}]\})$	8.00	4.10	3.68	43.17	27.17	3.67
Var $(\{\mathbb{E}[H^{r2}_{j,t}]\})$	0.42	3.74	2.94	2.23	7.14	1.34

Figure 7.

Expected values of the local hidden variables for the residuals $\{\mathbb{E}[H^{r}_{j,t}]\}$ , where the index $j$ corresponds to Attribute Att $j$ .

The behavior in Fig. 7 of the Account Balance attribute (Att4) can partly be understood by looking at Fig. 6a. Here we can see that the Account Balance attribute has a negative trend until the end of 2008. After that time point, the Account Balance has a positive trend, which even seems to accelerate from December 2012 until the end of the series. According to BCC’s expert, the first phase until the end of 2008 could show the progressive financial deterioration of weak non-defaulter clients in the first years of the financial crisis. The subsequent increase in mean account balance would show that the clients that still remain non-defaulters are the ones with higher savings. This first phase seems to be mainly driven by the increase in the unemployment rate. This is the reason why the $\{\mathbb{E}[H^{r}_{4,t}]\}$ series is largely constant during this period (that is, this change was already explained by $\{\mathbb{E}[H_{t}]\}$). The second and third phases of the evolution of Account Balance cannot be explained by the evolution of the unemployment rate. This is when the $\{\mathbb{E}[H^{r}_{4,t}]\}$ series for the Account Balance attribute starts to capture this deviation from the main trend.

Similar conclusions can be drawn for the Unpaid Mortgages attribute (Att5) by looking at Fig. 6c and the $\{\mathbb{E}[H^{r}_{5,t}]\}$ series for this attribute. Again, according to the BCC expert, these three phases are due to the effect of a mortgage restructuring process. During the first years of the crisis, we can again see a financial deterioration of the non-defaulter clients until mid-2009, which is mainly explained by the unemployment rate (that is, the $\{\mathbb{E}[H^{r}_{5,t}]\}$ series is constant for this phase because the trend was already captured by $\{\mathbb{E}[H_{t}]\}$). Then, during the period from mid-2009 to the end of 2010, it was common at the bank to allow clients to restructure their mortgages in order to pay them more easily. This policy aimed to slow down the quick increase in the number of defaulter clients due to the financial crisis. This is the reason why the $\{\mathbb{E}[H^{r}_{5,t}]\}$ series starts to capture something that cannot be explained by the unemployment rate (or the $\{\mathbb{E}[H_{t}]\}$ series). However, as bad economic conditions persisted, these non-defaulter clients with restructured mortgages finally started to default and were moved to the group of defaulter clients. Observe that the unpaid amount in mortgages starts to decrease for non-defaulter clients after 2011, but it increases more quickly for defaulter clients after the same date.

With the above comments in mind, we can better interpret the behavior of the $\{\mathbb{E}[H^{r}_{t}]\}$ series for the model in Eq. (6), plotted in Fig. 8. As can be seen in the figure, the $\{\mathbb{E}[H^{r}_{t}]\}$ series can again identify three different phases, which seem to summarize the behavior of the set of $\{\mathbb{E}[H^{r}_{j,t}]\}$ series shown in Fig. 7.

Figure 8.

The unemployment rate $\{$ UR $\}$ as well as the expected value of the global hidden variables for the iterative residuals: $\{\mathbb{E}[H_{t}]\}$ , $\{\mathbb{E}[H^{r}_{t}]\}$ , $\{\mathbb{E}[H^{r2}_{t}]\}$ , and $\{\mathbb{E}[H^{r3}_{t}]\}$ .

In the $\{\mathbb{E}[H^{r}_{t}]\}$ series displayed in Fig. 8, we can also see a rapid increase of the series starting at the beginning of 2013. BCC’s expert argues that this coincides in time with the merger of BCC with another smaller regional bank as part of a big restructuring process of the financial institutions that took place in Spain that year. However, a deeper analysis should be performed in order to corroborate these conclusions.

5.4 Quantitative model evaluation

The residual approach presented in the previous sections can obviously be repeated in an iterative fashion. This would be equivalent to trying to add more hidden variables for better modeling the drift of the attributes over time. In Fig. 8 we also show the result of this approach by including the $\{H^{r2}_{t}\}$ and $\{H^{r3}_{t}\}$ series, where $\{H^{r2}_{t}\}$ refers to the residuals of $\{H^{r}_{t}\}$ and $\{H^{r3}_{t}\}$ refers to the residuals of $\{H^{r2}_{t}\}$ . It can be seen that at every new iteration the curve becomes more constant, showing that there is less and less trend to be captured as time evolves. This can also be tested in a quantitative way by looking at the prequential marginal log-likelihood of the data according to the different models (with one hidden variable, two hidden variables, three hidden variables, etc.) and comparing them to the simple model $x_{i,j,t}=\alpha_{j}+\epsilon_{i,j,t}$ , i.e., a model without hidden variables.
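The iterative procedure can be illustrated with the sketch below. It deliberately swaps the streaming variational Bayesian inference used in the paper for a plain least-squares fit: each round fits $m_{j,t}\approx\alpha_{j}+\beta_{j}\cdot h_{t}$ to a matrix of monthly means through a rank-one singular value decomposition and passes the residuals on to the next round. The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_one_hidden(m):
    """Least-squares fit of m[j, t] ~ alpha[j] + beta[j] * h[t] via a rank-one SVD."""
    alpha = m.mean(axis=1)                               # per-attribute offset
    u, s, vt = np.linalg.svd(m - alpha[:, None], full_matrices=False)
    beta, h = u[:, 0] * s[0], vt[0]                      # sign of h is arbitrary
    residual = m - alpha[:, None] - np.outer(beta, h)
    return h, residual

def iterate_residuals(m, n_hidden):
    """Repeatedly fit one hidden series and hand the residuals to the next round."""
    hidden, sse = [], []
    current = m
    for _ in range(n_hidden):
        h, current = fit_one_hidden(current)
        hidden.append(h)
        sse.append(float(np.sum(current ** 2)))          # unexplained sum of squares
    return hidden, sse

rng = np.random.default_rng(1)
t = np.arange(84)
trend = t / 83.0                                         # main drift shared by all attributes
late = np.where(t > 60, (t - 60) / 23.0, 0.0)            # deviation late in the series
m = (np.outer([2.0, -1.0, 1.5, 0.5], trend)
     + np.outer([0.0, 0.0, 0.3, 1.0], late)
     + rng.normal(0.0, 0.02, size=(4, 84)))
hidden, sse = iterate_residuals(m, n_hidden=3)
```

In this toy setting the unexplained sum of squares shrinks with every additional hidden series, which mirrors the observation that the curves become more constant at each iteration.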

In Fig. 9, we plot the evolution of the marginal log-likelihood (this is an approximated value by variational methods, which is called the evidence lower bound (ELBO)) of the data for models with none, one, two, three, and four hidden variables. The plot shows that including a few hidden variables is enough to increase the marginal log-likelihood of the model, suggesting that increasing the model complexity beyond that point only yields small improvements.
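The prequential ("test, then train") evaluation scheme can be sketched as follows. This is not the ELBO computation of the paper: as a stand-in, it assumes a toy Gaussian model with unit variance whose mean is either a plain running average or a running average with exponential forgetting, and it scores every monthly batch before the batch is used for updating. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def gaussian_loglik(x, mean, var=1.0):
    """Log-likelihood of the points in x under N(mean, var)."""
    return float(np.sum(-0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)))

def make_model(forget):
    """Running-mean model; forget < 1 discounts old batches (toy drift handling)."""
    def predict(state, batch):
        mean, _ = state
        return gaussian_loglik(batch, mean)
    def update(state, batch):
        mean, weight = state
        weight = forget * weight + len(batch)
        mean = mean + (len(batch) / weight) * (batch.mean() - mean)
        return mean, weight
    return predict, update

def prequential_loglik(batches, predict, update, state):
    """Score each batch with the model learned from earlier batches, then update."""
    total, trace = 0.0, []
    for batch in batches:
        total += predict(state, batch)    # test first ...
        state = update(state, batch)      # ... then train
        trace.append(total)
    return trace

rng = np.random.default_rng(3)
batches = [rng.normal(0.1 * t, 1.0, size=50) for t in range(84)]   # drifting mean
static_trace = prequential_loglik(batches, *make_model(forget=1.0), state=(0.0, 0.0))
tracking_trace = prequential_loglik(batches, *make_model(forget=0.5), state=(0.0, 0.0))
```

On drifting data of this kind, the model that can follow the drift accumulates a higher prequential log-likelihood than the static one, which is the type of comparison shown in Fig. 9.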

Figure 9.

Prequential marginal log-likelihood for models with different numbers of hidden variables.

In Table 2 (Row 3), we also show the variance of the $\{H^{r2}_{j,t}\}$ series, which corresponds to the local residual analysis associated with the $\{H^{r2}_{t}\}$ series (when having three hidden variables). This again shows that the local effect of the variables is strongly reduced and, in consequence, mainly explained by the series $\{H_{j,t}\}$ and $\{H^{r}_{j,t}\}$.

6. Discussion

In this section, we first want to highlight that the models used in this paper are simple instantiations of the general model family described in Section 4. For example, we are making the unrealistic assumption that the attributes are independent conditional on the defaulter status of the clients (although this is partly alleviated by the global latent variable), even when knowing that there is a strong correlation between income, expenses, and the account balance of a client. Moreover, we assume that the attributes are normally distributed, which is also an inaccurate assumption considering the histograms displayed in Fig. 2. There exist straightforward ways to alleviate the assumptions about independence and normality in distribution by explicitly linking the observed variables or using extra hidden variables and non-Gaussian distributions. Still, since our goal was to understand the underlying dynamics rather than to find a model that is a perfect fit to the data, we have not pursued this line of investigation. Instead, we have seen that even when using this simple model class we are able to obtain important insights about the general trends governing the evolution of the financial profile of the clients. In our opinion, this is a strong point in favor of the robustness of our approach.

As commented above, the proposed probabilistic concept drift model considered here is able to detect both global and local concept drift, depending on the hidden structure of the model. Additionally, the iterative process of analyzing the residuals helps reveal different levels of concept drift that may be present. For instance, different attributes may vary at different rates in opposite directions so that the trend cannot be captured by simply aggregating the local hidden variables or by using a single global hidden variable.

The presented framework can easily be extended in different directions. Computationally, the only requirement is that we choose distributions such that the full model is in the conjugate exponential family. Interesting alternatives include the exponential distribution (for positive real numbers with heavy tails) and the Poisson distribution (for count data). The use of more complex dependency structures between the attributes, reflecting expert knowledge, would also allow us to design more faithful models. Referring to the qualitative characterization of types of concept drift described in [22], the instantiation of the framework employed in Section 5 is specifically targeting gradual, non-reoccurring drift. This model behaviour is to a large extent defined through the assumed prior distribution for $H_{t}|\{H_{t-1}=h_{t-1}\}$ and $H_{j,t}|\{H_{j,t-1}=h_{j,t-1}\}$ (for the global and local model, respectively). We chose Gaussian distributions with low a priori variance for these dynamic models when analyzing the financial data set, thereby encoding a preference for smooth dynamics in latent space. A larger a priori variance would fit well with situations with incremental or probabilistic drift. Reoccurring drift can be modelled using a mixture model for the latent variables, where the dynamic model is used to encode an a priori preference for staying in a regime.
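The effect of the a priori variance of the transition distribution can be illustrated with a scalar Kalman filter. The sketch below is a simplification and not the model of the paper: it assumes a single noisy observation per month with $\alpha=0$ and $\beta=1$, known variances, and exact Gaussian filtering instead of variational inference. The names and parameter values are illustrative assumptions.

```python
import numpy as np

def filter_random_walk(y, trans_var, obs_var, m0=0.0, v0=1.0):
    """Filtered means for H_t | H_{t-1} ~ N(H_{t-1}, trans_var), y_t | H_t ~ N(H_t, obs_var)."""
    means = np.empty(len(y))
    m, v = m0, v0
    for t, obs in enumerate(y):
        v_pred = v + trans_var                 # predict: uncertainty grows by trans_var
        gain = v_pred / (v_pred + obs_var)     # Kalman gain
        m = m + gain * (obs - m)               # correct with the new observation
        v = (1.0 - gain) * v_pred
        means[t] = m
    return means

def roughness(series):
    """Mean squared month-to-month change of a series."""
    return float(np.mean(np.diff(series) ** 2))

rng = np.random.default_rng(2)
y = np.linspace(0.0, 5.0, 84) + rng.normal(0.0, 1.0, size=84)
smooth = filter_random_walk(y, trans_var=0.01, obs_var=1.0)    # low a priori variance
reactive = filter_random_walk(y, trans_var=1.0, obs_var=1.0)   # high a priori variance
```

With a low transition variance the filtered series changes slowly from month to month, whereas a high transition variance lets it follow each observation more closely, at the cost of a noisier estimate.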

Another direction would be to position our concept drift analysis within a batch learning context. In such a setting, real-time analysis is not required, which means that future data can be used for making inferences about the past. In contrast, in this paper, we have pursued, following the requirements of our problem, a streaming approach, where inference is only based on past and present data. The Bayesian framework naturally supports the alternative batch approach, which we can exploit to get a more accurate global picture of our analysis. This is equivalent to the “smoothing” phase usually employed in dynamic systems with hidden variables [15].
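The difference between filtering and smoothing can be sketched for a scalar Gaussian random-walk model. The code below is an illustration under simplifying assumptions (one observation per month, known variances, exact Gaussian inference) and is not the inference scheme used in the paper; it runs a forward Kalman filter followed by a backward Rauch-Tung-Striebel pass. The names and parameter values are illustrative assumptions.

```python
import numpy as np

def filter_and_smooth(y, trans_var, obs_var, m0=0.0, v0=1.0):
    """Forward filtering and backward smoothing for a scalar random-walk model."""
    n = len(y)
    m_f, v_f = np.empty(n), np.empty(n)
    m, v = m0, v0
    for t in range(n):                          # forward pass: past and present data only
        v_pred = v + trans_var
        gain = v_pred / (v_pred + obs_var)
        m = m + gain * (y[t] - m)
        v = (1.0 - gain) * v_pred
        m_f[t], v_f[t] = m, v
    m_s, v_s = m_f.copy(), v_f.copy()
    for t in range(n - 2, -1, -1):              # backward pass: also uses future data
        v_pred = v_f[t] + trans_var
        j = v_f[t] / v_pred                     # smoother gain
        m_s[t] = m_f[t] + j * (m_s[t + 1] - m_f[t])
        v_s[t] = v_f[t] + j ** 2 * (v_s[t + 1] - v_pred)
    return m_f, v_f, m_s, v_s

rng = np.random.default_rng(4)
y = np.linspace(0.0, 5.0, 84) + rng.normal(0.0, 1.0, size=84)
m_f, v_f, m_s, v_s = filter_and_smooth(y, trans_var=0.05, obs_var=1.0)
```

The smoothed variances are never larger than the filtered ones, and the two estimates coincide at the last month, where no future data is available.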

The inherent flexibility of Bayesian latent variable models reinforces the importance of model validation. In Section 5.4 we applied a simple approach in order to study the suitability of our model. More complex evaluation procedures, which look at temporal dependencies between the residuals, would be a new line of work to validate the faithfulness of our model with respect to the analyzed data.

7. Conclusions

In this paper, we have used a novel model to capture different sources of concept drift in financial client data from the Spanish bank BCC. The data covers the period from April 2007 to March 2014. Despite the challenging distributions of the analyzed attributes and the simplicity of the applied model, we have been able to detect different trends that on the one hand relate to the general economic climate and on the other to the particular policies implemented by BCC during the period. The analysis is done in a streaming fashion, meaning that inferences drawn at a given point in time $t$ cannot rely on data observed after $t$ . We show that this filtering approach is sufficient to extract interesting concept drift information, and by comparing the generated results to those obtained by utilizing a computationally more expensive non-streaming technique we conclude that on-line analysis is indeed viable for concept drift detection and analysis.

The expected mean of the global concept drift variable in the model correlates almost perfectly with the unemployment rate in the region of the financial institution. It is thus natural to hypothesize that the main driving factor for concept drift is the unemployment rate, a perspective that was corroborated by a BCC expert. The analysis of the residuals has allowed us to pinpoint the attributes that do not follow the trend of the unemployment rate, mainly Account balance and Unpaid amount in mortgages. Closer analysis of these and consecutive residuals has shown different phases in which we can see the deterioration of the non-defaulter clients in the first years of the crisis, a shift of weak non-defaulter clients to the defaulter state, and more specific actions taken by BCC, like debt restructuring and possibly a merger with other smaller regional banks.

We have outlined future lines of research both from the point of view of the concept drift detector model and also for the practitioners.

Acknowledgments

The authors would like to thank BCC expert Ramón Sáez for providing valuable insights to the paper. DRL thanks the support from CDTIME (University of Almería), the research group FQM-229, and from Campus de Excelencia Internacional del Mar (CEIMAR) of the University of Almería.

Appendix

Robustness analysis

In this paper, we pursued an approach which is able to model concept drift in a streaming fashion. This approach is based on two different models: the local model described in Section 5.1 and in Eq. (1), and the global model described in Section 5.2 and in Eq. (3). In both models, we assumed that the expected values over $\alpha$ and $\beta$ coefficients (i.e. $\mathbb{E}[\alpha_{j}^{+}]$ , $\mathbb{E}[\alpha_{j}^{-}]$ , $\mathbb{E}[\beta_{j}^{+}]$ , $\mathbb{E}[\beta_{j}^{-}]$ ) were constant over time. We note, however, that this is not entirely accurate, because the expected value is computed from the posterior distribution over the parameters following a Bayesian approach,

$\displaystyle\mathbb{E}[\alpha_{j}^{+}]=\int\alpha_{j}^{+}p(\alpha_{j}^{+}|D_{1},\ldots,D_{t})\text{d}\alpha_{j}^{+},$

and this posterior depends on $D_{1:t}$, the data seen so far, and hence also on time. Strictly speaking, we should therefore have indexed this expected value with time to reflect this dependency.

Figure 10.

Expected values of the local hidden variables for all attributes. $\{\mathbb{E}[H_{j,t}]\}$ is the expected value of the hidden variable $H_{j,t}$ for Attribute Att $j$.

Figure 11.

Expected values of the global hidden variable $\{\mathbb{E}[H_{t}]\}$, the sum of the expected values for the local hidden variables from Section 5.1 $\{\sum_{j}\mathbb{E}[H_{j,t}]\}$, and unemployment rate $\{$ UR $\}$ series.

In this appendix we show that disregarding this temporal dependency of the expected values only has a marginal impact on the conclusions we draw from our interpretation of the evolution of the local and global hidden variables. In consequence, we argue that our approach can be safely used to track concept drift in a streaming fashion.

For this purpose we rerun the same experiments whose results were displayed in Figs 10 and 11 for the local and the global model, respectively. During these new experiments, all the $\alpha_{j}^{+}$, $\alpha_{j}^{-}$, $\beta_{j}^{+}$, $\beta_{j}^{-}$ values in the local and global models were fixed across time (i.e. they were not considered random variables in the Bayesian model).⁵

⁵ This setup is only intended to illustrate the marginal effect the time-varying parameters have on the previous analysis. It is not being proposed as an alternative analysis method, which would have taken the proposed method out of the streaming context.

To find meaningful values for these parameters, we decided to choose the last available estimate of the parameters in the local and global models (i.e. after processing the information for all the months). For example, when rerunning the global model we set the $\alpha_{j}^{+}$ value equal to

$\displaystyle\bar{\alpha}_{j}^{+}=\int\alpha_{j}^{+}p(\alpha_{j}^{+}|D_{1:T})\text{d}\alpha_{j}^{+},$ (7)

where $T=84$ is the last month of data made available to us. The other $\alpha$ and $\beta$ values were computed in the same way.

In consequence, the modeling equations of the global model (cf. Eq. (3)) were rewritten as follows,

$\displaystyle x^{+}_{i,j,t}=\bar{\alpha}^{+}_{j}+\bar{\beta}^{+}_{j}\cdot H_{t}+\epsilon^{+}_{i,j,t};$

$\displaystyle x^{-}_{i,j,t}=\bar{\alpha}^{-}_{j}+\bar{\beta}^{-}_{j}\cdot H_{t}+\epsilon^{-}_{i,j,t},$

where the following entities are now assumed to be random variables according to the Bayesian formulation:

$\displaystyle H_{0}\sim\mathcal{N}(\mu,\sigma^{2}),$ $\displaystyle\epsilon_{i,j,t}\sim\mathcal{N}(0,\sigma^{2}_{j}),$ $\displaystyle\sigma^{2}_{j}\sim\text{InvGamma}(\alpha,\beta),$ $\displaystyle H_{t}\sim\mathcal{N}(H_{t-1},\sigma^{2}).$

We follow the same approach to re-run the local model.

Notice how Eq. (7) takes $\alpha_{j}^{+}$ outside the streaming paradigm; we are utilizing all the data $D_{1:T}$ when estimating $H_{t}$, even if $t<T$. The model defined in this appendix is therefore not suitable for online analysis, but will serve as a basis for post-analysis of the results displayed in Figs 10 and 11.

To this end, Figs 12 and 13 show the result of this analysis for the local and global model, respectively. In these figures we plot together the output of both approaches (with and without fixed $\alpha$ and $\beta$ values). To make the comparison easier to appreciate, we rescale the series.⁶ Note that the absolute values of the hidden variables are not relevant for this analysis, only the relative changes. From a statistical point of view, this is not a problem because Gaussian distributions are translation invariant.

⁶ Each series is shifted to cross zero in the middle of the time series by subtracting its original value at that point, and all values are then divided by the maximum of the series to guarantee a maximum value of one in each series.
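The rescaling described in footnote 6 can be sketched as follows. The sketch assumes that "the middle of the time series" refers to the central month and that the shifted series has a positive maximum; the function name is an illustrative assumption.

```python
import numpy as np

def rescale(series):
    """Shift a series to cross zero at its midpoint and scale its maximum to one."""
    s = np.asarray(series, dtype=float)
    shifted = s - s[len(s) // 2]       # the series now crosses zero at the midpoint
    return shifted / shifted.max()     # assumes shifted.max() > 0

rescaled = rescale([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because this is a shift followed by a division by a positive constant, the Pearson correlation between two series is unchanged by the rescaling.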

Figure 12.

Expected values $\{\mathbb{E}[H_{j,t}]\}$ of all the variables according to the standard local model and for the local model with fixed $\alpha$ and $\beta$ values.

It can be appreciated that in the first months the trends captured by the hidden variables (i.e. the $\{\mathbb{E}[H_{t}]\}$ and $\{\mathbb{E}[H_{j,t}]\}$ series) hardly match in some cases, but they overlap much more closely in the rest of the months.

In Table 3 we quantitatively evaluate this assessment by computing the Pearson correlation coefficient between both series (i.e. with and without fixed $\alpha$ and $\beta$ values), considering all months and, also, after discarding the first third of the months. With this analysis we can observe that the correlation in the last two thirds of the months is high (except for Att1), while when considering all the months the correlation drops. The reason we find for this situation is that at the beginning the $\alpha$ and $\beta$ values are randomly initialized. During the first months the $\alpha$ and $\beta$ values are adjusted, in combination with the hidden variables, to fit the data. The prior distribution on the $\alpha$ and $\beta$ values is $\mathcal{N}(0,\infty)$, see Section 5.1, which means that large changes in their values are allowed, especially when little data has been observed. Estimates of the hidden variables during these first months are therefore affected by these earlier estimates of the $\alpha$ and $\beta$ values and, in consequence, not very reliable. This is akin to a burn-in phase where the estimates should be discarded. But this problem vanishes as time goes on and both series (with and without fixed $\alpha$ and $\beta$ values) become strongly correlated. This analysis shows that the trends captured by the local and global methods without fixed $\alpha$ and $\beta$ values are reliable after discarding the first time steps.
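The comparison with and without the initial months can be sketched as below. The synthetic series are illustrative assumptions that mimic an unreliable first third; they are not the series behind Table 3.

```python
import numpy as np

def correlation_with_burn_in(a, b):
    """Pearson correlation over all months and over the last two thirds only."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    start = len(a) // 3                                    # discard the first third
    full = float(np.corrcoef(a, b)[0, 1])
    late = float(np.corrcoef(a[start:], b[start:])[0, 1])
    return full, late

t = np.arange(84, dtype=float)
fixed = t.copy()                       # stand-in for the series with fixed parameters
streaming = t.copy()                   # stand-in for the streaming series
streaming[:28] = 60.0 - 2.0 * t[:28]   # unreliable estimates during the first months
full, late = correlation_with_burn_in(fixed, streaming)
```

In this toy example the two series agree perfectly once the first third is discarded, while the correlation over all months is lower, which is the pattern reported in Table 3.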

Table 3

Pearson correlation coefficient between the $\{\mathbb{E}[H_{j,t}]\}$ ($\{\mathbb{E}[H_{t}]\}$) series according to the standard local (global) model and for the local (global) model with $\alpha$ and $\beta$ values fixed. The first row shows the correlation considering all months. The second row shows the correlation considering only the last two thirds of the months

Series	Global	Att1	Att2	Att3	Att4	Att5	Att6
All Months	0.793	$-$ 0.182	0.776	0.855	0.702	0.755	0.867
Last 2/3 Months	0.945	0.703	0.954	0.952	0.984	0.942	0.998

Figure 13.

Expected values of the global hidden variable $\{\mathbb{E}[H_{t}]\}$ for the standard global model and for the global model with fixed $\alpha$ and $\beta$ values.

Synthetic data sets

We show how the concept drift modeling framework detailed in Section 4 can be used to analyse two synthetic data sets, widely employed as benchmarks in the concept drift literature. All the experiments have been performed using MOA [2], where the developed concept drift model (in Fig. 3) has been integrated as a new Bayesian streaming classifier, named bayes.amidstModels. The Java code to reproduce the experiments can be downloaded from http://amidst.github.io/toolbox/.

References

Beal

M.J.

, Variational algorithms for approximate Bayesian inference, PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

Bifet

Holmes

Kirkby

and Pfahringer

, MOA: Massive Online Analysis, Journal of Machine Learning Research 11 (2010), 1601–1604.

Borchani

Martínez

A.M.

Masegosa

Langseth

Nielsen

T.D.

Salmerón

Fernández

Madsen

A.L.

and Sáez

, Modeling concept drift: a probabilistic graphical model based approach, In Proc. of The Fourteenth Int. Symposium on IDA, 2015, pp. 72–83.

Bravo

and Maldonado

, Fieller stability measure: A novel model-dependent backtesting approach, Journal of the Operational Research Society 66(11) (2015), 1895–1905.

Cabañas

Martínez

A.M.

Masegosa

A.R.

Ramos-López

Samerón

Nielsen

T.D.

Langseth

and Madsen

A.L.

, Financial data analysis with pgms using amidst, In Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference on Data Mining, IEEE, 2016, pp. 1284–1287.

Castermans

Martens

Gestel

T.V.

Hamers

and Baesens

, An overview and framework for pd backtesting and benchmarking, Journal of the Operational Research Society 61(3) (2010), 359–373.

Cieslak

D.A.

and Chawla

N.V.

, Detecting fractures in classifier performance, In Seventh IEEE International Conference on Data Mining (ICDM 2007), IEEE, 2007, pp. 123–132.

Cooper

G.F.

, The Computational Complexity of Probabilistic Inference Using Bayesian Belief Networks, Artificial Intelligence 42 (1990), 393–405.

Gama

Sebastião

and Rodrigues

P.P.

, On evaluating stream learning algorithms, Machine Learning 90(3) (2013), 317–346.

10.

Gama

Žliobaitė

Bifet

Pechenizkiy

and Bouchachia

, A survey on concept drift adaptation, ACM Computing Surveys 46(4) (2014); 44:1–44:37.

11.

Hofer

and Krempl

, Drift mining in data: A framework for addressing drift in classification, Computational Statistics & Data Analysis 57(1) (2013), 377–391.

12.

Hulten

Spencer

and Domingos

, Mining time changing data streams, In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 2001, pp. 97–106.

13.

Jordan

M.I.

Ghahramani

Jaakkola

T.S.

and Saul

L.K.

, An introduction to variational methods for graphical models, Machine Learning 37 (1999), 183–233.

14.

Kelly

M.G.

Hand

D.J.

and Adams

N.M.

, The impact of changing populations on classifier performance, In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, Citeseer, 1999, pp. 367–371.

15. Koller and Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.

16. Lima, Mues and Baesens, Monitoring and backtesting churn models, Expert Systems with Applications 38(1) (2011), 975–982.

17. Masegosa A.R., Martínez A.M. and Borchani, Probabilistic graphical models on multi-core CPUs using Java 8, IEEE Computational Intelligence Magazine 11(2) (2016), 41–54.

18. Masegosa A.R., Martínez A.M., Ramos-López, Cabañas, Salmerón, Langseth, Nielsen T.D. and Madsen A.L., AMIDST: a Java toolbox for scalable probabilistic machine learning, Knowledge-Based Systems 163 (2019), 595–597.

19. Masegosa A.R., Nielsen T.D., Langseth, Ramos-López, Salmerón and Madsen A.L., Bayesian models of data streams with hierarchical power priors, In International Conference on Machine Learning, 2017, pp. 2334–2343.

20. Moreno-Torres J.G., Raeder, Alaiz-Rodríguez, Chawla N.V. and Herrera, A unifying view on dataset shift in classification, Pattern Recognition 45(1) (2012), 521–530.

21. Street and Kim, A streaming ensemble algorithm (SEA) for large-scale classification, In 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2001, pp. 377–382.

22. Webb G.I., Hyde, Cao, Nguyen H.L. and Petitjean, Characterizing concept drift, Data Mining and Knowledge Discovery 30(4) (2016), 964–994.

23. Winn J.M. and Bishop C.M., Variational message passing, Journal of Machine Learning Research 6 (2005), 661–694.

24. Žliobaitė, Bifet, Read, Pfahringer and Holmes, Evaluation methods and decision theory for classification of streaming data with temporal dependence, Machine Learning 98(3) (2014), 455–482.

¹ The analysis of the experiments in this paper is practically the same if we consider the peak months, except that the results are noisier around these months [3].

³ The code and models used in this paper can be downloaded from the AMIDST Toolbox webpage (through its GitHub repository): www.amidsttoolbox.com.