Abstract
Abstract:
In some application fields, series are affected by two different types of effects: abrupt changes (or change-points) and functional effects. We propose here a Bayesian approach that allows us to estimate these two parts. Here, the underlying piecewise-constant part (associated to the abrupt changes) is expressed as the product of a lower triangular matrix by a sparse vector and the functional part as a linear combination of functions from a large dictionary where we want to select the relevant ones. This problem can thus lead to a global sparse estimation and a stochastic search variable selection approach is used to this end. The performance of our proposed method is assessed using simulation experiments. Applications to three real datasets from geodesy, agronomy and economy fields are also presented.
Introduction
The motivation for this work initially comes from the change-point detection framework. Indeed, change-point detection problems arise in many fields such as biology (Boys and Henderson 2004), geodesy (Williams 2003; Bertin et al. 2017), meteorology (Caussinus and Mestre 2004; Fearnhead 2006; Wyse et al. 2011; Ruggieri 2013) or astronomy (Dobigeon et al. 2007), for example. The statistical purpose is therefore the detection of these change-points. However, some additional effects have been observed in some series from these fields. For example, in genomics, for comparative genomic hybridization (CGH) data, the change-points are associated to chromosomal aberrations and the studied series can also contain technical artefacts like ‘waves’ (see Picard et al. 2011 and references therein for more details). In geodesy, GPS coordinates series are used to determine accurate station velocities in tectonic studies. These series are affected by two different types of biases: abrupt changes due to equipment changes or earthquakes (change-points) and environmental effects such as soil moisture or atmospheric pressure changes that are reflected in these series by the apparition of periodic signals (Dong et al. 2002). An example of such series is given in Figure 1 where the vertical dotted lines correspond to known changes (antenna changes, clock changes, receiver changes, etc.). The correct and separate estimation of these two effects (the change-points and the ‘additional’ effect) is crucial. Both problems have been considered by Picard et al. 2011 for CGH data and Bertin et al. 2017 for GPS series in a frequentist inference framework. They propose to consider the ‘additional’ effect as a functional effect and to estimate it using splines, wavelets or using a dictionary approach. This latter approach was motivated by the fact that for the GPS application, contrary to the CGH one, the interpretation of the functional part in terms of known functions is important.
Example of a GPS coordinate series (series YAR2 studied in this article). The known changes and reported in databases (Altamimi et al. 2011) are indicated by the vertical dotted lines
Example of a GPS coordinate series (series YAR2 studied in this article). The known changes and reported in databases (Altamimi et al. 2011) are indicated by the vertical dotted lines
In this article, we consider the problem in a Bayesian framework and propose a novel approach that allows us to both estimate the segmentation part (the change-points and the means over the segments) and the functional part. The Bayesian approach has the advantage that expert knowledge can be introduced through prior distributions, as, for example, the known changes in the GPS application. Moreover, posterior distributions allow us for a quantification of the uncertainty, giving in particular posterior probabilities or credible intervals for the change-point locations. This is of particular interest for practitioners.
Here, following the work of Harchaoui and Lévy-Leduc (2010) in a frequentist framework, we propose to consider the change-point detection problem as a variable selection issue in a regression framework, that is new in the Bayesian framework to our knowledge. More precisely, the associated part in the model is expressed as a product of a lower triangular matrix by a sparse vector (with non-zero coordinates corresponding to change-point positions).
The new modelling of the change-point problem and the way of determining the functional effect we propose leads to a unified estimation strategy in terms of variable selection issues. Many efficient methods have been developed for Bayesian variable selection in a regression framework, in particular, the well-known stochastic search variable selection (SSVS) approach (George and McCulloch 1997; George and McCulloch 1993) that we consider in this article.
The remainder of the article is organized as follows. Section 2 presents the hierarchical Bayesian model considered, Section 3 outlines the procedure used to estimate the model parameters. In Section 4, the performance of the proposed method is studied through simulations. The advantages of the Bayesian approach are also illustrated in three real data studies in the fields of geodesy, agriculture and economy. Finally, in Section 6, we discuss and summarize the procedure results and benefits.
Segmentation model with functional part
We observe a series
A classical approach in non-parametric frameworks is to expand the functional part
To estimate the change-points in the series, we follow the strategy proposed by Harchaoui and Lévy-Leduc (2010), which consists in reframing this task in a variable selection context. We denote by
The model (2.1) can be rewritten as follows:
Following George and McCulloch (1993), we first introduce latent variables
Then, as usual in a Bayesian context, these parameters are treated as random variables, assumed here to be independent, and we consider the following prior distributions. The
The posterior distribution of
A classical approach for the computational scheme would be to estimate all of the parameters at the same time
For these two reasons, we propose the following two-step strategy: the first step aims at detecting the positions of the change-points and at selecting the functions, that is, to estimate the latent vectors Estimation of Estimation of
In the following subsections, we give some details of both steps.
Metropolis–Hastings algorithm
The joint posterior distribution integrated with respect to
The number of iterations of this algorithm is
Gibbs sampler algorithm
Once
A Gibbs sampler algorithm is then used. At each iteration, the three parameters should be drawn from its full conditional distribution given by:
Simulation study
In this section, we conduct a simulation study to assess the performance of our proposed method to estimate both the ‘parametric’ part (the segmentation part) and the ‘non-parametric’ part (the functional part). Our method is called SegBayes_SP for ‘semi-parametric’. Moreover, in order to investigate the impact of taking into account the functional part on the estimation of the segmentation part, we compare the results of SegBayes_SP with those of the same Bayesian procedure that includes only the segmentation part in the model (i.e., if the model is supposed to be
Simulation design, parameters of the procedures and quality criteria
For both procedures, we use the same prior parameters. With respect to the Metropolis–Hastings algorithm and the Gibbs sampler, we run each one for 20 000 iterations including 5 000 burn-in iterations. The parameters
For the segmentation part,
- the root mean squared distance between the true mean and its estimate is -the proportion of erroneously detected change-points among detected change-points, denoted by For the functional part,
- the root mean squared distance between - the proportion of erroneously detected functions among detected functions, denoted by
The averages of these criteria over the 100 simulations are considered. Note that we also look at the estimation of the standard deviation of the noise
Note that a perfect segmentation results in both null
Both procedures have been implemented in R (R Core Team, 2009) on quad-core processors 2.8GHz Intel X5560 Xeons with 8MB cache size and 32 GB total physical memory. In particular, the standard Metropolis–Hastings algorithm for procedure SegBayes_SP took 169 minutes, for the 400 simulations (100 series for 4 noises, so it took approximately 25 seconds for each simulation).
In the following, we first analyse the results obtained with the 100 simulated series and then we make some remarks on the study of a particular series.
Average quality criteria on 100 simulated series with
,
,
and
, for SegBayes_SP (semi-parametric model) and SegBayes_P (parametric model)
Average quality criteria on 100 simulated series with
,
,
and
, for SegBayes_SP (semi-parametric model) and SegBayes_P (parametric model)
When the detection problem is easy (
When the noise
Finally, the quality of the estimation of both parts is related to the quality of the estimation of
An advantage of these Bayesian procedures is that we obtain empirical posterior probabilities for the possible change-points and functions. This will be illustrated in the following paragraph.
Average of the estimates
on 100 simulated series with
,
,
and
, for SegBayes_SP (semi-parametric model) and SegBayes_P (parametric model)
The results of the procedure SegBayes_SP for the first series with
The advantage of a Bayesian approach compared to a frequentist one, is that the posterior probabilities give some additional interesting information. For instance, the posterior probabilities of the change-points at positions 0, 7 and 18 are 1, while those of the change-points 35 and 36 are 0.48 and 0.55, respectively (see plots (
On the same simulated series, the result of the procedure SegBayes_P (without considering the functional part) is given in Figure 5. The change-points 7, 18, 36, 59 and 60 are selected with high posterior probabilities. The two last detected change-points are false positive, but they correspond to the Haar 60 from the functional part. As explained in the previous paragraph, when the functional part is forgotten, the segmentation tends to catch it.
Results of SegBayes_SP on the particular simulated series with
. Posterior probabilities for the
and
components (plots (
) and (
)). Traces of the
components 35, 36 and 11 (plots (
, (
) and (
)). The series, the whole true expectation and its estimation, the true positive (TP), false positive (FP) and false negative (FN) change-points are also represented (plot (
)). The true and estimated segmentation part and functional part (plots (
) and (
))
Functions from the dictionary and their corresponding indexes
Result of SegBayes_P on the particular simulated series with
: the series, the whole true expectation and its estimation, the true positive (TP), false positive (FP) and false negative (FN) change-points are also represented
As pointed out previously, the case
Results of SegBayes_SP on the particular simulated series with
. Top: posterior probabilities for the
and
components. Bottom: the series, the whole true expectation and its estimation, the true positive (TP), false positive (FP) and false negative (FN) change-points are also represented
On average, the procedure is not over sensitive to the choice of the prior parameters: from the 21 runs, 10 detected exactly the true change-points and functions, and 8 runs detected the true change-points and functions with a shift or an ‘exchange’. For instance, runs 9 and 13 select change-points at positions 37 and 35, respectively, instead of the 36. Runs 6, 11, 16 and 19 select a change-point at position 10 instead of 7, and functions 9 and 10 (Haar 8 and Haar 9) instead of 11 (Haar 10).
Some other sensitivity remarks can still be made. First, we can see that too small a value for
To study the convergence, we ran the algorithm three times with the same prior parameters as run 1, with 20 000 iterations (5 000 of burn-in), and one time with 50 000 iterations (10 000 of burn-in). The results are given in Table B2 in Appendix B. We observe that 20 000 iterations, including 5 000 of burn-in, seem to be enough to reach convergence since the obtained results are similar for these four runs (or to the previous runs 1, 5, 8, 11, 14, 17 and 20 that have the same prior parameters). The acceptance rates for
Geodetic data
In this section, we propose to use our procedure in the geodesic field for the problem of homogenization of GPS series. Indeed, such series are used to determine accurate station velocities for tectonic and Earth mantle studies (King et al. 2010). However, they are affected by two effects: (i) abrupt changes that are related to equipment changes (documented or not), earthquakes or changes in the raw data processing strategy and (ii) periodic signals that are due to environmental signals, such as soil moisture or atmospheric pressure changes. The correct detection of these effects is fundamental for the aforementioned application.
Here, we consider a particular series (the height coordinate of the series) from the GPS station in Yarragadee, Australia, YAR2 at the weekly scale. The size of the series is
We apply our proposed procedure SegBayes_SP to this series with a dictionary of
The Metropolis–Hastings algorithm is run for 100 000 iterations (30 000 burn-in), with
The results are given in Figure 7. We call ‘validated change-points’ the detected change-points that are documented in databases (in red); ‘unreported change-points’ the detected change-points, but not documented in databases (in blue); and ‘missed change-points’ the reported changes in databases that are not detected here (in green). The different status of the change-points should be tempered. Indeed, small earthquakes may not have been reported in databases, some changes of equipment should have no impact or a delayed impact. Moreover, according to the obtained posterior probabilities, we have chosen a threshold of
Result of the procedure SegBayes_SP on the YAR2 series. Top: posterior probabilities of the change-points and functions. Bottom: Estimated expectation and validated, unreported and missed change-points (based on known equipment changes and malfunctions)
Result of the procedure SegBayes_SP on the YAR2 series. Top: posterior probabilities of the change-points and functions. Bottom: Estimated expectation and validated, unreported and missed change-points (based on known equipment changes and malfunctions)
A total of 5 of the 14 change-points detected by our procedure correspond exactly or are close to known changes reported in databases: 1 085 which corresponds exactly to a clock change, and GPS weeks 1 689 and 1 708 which correspond exactly to radome radar changes. The change-points at GPS week 1 057 and 1 494 may be associated with the receiver changes at GPS week 1 052 and 1 479. At the beginning of the series, a large number of positions declared as change-points may be due to the particular erratic behaviour of the data that contrasts with the rest of the series. Indeed we can observe that this part contains two known changes at GPS week 1 016 (a clock change) and 1 020 (receiver change), followed by many missing data and then again a receiver change at the GPS week 1 052. A possible explanation of this erratic behaviour could then be a calibration problem of the equipment during this period. Note that we also apply our procedure with non-informative priors for the positions of the change-points (all the initial probabilities equal to 0.01). In this case, several documented change-points are no longer detected, showing the interest of using prior knowledge.
Examining the selected functions, four of them were selected:
In this section, we propose to use our procedure in the econometrics field for the problem of the exchange rate. More precisely, we study the daily records of the Mexican peso/US dollar exchange rate from January 2007 to December 2012 (data available at
We apply our proposed procedure SegBayes_SP to this series with a dictionary of
The initial probability for each possible function is 0.01, as well as the initial probability for a position. Concerning the Gibbs sampler algorithm, we run it for 100 000 iterations including 50 000 burn-in iterations and we choose
At the top of Figure 8, the posterior probabilities of the change-points and functions are given. At the bottom of Figure 8, the series with both the estimation of the expectation and the estimated change-points are given. The five most probable change-points detected are dated 3 October 2008, 14 January 2009, 26 March 2009, 3 April 2009 and 14 September 2011.
These are close to the ones obtained by Martínez and Mena (2014) and, as explained in this article, four can be related to events in Mexico and USA: (i) September–October 2008: the 2007–2008 financial crisis; (ii) March–May 2009: the flu pandemic suffered in Mexico; (iii) the US debt-ceiling crisis in 2011. The other change-points detected by Martínez and Mena (2014), that do not correspond to known events, are not detected by our procedure, certainly due to the presence of the functional part.
Result of the procedure SegBayes_SP on the Mexican peso/US dollar exchange rate series. Top: posterior probabilities of the change-points and functions. Bottom: estimated expectation and change-points
Result of the procedure SegBayes_SP on the Mexican peso/US dollar exchange rate series. Top: posterior probabilities of the change-points and functions. Bottom: estimated expectation and change-points
In this section, we propose to use our procedure to analyse the Périgord black truffle production in France. A decrease in the production has been observed, and researchers in agronomy and the truffle orchard managers are interested in exhibiting the causes of this phenomena. In particular, the specialists want to detect abrupt changes due to sociological factors such as wars and rural desertification. Moreover, the production can be affected by periodical climatic variations and long-term tendencies like climatic warming (see Le Tacon et al. (2014)). Using data from the French Ministry of Agriculture, the annual production of Périgord black truffle in the Vaucluse was calculated for the years 1904 to 1972. The access to and the calculation of these reliable data was not easy, and the reader can refer to Le Tacon (2017) for more information. Our procedure SegBayes_SP was applied to these data, with a dictionary of 110 functions that includes the constant function, 26 Fourier functions (
Result of the procedure SegBayes_SP on the Périgord black truffle production dataset. Top: posterior probabilities of the change-points and functions. Bottom: the series and its estimated expectation
In this article, we propose a novel Bayesian method to segment a series with functional effects. The functional part is estimated by a linear combination of functions from a dictionary. Since the dictionary can be large, this approach is flexible and allows us to estimate functions with both smooth components and local irregularities. The estimation procedure consists in selecting the relevant functions of the dictionary. For the change-points, following Harchaoui and Lévy-Leduc (2010), the associated part of the model is reformulated as a variable selection issue. Globally, these two considerations result in the estimation of sparse vectors for which a SSVS approach is applied.
We show the good performance of our procedure in a simulation study and for three real datasets. In particular in the three examples, expected change-points and functions of interest are obtained. The flexible modelling of the functional part allows us to recover periodic components suggested in previous works in the GPS example or exceptional year of production (as an irregularity) for the Périgord black truffe production data.
Our method is based on MCMC algorithms (Metropolis–Hastings algorithm and Gibbs sampler). Although these algorithms can take more time than those used in a frequentist approach, our procedure benefits from the Bayesian framework, which results in two important aspects. The first one is that posterior distributions of the parameters are obtained. From these distributions, different quantities can easily be derived as credibility intervals of the means, the change-points and the selected functions, or the probability to have a change-point in a given interval. Note that obtaining such information is a very intricate task in a frequentist framework. The second important aspect is that we can introduce expert knowledge through prior distributions (see Section 5). For example, as shown in the GPS example for which information about potential change-points are available, some change-points are not detected when non-informative priors are used, whereas they are detected when previous knowledge is taken into account. Note that for this field of application, up until recently, the detection of the abrupt changes was done by a visual inspection.
An important issue is the choice of the criterion to estimate the parameters
To use our procedure, hyper-parameters should be chosen, but the sensitivity analysis shows that the procedure is not over sensitive to these choices.
Finally, this work could be extended in the two following ways: the analysis of multiple series instead of one series, that allows us to improve the estimation of the functional part when this part is shared by all the series as in Picard et al., 2011, and the analysis of time-dependent series.
Acknowledgments
Meili Baragatti, Karine Bertin, Emilie Lebarbier and Cristian Meza are supported by FONDECYT grants 1141256 and 1141258, and the mathamsud 16-MATH-03 SIDRE and SaSMoTiDep 18-MATH-07 projects.
The authors are grateful to F. Le Tacon and J.L. Dupouey for permission to use the agronomic data, which were difficult to collect.
Appendix Integration of the joint posterior distribution
Integrating the joint posterior (2.2) with respect to
Integrating with respect to
SegBayes_SP: prior parameters used in the different runs of the Metropolis–Hastings algorithm applied on the particular series with
. The number of iterations is 20 000 with a burn-in of 5 000 iterations for all runs
SegBayes_SP: results of four runs of the Metropolis–Hastings algorithm applied on the particular series with
, with the same prior parameters than run 1 in Table 2
