Abstract
In many fields, researchers need to evaluate the impact of an intervention on a particular phenomenon. In classical time series analysis, intervention analysis is a natural choice, but no analogous methodology has been developed for low-count time series. In this article, we propose a modified INAR model that quantifies the effect of an intervention and can also account for possible trends or seasonal behaviour. Several applications in real and simulated contexts are also discussed.
Introduction
This article focuses on evaluating the impact of an intervention on the number of occurrences of a particular phenomenon by using discrete time series techniques. Therefore, unlike in many other applications of time series, the main interest is not in forecasting but in estimating the effect of the intervention and making inferences about it. Many models of discrete time series have been considered in the literature (see McKenzie [2003]), although we focus on integer autoregressive (INAR) models, which are a natural extension of the well-known AR models and are often easily interpretable in practical contexts. In many contexts, such as public health or sociology, it is usual to design and conduct an intervention to change the behaviour of some phenomenon. When dealing with continuous valued time series or series with large counts, intervention analysis may be used for this purpose (see Helfenstein [2005] for a comprehensive handbook). When the time point at which the potential change occurs is unknown, change-point analysis can be used instead (see Csörgö and Horváth [1997] and Horváth and Rice [2014]). For count data, however, there is no clear analogous methodology, although there have been some recent contributions (Vasileios, 2015; Liboschik et al., 2016) and an application of change-point techniques to INAR models (Hudecová et al., 2015). Nonetheless, most of these models focus on real-time monitoring for structural changes in the series, whereas we focus on transient or definitive changes in the newly observed cases after an intervention, assessed through a retrospective analysis.
In addition to the potential effect of the intervention, the phenomenon under study may present a higher incidence at certain times of the year or a trend over time, and these behaviours are not covered by the classical
The proposed model is presented in detail in Section 2, where some approaches to evaluating the goodness of fit are also discussed. In Section 3, three examples of application in different contexts are presented. The first example is based on a plain simulation and illustrates how the proposed methodology can be applied in a simple framework. The second example discusses the well-known effect of the compulsory wearing of seatbelts in Great Britain on the number of drivers of light goods vehicles killed. Finally, the third example deals with a recent public health concern: the effect of mass events where unprotected sex is common on the number of cases of sexually transmitted infections (STI).
Model definition
where
where
There are several ways to understand the parameters involved in an
For each intervention we are interested in, we define a dummy variable
because we are considering dependences of order
In fact, the main interest in our context will be to test the hypothesis
The number of cases at time
This can be used to validate how the model fits the real data, calculating the variance of
As will be discussed in the examples in the following section, model selection can be based on the statistical significance of the parameters and on any usual information criterion, such as the Akaike information criterion (AIC).
The goodness of fit of the selected model can be assessed through a discretised version of the Cox–Snell residuals (Cox and Snell, 1968), computed from the estimated conditional distribution:
where
The maximization of (2.4) for the examples discussed in this section has been done with a program developed in R (R Core Team, 2016) using the nonlinear minimization procedure nlm; the program is available as supplementary material. The standard errors of the estimates have been calculated from the inverse of the corresponding Hessian matrix.
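As a rough illustration of this estimation strategy, the following Python sketch fits a plain Poisson INAR(1) model by conditional maximum likelihood and derives standard errors from the inverse of a numerically approximated Hessian. It is not the authors' R program: the likelihood is the standard Poisson INAR(1) one rather than (2.4), a simple compass (pattern) search stands in for nlm, and all function names (`transition_prob`, `neg_loglik`, `compass_minimize`, `numerical_hessian`, `fit_inar1`) are hypothetical.

```python
import math
from collections import Counter


def transition_prob(cur, prev, alpha, lam):
    """P(X_t = cur | X_{t-1} = prev): binomial survivors plus Poisson arrivals."""
    total = 0.0
    for survivors in range(min(prev, cur) + 1):
        arrivals = cur - survivors
        thinning = (math.comb(prev, survivors) * alpha ** survivors
                    * (1.0 - alpha) ** (prev - survivors))
        innovation = math.exp(-lam) * lam ** arrivals / math.factorial(arrivals)
        total += thinning * innovation
    return total


def neg_loglik(params, transitions):
    """Negative conditional log-likelihood; infinite outside 0 < alpha < 1, lam > 0."""
    alpha, lam = params
    if not 0.0 < alpha < 1.0 or lam <= 0.0:
        return math.inf
    return -sum(freq * math.log(transition_prob(cur, prev, alpha, lam))
                for (prev, cur), freq in transitions.items())


def compass_minimize(f, start, step=0.1, tol=1e-5):
    """Derivative-free compass search (ASSUMPTION: used here in place of R's nlm)."""
    best, best_value = list(start), f(start)
    while step > tol:
        improved = False
        for i in range(len(best)):
            for direction in (step, -step):
                trial = list(best)
                trial[i] += direction
                value = f(trial)
                if value < best_value:
                    best, best_value, improved = trial, value, True
        if not improved:
            step /= 2.0  # no neighbour is better: refine the mesh
    return best, best_value


def numerical_hessian(f, x, h=1e-3):
    """Central finite-difference approximation of the Hessian of f at x."""
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                point = list(x)
                point[i] += si * h
                point[j] += sj * h
                return f(point)
            hess[i][j] = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return hess


def fit_inar1(series):
    """Return ((alpha, lam), (se_alpha, se_lam), minimised negative log-likelihood)."""
    transitions = Counter(zip(series[:-1], series[1:]))

    def objective(params):
        return neg_loglik(params, transitions)

    mean = sum(series) / len(series)
    start = [0.5, max(0.5 * mean, 0.1)]  # crude starting values
    estimate, value = compass_minimize(objective, start)
    (a, b), (c, d) = numerical_hessian(objective, estimate)
    det = a * d - b * c
    # Diagonal of the inverse 2x2 Hessian gives the approximate variances.
    return tuple(estimate), (math.sqrt(d / det), math.sqrt(a / det)), value
```

Calling `fit_inar1(counts)` on a list of non-negative integer counts returns the two estimates, their approximate standard errors and the minimised negative log-likelihood. The standard errors are only meaningful when the estimates lie in the interior of the parameter space.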
Example 1: Simulation study
Two INAR(1) processes were simulated, with parameters
Simulated INAR(1) process with an intervention at time
(a) and no intervention (b)
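A simulation of this kind can be sketched in a few lines of Python. The sketch below generates a Poisson INAR(1) series through binomial thinning, once with and once without an intervention. It rests on two assumptions that are not taken from the text: the intervention is modelled as a step change `beta` in the innovation mean from time index `t0` onwards, and the parameter values in the usage lines are purely illustrative. The function names are hypothetical.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method (small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def binomial_thinning(rng, count, alpha):
    """Thinning operator: each of `count` units survives with probability alpha."""
    return sum(1 for _ in range(count) if rng.random() < alpha)


def simulate_inar1(n, alpha, lam, beta=0.0, t0=None, seed=0):
    """Simulate X_t = (alpha thinned X_{t-1}) + W_t, with W_t Poisson distributed.

    ASSUMPTION: the innovation mean is lam before t0 and lam + beta from t0
    onwards (a permanent step change); with t0=None there is no intervention.
    """
    rng = random.Random(seed)
    # Start from the stationary marginal of the process, Poisson(lam / (1 - alpha)).
    series = [poisson_draw(rng, lam / (1.0 - alpha))]
    for t in range(1, n):
        step = beta if (t0 is not None and t >= t0) else 0.0
        survivors = binomial_thinning(rng, series[-1], alpha)
        series.append(survivors + poisson_draw(rng, lam + step))
    return series


# Illustrative (hypothetical) parameter values, not those used in the article.
series_a = simulate_inar1(200, alpha=0.5, lam=2.0, beta=3.0, t0=100, seed=1)
series_b = simulate_inar1(200, alpha=0.5, lam=2.0, seed=1)
```

Because both calls share the same seed, `series_a` and `series_b` coincide before the intervention time and diverge only afterwards, which makes the effect of the step change easy to see in a plot.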
Therefore, a change is expected to be detected for
Similarly, an INAR(1) process consisting of 500 observations was simulated with parameters
Another approach would be to test for structural changes in the series, following, for example, the methodology described in Hudecová et al. (2015). In this case, as the change in series (a) is structural and the series does not return to the pre-intervention values after the intervention, this methodology is able to detect a change at
The series consists of the monthly number of drivers of light goods vehicles killed in Great Britain from January 1969 to December 1984, including a period after the compulsory wearing of seatbelts was introduced (February 1983 to December 1984), which is the intervention to be evaluated. These data are a subset of a time series discussed in Harvey and Durbin (1986), which included all casualties for drivers and passengers of cars, resulting in numbers large enough to be analysed through methods for continuous valued data. Later, the same dataset considered here was analysed in Liboschik et al. (2016), who found a significant seasonal effect. This monthly seasonality can be included in (2.3) in a similar way to that of Moriña et al. (2011):
The AIC for an equivalent model fitted with the R package tscount, as in Liboschik et al. (2016), is 966.3269, while the proposed methodology reduces the AIC to 923.2088. Using the seasonal INAR(2) model introduced in Moriña et al. (2011) (and omitting the intervention effect), the AIC is 931.1917. From a methodological point of view, the best model can be chosen on the basis of the statistical significance of the parameters or the AIC, and in this case both criteria point to model (2.3). The code used to fit all these models is available as supplementary material.
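The comparison can be made explicit with a short Python sketch that ranks the three models by their reported AIC and computes the difference from the best one. The AIC values are those quoted above; the dictionary labels and the helper `aic`, which implements the usual definition (twice the number of parameters minus twice the maximised log-likelihood), are illustrative.

```python
def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood (smaller is better)."""
    return 2.0 * n_params - 2.0 * loglik


# AIC values as reported in the text; the labels are only descriptive.
reported_aic = {
    "tscount model": 966.3269,
    "proposed model (2.3)": 923.2088,
    "seasonal INAR(2), no intervention": 931.1917,
}

best_model = min(reported_aic, key=reported_aic.get)
aic_differences = {
    name: value - reported_aic[best_model] for name, value in reported_aic.items()
}

for name in sorted(reported_aic, key=reported_aic.get):
    print(f"{name}: AIC = {reported_aic[name]:.4f}, "
          f"difference = {aic_differences[name]:.4f}")
```

The differences from the best model are roughly 8 for the seasonal INAR(2) model and roughly 43 for the tscount model.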
Maximum likelihood estimates
The estimate for
The observed number of road casualties in the considered period and the model estimates given by (2.7) are shown in Figure 2, together with the 95% approximate confidence interval built using (2.8). It can be seen that most of the observed values are inside this interval (85.9%). The vertical line represents the moment when the use of the seatbelt became mandatory (February 1983).
Observed and estimated values with the 95% confidence interval
Using (2.8) again to build the 95% approximate confidence interval but with estimates based on the seasonal INAR(2) model introduced in Moriña et al. (2011), only 69.3% of the observed values were within the limits.
In Figure 2, it can be seen that the observed counts are rather persistently below the mean estimated by the model. This could be due to a poor fit of the time series model, or it could simply be a consequence of the skewness of the one-step-ahead predictive distributions, in which case we should indeed expect to see a majority of the observed counts below the model's mean. The mid-pseudo-residuals approach described in Section 2.2 can be used to distinguish between these two explanations. The results of this approach are shown in Figure 3 and indicate that the latter explanation is correct, thus supporting the suitability of this model for our purpose of detecting the impact of the introduction of mandatory wearing of seatbelts on the number of road casualties.
Autocorrelation function (ACF) and partial ACF (PACF) of the mid-pseudo-residuals of the model for the number of road casualties in the United Kingdom (1969–1984)
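A residual check of this kind can be sketched in Python as follows. The sketch assumes the common definition of mid-pseudo-residuals, namely the standard normal quantile of the midpoint between the conditional distribution function evaluated at the observed count minus one and at the observed count; the exact expression of Section 2.2 is not reproduced here. For brevity it uses a plain Poisson INAR(1) conditional distribution without intervention or seasonal terms, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def transition_prob(cur, prev, alpha, lam):
    """P(X_t = cur | X_{t-1} = prev) for a Poisson INAR(1) process."""
    total = 0.0
    for survivors in range(min(prev, cur) + 1):
        arrivals = cur - survivors
        thinning = (math.comb(prev, survivors) * alpha ** survivors
                    * (1.0 - alpha) ** (prev - survivors))
        innovation = math.exp(-lam) * lam ** arrivals / math.factorial(arrivals)
        total += thinning * innovation
    return total


def mid_pseudo_residuals(series, alpha, lam):
    """Normal quantile of the midpoint of [F(x_t - 1), F(x_t)] for each transition.

    ASSUMPTION: F is the one-step-ahead conditional distribution function of
    a Poisson INAR(1) model with the given (already estimated) parameters.
    """
    normal = NormalDist()
    residuals = []
    for prev, cur in zip(series[:-1], series[1:]):
        lower = sum(transition_prob(k, prev, alpha, lam) for k in range(cur))
        upper = lower + transition_prob(cur, prev, alpha, lam)
        # Keep the midpoint strictly inside (0, 1) so the quantile is finite.
        mid = min(max(0.5 * (lower + upper), 1e-12), 1.0 - 1e-12)
        residuals.append(normal.inv_cdf(mid))
    return residuals


def sample_acf(values, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    denom = sum(c * c for c in centred)
    return [sum(centred[t] * centred[t + lag] for t in range(n - lag)) / denom
            for lag in range(1, max_lag + 1)]
```

If the model is adequate, the residuals should behave approximately like independent standard normal draws, so the values returned by `sample_acf` should mostly stay within about 1.96 divided by the square root of the number of residuals.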
Lymphogranuloma venereum (LGV) is an STI caused by the bacterium Chlamydia trachomatis. So-called circuit parties, where unprotected sexual encounters are common, have recently become popular, especially among gay and bisexual men, and the impact of these mass events on the number of cases of this and other STIs is a public health concern (Cheung et al., 2015; Zenilman and Avery, 2015). The analysed data correspond to the number of LGV cases registered in the Barcelona area from January 2007 to December 2014. The time evolution of these data is shown in Figure 4, and no trend or seasonal behaviour is observed.
Observed number of LGV cases in Barcelona (2007–2014)
According to the ACF and PACF of the process, a model of order 1 seems to be appropriate, so the model
The estimates of the parameters are shown in Table 2, and all of them are again different from zero. In particular,
Maximum likelihood estimates
The AIC of this model is 706.0132, while the AIC corresponding to the standard INAR(1) model is 716.7405. Therefore, the proposed model is preferred. To check the goodness of fit of the model (3.3), the mid-pseudo-residuals approach described in Section 2.2 was used, and the results are shown in Figure 5, supporting the suitability of the model (3.3).
ACF and PACF of the mid-pseudo-residuals of the model for the number of LGV cases in Barcelona (2007–2013)
Following the approach described in Hudecová et al. (2015), using values of
Measuring the impact of planned actions or unexpected events on a time series is an issue that arises in many fields, including public health, sociology, economics and politics. When dealing with continuous time series, intervention analysis has been widely used in the literature. For instance, in Gilmour et al. (2006), the authors propose a modification of classical intervention analysis to handle a change point with an unknown date (a situation where the exact time of an intervention is uncertain). In Huitema et al. (2014), intervention analysis is used to analyse the effect of the introduction of pedestrian countdown timers on the number of accidents involving pedestrians. In this work, we propose a flexible model based on INAR discrete time series models that suits a wide range of situations in which the aim is to evaluate the impact of a planned intervention, an unexpected event or a combination of one or more actions or circumstances, so it might be useful to researchers in many areas. In addition to estimating the impact of the intervention, the proposed methodology is capable of providing forecasts, allowing for the midterm estimation of the intervention impact on the counts and improving on the performance of the usual INAR models. Regarding the last example, it is shown that this methodology might be used in the public health context to evaluate the impact of certain events that attract a huge (and increasing) number of participants on the incidence of STIs, drug consumption or other problems. Further research is needed, however, to overcome some difficulties, such as the long incubation periods and the considerable proportion of attendees coming from abroad, whose cases therefore go unnoticed by the health authorities of the place where the event is held.
Supplementary materials
Supplementary materials for this article, including the R code and data used in all the examples (a similar simulated dataset was generated for Example 3, as the original data are not publicly available), are available from
Acknowledgments
We would like to thank Professor Pedro Puig, Professor Tilmann Gneiting and the reviewers for their valuable suggestions, which helped to improve the article considerably. We also thank the Agència de Salut Pública de Barcelona, especially Professor Joan Caylà and Patricia García de Olalla, for providing us with the LGV data discussed in the last example.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The research leading to these results has received funding from Recer-Caixa (2015ACUP00129) and was partially supported by grants from the Instituto de Salud Carlos III-ISCIII (Spanish Government) co-funded by FEDER funds/European Regional Development Fund (ERDF), "a way to build Europe" (References: RD12/0036/0056, PI11/02090), from the Agència de Gestió d'Ajuts Universitaris i de Recerca (2014SGR 756, 2014SGR 1307) and the Spanish MINECO (FIS2015-71851-P, MTM2015-69493-R). David Moriña acknowledges financial support from the Spanish Ministry of Economy and Competitiveness through the María de Maeztu Programme for Units of Excellence in R&D (MDM-2014-0445) and from Fundación Santander Universidades.
