Identifying influential observations in a time series from the frequency domain point of view

Abstract

This study explores the influence of individual observations in a time series or a discrete time signal. The goal is to detect abnormal observations from a frequency domain point of view, whereas most of the relevant studies have been done from a time domain point of view. The concept of the influence function from the field of robust statistics is borrowed to identify influential observations in a time series. An empirical version of the influence function for the discrete Fourier transform of a time series is derived, and a statistic is then proposed to identify influential observations from the frequency domain point of view. Although the proposed statistic can be calculated with simple arithmetic operations, case studies show that the proposed method is capable of identifying influential or abnormal observations of a time series. Identifying such observations gives a better understanding of the nature of a time series and helps to control possible future influential observations.

Keywords

Discrete Fourier transform, influence function, robust statistics, signal processing, statistical estimation, time series

1. Introduction

When dealing with a time series or a discrete time signal, it is important to be aware that influential observations may exist. This study explores the influence of observations in a time series. Most previous studies have addressed this issue in the time domain; this study instead works in the frequency domain and attempts to identify the observations that affect the frequency with the largest magnitude in the discrete Fourier transform (DFT) of a time series or a discrete time signal.

Peña (1990) studied how to identify influential observations in univariate autoregressive integrated moving average (ARIMA) time series models and presented influence statistics based on the Mahalanobis distance. Lefrançois (1991) presented a method to obtain various measures of the influence for the autocorrelation functions as well as thresholds for declaring an observation over-influential. Bruce and Martin (1989) proposed diagnostics by measuring the change in the parametric estimates of autoregressive integrated moving average models fitted for time series formed by deleting observations from the whole data. Gupta et al. (2013) provided a comprehensive and structured overview of large-scale and interesting outlier definitions for different types of temporal data. Shittu and Shangodoyin (2008) considered the identification of outliers in a frequency domain using the spectral method. Most recently, Ren et al. (2019) proposed a novel algorithm based on spectral residual. In particular, an outlier detection procedure has been proposed by Chen and Liu (1993) for detecting several outlier types in autoregressive integrated moving average time series models such as ‘innovative outliers’, ‘additive outliers’, ‘level shifts’ and ‘temporal and seasonal changes’.

The idea in this study is rooted in the concept of the influence function in the field of robust statistics. An empirical version of the influence function for the discrete Fourier transform of a time series is derived, and subsequently a statistic is proposed to identify influential observations of a time series from the frequency domain point of view. The proposed method, which is based on the influence function, is rather straightforward in identifying outliers or influential observations. The remainder of this paper (1) shows how to measure the influence of an observation on the DFT and (2) proposes a way to test whether an observation is influential by referring to the $\chi^{2}$-distribution.

Two data sets are considered, one for fine dust levels and the other for retail sales. Both the proposed method and, as the reference method, the widely used method of Chen and Liu (1993) are applied to identify influential or abnormal observations in the example data sets. We argue that the proposed method has the same level of performance as the reference method but is easier to use.

2. Methodology

2.1 Influence function

Let $X_{0},\ldots,X_{N-1}$ be a random sample from a distribution $F(x;\theta)$, where the parameter $\theta$ is a functional of $F$, written $\theta=T(F)$. The influence function (IF), or influence curve, is then defined by:

$\displaystyle IF(x)=\lim_{\epsilon\to 0}\frac{T((1-\epsilon)F+\epsilon\Delta_{x})-T(F)}{\epsilon},$ (1)

where $\Delta_{x}$ is the point-mass distribution at $x$. More precisely, $\Delta_{x}(t)$ is defined to be 1 if $x\leqslant t$ and 0 otherwise.

The influence function signifies the effect of an infinitesimal contamination at the point $x$ on the estimator (Hampel, 1974; Huber, 1981; Hampel et al., 2011). The influence function can also be expressed as the derivative of $T(F_{\epsilon})$ with respect to $\epsilon$, where $F_{\epsilon}=(1-\epsilon)F+\epsilon\Delta_{x}$, evaluated at $\epsilon=0$, that is,

$\displaystyle\frac{d}{d\epsilon}T({F_{\epsilon}})|_{\epsilon=0}.$ (2)

The empirical version of the influence function can be obtained by replacing $F$ in Eq. (1) or (2) by the empirical distribution function $\hat{F}_{N}$, defined as

$\displaystyle\hat{F}_{N}(x)=\frac{1}{N}\sum_{n=0}^{N-1}I(X_{n}\leqslant x),$

where $I$ is the indicator function.

For example, the influence function of the population mean $\mu$ for any observation $x$ is

$\displaystyle IF(x)=x-\mu$

and its corresponding empirical influence function (EIF) is

$\displaystyle\textit{EIF}(x)=x-\bar{x},$ (3)

where $\bar{x}$ is the sample mean.
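
As a quick numerical check of Eq. (3), the sketch below contaminates the empirical distribution with a small mass at $x$ and forms the difference quotient of Eq. (1). The sample values, the contamination point and the step size `eps` are made up for illustration; this is only one way to verify the closed form.

```python
import numpy as np

def contaminated_mean(sample, x, eps):
    """Mean of the mixture (1 - eps) * F_hat + eps * Delta_x."""
    return (1.0 - eps) * np.mean(sample) + eps * x

def empirical_influence_mean(sample, x, eps=1e-6):
    """Difference quotient of Eq. (1) with F replaced by F_hat."""
    return (contaminated_mean(sample, x, eps) - np.mean(sample)) / eps

sample = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical data
x = 11.0                                  # hypothetical contamination point

eif_numeric = empirical_influence_mean(sample, x)
eif_closed = x - sample.mean()            # Eq. (3)
print(eif_numeric, eif_closed)
```

Because the mean is linear in the distribution, the difference quotient agrees with $x-\bar{x}$ up to rounding error for any small `eps`.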

2.2 Proposed method

The discrete Fourier transform transforms a sequence of $N$ (complex) numbers, ${\rm{\bf x}}=\{x(0),x(1),\ldots,x(N-1)\}$, into another sequence of complex numbers, $X=\{X(0),X(1),\ldots,X(N-1)\}$, defined by

$\displaystyle X(k)=\frac{1}{N}\sum_{n=0}^{N-1}x(n)e^{-j\frac{2\pi}{N}kn}$ (4)

for $k=0,\ldots,N-1$ (Oppenheim et al., 2001; Hayes, 2009).

In fact, the discrete Fourier transform $X(k)$ is nothing but the mean of $\{{x(n)e^{-j\frac{2\pi}{N}kn}:n=0,\ldots,N-1}\}$ . Similar to Eq. (3), we propose to take

$\displaystyle x(m)e^{-j\frac{2\pi}{N}km}-X(k)$ (5)

for measuring the influence of an observation $x(m)$ on the discrete Fourier transform $X(k)$ .
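
The following sketch evaluates Eq. (4) and the influence measure of Eq. (5) for every $m$ at a chosen $k$. The function names and the short series are hypothetical; note that `numpy.fft.fft` omits the $1/N$ factor used in Eq. (4), so the coefficient is computed here directly as a mean.

```python
import numpy as np

def dft_coefficient(x, k):
    """X(k) of Eq. (4): the mean of x(n) * exp(-j 2 pi k n / N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    return np.mean(x * np.exp(-2j * np.pi * k * n / len(x)))

def dft_influence(x, k):
    """Eq. (5) for m = 0, ..., N-1: x(m) exp(-j 2 pi k m / N) - X(k)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    terms = x * np.exp(-2j * np.pi * k * n / len(x))
    return terms - terms.mean()

series = np.array([1.0, 3.0, -2.0, 0.5, 9.0, -1.0, 2.0, 0.0])  # made-up values
influence = dft_influence(series, k=1)

# Index of the observation with the largest influence in modulus.
most_influential = int(np.argmax(np.abs(influence)))
print(most_influential, np.abs(influence).round(3))
```

As with the mean in Eq. (3), the influences sum to zero over $m$, so a large modulus singles out an observation that pulls $X(k)$ away from the contribution of the rest.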

On the other hand, consider the linear process $x(t)$ driven by the variable $\epsilon_{t}$ such that

$\displaystyle x(t)=\sum_{j=-\infty}^{\infty}\psi_{j}\epsilon_{t-j},\quad\sum_{j=-\infty}^{\infty}|\psi_{j}|<\infty,$

where $\epsilon_{t}\sim\textit{iid}({0,\sigma_{\epsilon}^{2}})$ , and then according to Brockwell and Davis (1991), Brillinger (2001) and Shumway and Stoffer (2017), for $\omega_{k}=k/N$ ( $k=0,1,\ldots,N-1)$ , we have

$\displaystyle\frac{2I(\omega_{k})}{f(\omega_{k})}\to\chi^{2}(2)\textit{ in distribution},$ (6)

where $I$ is a periodogram and $f$ is a spectral density function.
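
Eq. (6) can be illustrated by simulation in the simplest case. The sketch below assumes Gaussian white noise, for which the spectral density is flat and equal to $\sigma_{\epsilon}^{2}$ under the normalisation $I(\omega_{k})=\frac{1}{N}|\sum_{n}x(n)e^{-j\frac{2\pi}{N}kn}|^{2}$. The sample size, frequency index, number of replications and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, arbitrary
N, k, reps, sigma = 128, 5, 2000, 1.5     # hypothetical settings

# Each row is one simulated white-noise series of length N.
noise = rng.normal(0.0, sigma, size=(reps, N))

n = np.arange(N)
basis = np.exp(-2j * np.pi * k * n / N)

# Periodogram ordinate at omega_k = k / N for every replication.
periodogram = np.abs(noise @ basis) ** 2 / N

# For white noise f(omega_k) = sigma^2, so this is 2 I / f.
ratio = 2.0 * periodogram / sigma**2

mean_ratio = ratio.mean()                              # chi^2(2) has mean 2
tail_fraction = np.mean(ratio > -2.0 * np.log(0.05))   # upper 5% point of chi^2(2)
print(mean_ratio, tail_fraction)
```

The critical value uses the fact that $\chi^{2}(2)$ is an exponential distribution with mean 2, so its upper-$\alpha$ point is $-2\ln\alpha$.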

Based on Eqs (5) and (6), consider the following two expressions:

$\displaystyle\frac{2I(\omega_{k})}{f(\omega_{k})}=\frac{2}{f(\omega_{k})}\left|\frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}x(n)e^{-j\frac{2\pi}{N}kn}\right|^{2}$ (7)

and

$\displaystyle\frac{2N}{f(\omega_{k})}\left|X(k)-x(m)e^{-j\frac{2\pi}{N}km}\right|^{2}=\frac{2}{f(\omega_{k})}\left|\frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}x(n)e^{-j\frac{2\pi}{N}kn}-\sqrt{N}x(m)e^{-j\frac{2\pi}{N}km}\right|^{2}.$ (8)
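
The equality in Eq. (8) can be checked directly from Eq. (4), which gives $\sqrt{N}X(k)=\frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}x(n)e^{-j\frac{2\pi}{N}kn}$, so that

$\displaystyle N\left|X(k)-x(m)e^{-j\frac{2\pi}{N}km}\right|^{2}=\left|\sqrt{N}X(k)-\sqrt{N}x(m)e^{-j\frac{2\pi}{N}km}\right|^{2}=\left|\frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}x(n)e^{-j\frac{2\pi}{N}kn}-\sqrt{N}x(m)e^{-j\frac{2\pi}{N}km}\right|^{2}.$

By the same substitution, the periodogram in Eq. (7) is $I(\omega_{k})=N|X(k)|^{2}$. Eq. (8) is therefore Eq. (7) with $X(k)$ replaced by the influence measure of Eq. (5); the sign of the difference is immaterial because only its modulus enters.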

If the quantity in Eq. (8) differs significantly from that in Eq. (7), $x(m)$ can be said to be influential on $X(k)$ in Eq. (4). Whether the difference is significant can be judged by reference to the quantiles of $\chi^{2}(2)$. We call the expression in Eq. (8) the influence statistic. A scheme to identify the influential observations on $X(k)$ is summarized as follows:

(i) Given a time series $\{x(0),x(1),\ldots,x(N-1)\}$ and a significance level $\alpha$, where $\Pr[\chi^{2}(2)\geqslant\chi_{1-\alpha}^{2}(2)]=\alpha$, detrend and/or standardize the time series if needed.

(ii) Calculate an estimate $\hat{f}(\omega_{k})$ of the spectral density function $f(\omega_{k})$ of the time series at the dominant frequency $\omega_{k}$.

(iii) For each $x(m)$, $m=0,1,\ldots,N-1$, if

$\displaystyle\frac{2}{\hat{f}(\omega_{k})}\left|\frac{1}{\sqrt{N}}\sum_{n=0}^{N-1}x(n)e^{-j\frac{2\pi}{N}kn}-\sqrt{N}x(m)e^{-j\frac{2\pi}{N}km}\right|^{2}>\chi_{1-\alpha}^{2}(2),$

then identify $x(m)$ as an influential observation on $X(k)$.
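
The scheme above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation, and it makes several assumptions that the text leaves open: the series is only centred (no detrending), the dominant frequency is taken as the largest raw periodogram ordinate among $k=1,\ldots,\lfloor N/2\rfloor$, and that same ordinate is used as $\hat{f}(\omega_{k})$. The function names and the test series are hypothetical.

```python
import math
import numpy as np

def chi2_2df_critical(alpha):
    """Upper-alpha point of chi^2(2); exact, since chi^2(2) is exponential with mean 2."""
    return -2.0 * math.log(alpha)

def influence_statistics(x, k, f_hat):
    """Left-hand side of the final comparison, for m = 0, ..., N-1."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    terms = x * np.exp(-2j * np.pi * k * np.arange(N) / N)
    scaled_sum = terms.sum() / np.sqrt(N)
    return (2.0 / f_hat) * np.abs(scaled_sum - np.sqrt(N) * terms) ** 2

def influential_observations(x, alpha=0.005):
    """Return (k, indices m flagged at level alpha), with 0-based indices."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                              # centring only (assumption)
    N = len(x)
    pgram = np.abs(np.fft.fft(x)) ** 2 / N        # raw periodogram I(omega_k)
    k = 1 + int(np.argmax(pgram[1 : N // 2 + 1])) # dominant frequency index
    stats = influence_statistics(x, k, f_hat=pgram[k])
    return k, np.flatnonzero(stats > chi2_2df_critical(alpha))

# Hypothetical series: a 12-point cycle with one large spike at m = 37.
n = np.arange(120)
series = np.cos(2 * np.pi * n / 12)
series[37] += 8.0

k, flagged = influential_observations(series)
print(k, flagged)
```

With the raw periodogram as $\hat{f}(\omega_{k})$, the statistic reduces to $2|X(k)-x(m)e^{-j\frac{2\pi}{N}km}|^{2}/|X(k)|^{2}$, so it measures influence relative to the size of $X(k)$ itself; a smoothed spectral estimate would be a reasonable alternative.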

3. Data analysis

Two examples, (i) a data set of fine dust levels in a city in Asia and (ii) the retail sales data from Hillmer et al. (1982), were selected for empirical studies. We aim to confirm the usefulness of the proposed method in identifying influential observations through data analysis. After the observations are de-trended and centered, the periodogram is used as an estimator of the spectral density. The R package ‘tsoutliers’ implements the approach described in Chen and Liu (1993) for automatic detection of outliers in time series.

3.1 Fine dust data

A fine dust data set for a city in Asia, from April 2008 to November 2017, is plotted in Fig. 1. The periodogram indicates that the dominant frequency is about 0.083 $\approx$ 10/116 ($=k/N$), that is, the time series is periodic with a period of about 12 months ($=$ 1/0.083). The influence statistics at the dominant frequency, $k=10$, are calculated and plotted in Fig. 2. Observations 9, 19, 30, 33, 36, 71 and 80, which exceed the threshold $\chi_{0.995}^{2}(2)$, can be considered extremely influential or abnormal. Observations 5, 7, 31, 81, 110 and 112, which exceed the threshold $\chi_{0.95}^{2}(2)$, can also be considered influential.
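
The arithmetic behind these figures can be reproduced as follows. The sketch counts the months in the stated date range, evaluates the Fourier frequency $k/N$ for $k=10$, and computes the two $\chi^{2}(2)$ thresholds from the closed form $-2\ln\alpha$; the variable names are illustrative only.

```python
import math

# Months from April 2008 to November 2017, both inclusive.
n_months = (2017 - 2008) * 12 + (11 - 4) + 1

k = 10
omega_k = k / n_months        # Fourier frequency in cycles per month
period = 1.0 / omega_k        # corresponding period in months

# Upper critical values of chi^2(2): Pr[chi^2(2) >= c] = exp(-c / 2).
threshold_95 = -2.0 * math.log(0.05)
threshold_995 = -2.0 * math.log(0.005)

print(n_months, round(omega_k, 4), round(period, 1))
print(round(threshold_95, 2), round(threshold_995, 2))
```

Since 116/12 is not an integer, $k=10$ is the Fourier frequency closest to one cycle per 12 months, which is why the dominant frequency is quoted only approximately.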

tsoutliers reports observations 21 (additive), 30 (additive), 36 (additive), 45 (temporary change), 57 (level shift), 61 (level shift), 80 (additive) and 82 (level shift) as outliers. It was run with a critical value of 3.0, as proposed by Chen and Liu (1993). tsoutliers flags as outliers or influential observations those whose observed values differ significantly from their fitted values.

Although the results of the two methods have several things in common, there are also some notable differences. For example, observation 9 is identified as influential by the proposed method, but not by tsoutliers. In fact, observation 9 has the second largest value. Observations 110 and 112 have relatively low values; the proposed method flags them as possibly influential, while tsoutliers does not.

Figure 1.

The time series of the fine dust levels. The influential observations are marked by numbers and the observations detected by ‘tsoutliers’ are marked as solid circles.

Figure 2.

Influence statistics for the fine dust data when $k=$ 10.

3.2 Monthly retail sales data

The data analyzed by Chen and Liu (1993) are considered. The data set includes the monthly retail sales of various stores from January 1967 to September 1979 and was originally discussed in Hillmer et al. (1982). The monthly retail sales are plotted in Fig. 3. The periodogram indicates that the dominant frequency is about 0.166 $\approx$ 23/141 ($=k/N$). tsoutliers identifies observations 12, 48, 60, 84, 96, 108 and 132 as additive outliers, while the proposed method with $k=23$ identifies eighteen observations that exceed $\chi_{0.995}^{2}(2)$ (Fig. 4), including the seven observations identified by tsoutliers. tsoutliers identifies most December sales as outliers whose values differ significantly from their fitted values. Chen and Liu (1993) log-transformed the original data, but the results were not significantly different when the original data were used, as in this study.

On the other hand, the proposed method identifies all December sales and also some January and February sales in later years as outliers or influential observations (Figs 3 and 4). Unlike previous years, January sales since 1974 showed a relatively large decline compared to December.

Figure 3.

The time series of the sales data. The influential observations are marked by numbers and the observations detected by ‘tsoutliers’ are marked as solid circles.

Figure 4.

Influence statistics for the sales data when $k=$ 23.

4. Conclusion

A method for finding outliers or influential observations that may exist in time series data is designed from a frequency domain point of view. The proposed method is designed to find the observations affecting the dominant frequency. The case studies show that the proposed method performs comparably to the well-known method of Chen and Liu (1993) but is easier to use. The proposed method is expected to provide additional insights for identifying anomalous observations.

References

Brillinger, D.R. (2001). Time series: data analysis and theory, Society for Industrial and Applied Mathematics, Pennsylvania.

Brockwell, P.J., & Davis, R.A. (1991). Time series: theory and methods, Springer-Verlag, New York.

Bruce, A.G., & Martin, R.D. (1989). Leave-k-out diagnostics for time series. Journal of the Royal Statistical Society: Series B (Methodological), 51, 363-401.

Chen, & Liu, L.M. (1993). Joint estimation of model parameters and outlier effects in time series. Journal of the American Statistical Association, 88, 284-297.

Gupta, Gao, Aggarwal, C.C., & Han (2013). Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering, 26, 2250-2267.

Hampel, F.R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 383-393.

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., & Stahel, W.A. (2011). Robust statistics: the approach based on influence functions, John Wiley and Sons, New York.

Hayes, M.H. (2009). Statistical digital signal processing and modeling, John Wiley and Sons, New York.

Hillmer, S.C., William, & George (1982). Modeling considerations in the seasonal adjustment of economic time series, in Applied Time Series Analysis of Economic Data, Washington DC: U.S. Bureau of the Census, 74-100.

Huber (1981). Robust statistics, John Wiley and Sons, New York.

Lefrançois (1991). Detecting over-influential observations in time series. Biometrika, 78, 91-99.

Oppenheim, A.V., Buck, J.R., & Schafer, R.W. (2001). Discrete-time signal processing, Prentice Hall, Upper Saddle River, NJ.

Peña (1990). Influential observations in time series. Journal of Business and Economic Statistics, 8, 235-241.

Ren, Wang, Huang, Kou, & Zhang (2019). Time-series anomaly detection service at Microsoft. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 3009-3017.

Shittu, I.O., & Shangodoyin, D.K. (2008). Detection of outliers in time series data: A frequency domain approach. Asian Journal of Scientific Research, 1, 130-137.

Shumway, R.H., & Stoffer, D.S. (2017). Time series analysis and its applications: with R examples, Springer, New York.