Abstract
Analyzing the temporal behaviors and revealing the hidden rules of objects that produce time series data to detect the events that users are interested in have recently received a large amount of attention. Generally, in various application scenarios and most research works, the equal interval sampling of a time series is a requirement. However, this requirement is difficult to guarantee because of the presence of sampling errors in most situations. In this paper, a multigranularity event detection method for an unequal interval time series, called SSED (self-adaptive segmenting based event detection), is proposed. First, in view of the trend features of a time series, a self-adaptive segmenting algorithm is proposed to divide a time series into unfixed-length segmentations based on the trends. Then, by clustering the segmentations and mapping the clusters to different identical symbols, a symbol sequence is built. Finally, based on unfixed-length segmentations, the multigranularity events in the discrete symbol sequence are detected using a tree structure. The SSED is compared to two previous methods with ten public datasets. In addition, the SSED is applied to the public transport systems in Xiamen, China, using bus-speed time-series data. The experimental results show that the SSED can achieve higher efficiency and accuracy than existing algorithms.

Introduction
A time series, as a sequence consisting of an observed value and the recorded time, frequently appears in many domains, such as electric power [1, 2], finance [3], agriculture [4], and traffic [5]. Analyzing the time series data can be helpful to discover the temporal behaviors and reveal the hidden rules of the observed object. In recent years, time series analysis has focused on mining user patterns and the events the users are interested in rather than whole time series [6, 7, 8]. Detecting events has become a necessary step for other time series mining tasks, such as classification [9], prediction [10, 11], rule analysis [12], and anomaly detection [13, 14]. An event can be defined as a pattern that frequently occurs in the time-series data [15, 16]. For example, as shown in Fig. 1, segmentations
Example of an event.
Most existing works detect events based on directly calculating the similarity distance of the segmentation pairs, and these approaches are called exact methods. Exact methods first divide the original time series into several segmentations using a sliding window and then obtain the similarity distance between all pairs of segmentations using similarity distance measures such as the Euclidean distance and dynamic time warping (DTW) [17]. Finally, the segmentation pairs, for which the similarity distance is less than the threshold given by the user, will be marked as candidate events.
However, equal interval sampling is the default assumption in a time series analysis. In most application scenarios, it is difficult to guarantee that the sampling interval is equal because of the sampling error. For example, the largest challenge of the exact methods is the scalability of the detected events because fixed-granularity events can only be detected from equal interval time series. In time-series analysis, the time granularity denotes the time span of a segmentation [18, 19]. The typical way to solve this problem is to enumerate the range of granularity given by a user. However, enumerating granularity may cause not only a significant increase in execution time but also some important events to be missed.
To improve the accuracy, approximate methods are used to detect the events by transforming an original time series into a discrete-symbol sequence and then mapping back to the original time series segmentations. For example, SAX (symbolic aggregate approximation) [20] is widely used because of its simplicity and efficiency and can reduce the effects of noise. Through symbolic representation, the events within different lengths can be obtained in the solution. However, the time spans of the events are required to remain equal. Otherwise, the methods may obtain a lower accuracy, even compared to the exact methods [21].
To solve this problem, in this paper, a novel event detection method, SSED (self-adaptive segmenting-based event detection), is proposed. A self-adaptive segmenting algorithm is presented to divide a time series into unfixed-granularity segmentations according to the trend in the time series. All the segmentations are mapped to alphabetical symbols, and the original time series is represented as a sequence of symbols. Then, the SSED detects multigranularity events in the symbol sequence using a tree structure that adopts a novel symbolic representation. In summary, our work has two contributions: 1) a self-adaptive segmenting algorithm is proposed to improve the accuracy of identifying the events in an unequal interval time series; and 2) a novel symbolic representation based on the trend information is proposed that can detect the events with a wider range of granularity in linear time complexity. Compared with existing approximate methods, the SSED can achieve higher accuracy and efficiency than existing algorithms.
The main contributions of this work are summarized as follows.
Self-adaptive segmenting algorithm based on edge points is proposed to divide a time series into unfixed-length segmentations. At the stage of time series segmenting, based on the trend instead of fixed-length time window used in most of the existing methods, SSED divides time series into segmentations. This is because some important information on a curve, such as inflexion may be missed by using fixed-length time window. Therefore, self-adaptive segmenting algorithm can keep complete trends of time series as much as possible.
SSED uses a minimal set of five shape features to represent the maximum amount of change trend in the segmentation. Approximately representing the segmentations by extracting the features of each segmentation can eliminate the effect of noises on performance of clustering. In addition, five shape features reduce segmentation dimensions so that computational complexity can be highly reduced.
The rest of the paper is organized as follows. In Section 2, we present a survey of related work. In Section 3, we give the details of some definitions and notations that we address. Section 3 also presents a mathematical description of the problem. The details of SSED are presented in Section 4. The SSED has been evaluated using different data sets. The evaluation criteria, the design of the experiments, and the results are presented in Section 5. Finally, in Section 6, we present a summary and discuss what we intend to do in future work.
In time series analysis, event detection is an important research issue. It discovers sequence patterns [15] and reveals the temporal behavior of the objects that produce the data [22]. Therefore, event detection is applied to many domains, such as monitoring systems [17], healthcare [22] and geology [23], in which it is more helpful for discovering hidden knowledge.
Some researchers have focused on detecting events by directly calculating the similarity distance between segment pairs, called the exact method. The typical exact method is BF (brutal force) [24]. The main idea of BF is to find motifs by calculating the distance between segments obtained by the sliding window. BF tests all the combinations of segment pairs and records the locations of the pairs of segments with distances that are less than the threshold. In [25], Zhu et al. proposed an exact method based on DTW, which can quickly match the similarity segments by a warping matrix. In [26], Truong et al. proposed a novel clustering-based exact method, which can efficiently discover the subsequence pattern under DTW distance. Later, the NCM (normalized crossmatch) method was proposed in [27], which can discover frequent patterns by calculating the score matrix and the position matrix in the normalized time series data. Based on the matrix profile, which is a vector containing the distances between a subsequence and its nearest neighbor, a method called VLMOD (variable length motif discovery algorithm) was introduced in [28], which can find the exact fixed-length patterns by a fast similarity search algorithm introduced in [29]. In [30], the author proposed a efficiency fixed-length exact method called STOMP, which can optimize the time complexity to O (
Differing from the above methods, to detect events more efficiently, some researchers have focused on approximate event detection by transforming the time series to a discrete symbol sequence. In [20], SAX was proposed to represent the segments as symbols by dividing a time series into equal-length segments, using the PAA (piecewise aggregate approximation) coefficients to represent each segment, and mapping the segments to symbol
Event detection, as an important research topic, is to discovery subsequence pattern from time series. In addition, there are other major research topics on time series, including forecast, correlation analysis, and sequential logic analysis, as shown in Table 1.
Research topics on different domains
Research topics on different domains
As described above, to detect the events, dividing the time series into several equal-length segments is the first step in most methods. Exact methods directly calculate the similarity distance of the segment pairs, while approximate methods map segments to alphabetic symbols. If multigranularity events coexist in large-scale time series, iteratively invoking exact methods to detect all events will be a time-consuming task [23]. Approximate methods can efficiently detect the multigranularity events by transforming the sequences into discrete-symbol sequences. However, it achieves lower accuracy compared to exact methods [21]. Therefore, in time series analysis, a multigranularity event detection method with both accuracy and efficiency is required.
A time series represents a collection of values obtained from sequential measurements over time. Therefore, a time series can be defined as follows.
When examining such time series, they are found to often contain events of interest. The goal of mining a time series is to extract these events. Segmenting an original time series or a datastream into events is an essential processing step. To clarify the presentation, the definition of segmentation is given.
To obtain better fitting performance in a segmenting time series, edge points [36] are introduced as the intersections where the trend changes. For example, in the time series shown in Fig. 2, the left side of point
Example of edge points.
The degree of change generally is the extent to which the level and trend are reflective of the difference between points in a time series.
Generally, the degree of the change between any two points may be arbitrary, large or small. The smaller the value of
In a time series, an event is an important occurrence. Importance is application dependent. In a bus-speed time series, for instance, a traffic block is defined as an event. In this paper, we associate events with the frequent patterns in a time series. The more frequently a pattern appears in a time series, the more likely it is to be an event. The definition of an event is as follows.
The occurrences of
Given a time series
To detect the events with multigranularity, splitting the original time series into segmentations of different sizes is the first step in our work. Then, the symbolization processing of the segmentations is performed to reduce the complexity of the computation.
Self-adaptive segmenting based on edge points
To segment a time series into small sequences with different granularities, the method of extracting the edge points is the first crucial processing step. As described in Definition 5, whether a point can be identified as an edge point depends on the edge amplitude of the point. The rate of change in the time series is the degree to which the observed value changes with time. To measure the degree of the trend difference between two adjacent segmentations, the edge amplitude is used, as described in Definition 4. Therefore, for any two segmentations
where
Taking into account the timestamp, the process of extracting the edge points can support the unequal time intervals of a time series. Therefore, self-adaptive segmentations can be obtained via Eq. (1).
However, the adjacent points of which the edge amplitudes are greater than threshold
Example of time series data.
Edge amplitude of every point.
To improve the quality of segmenting, a self-adaptive segmenting algorithm is proposed to select the edge point with maximum edge amplitude if there are multiple consecutive edge points in a limited time span. Given two edge points
In the self-adaptive segmenting algorithm, the value of adjacent segmentation should be provided to calculate edge amplitude. This value depends on the degree of the change in a time series according to domain knowledge. For example, in public transportation domain, adjacent segmentation could be set to 30 seconds if sampling interval is 15 seconds, because velocity time series fluctuates greatly. In temperature forecast domain, as temperature time series is stable in most cases, adjacent segmentation could be set to a relatively large value, such as 4 hours or more.
Generally, mapping a real-valued time series into a symbol sequence is a necessary step that can eliminate noise in the original time series data and reduce the computational complexity of detecting events. The segmentations resolved by Algorithm 1 in the last section are represented as symbol sequences.
To symbolize segmentation more accurately, we adopt a strategy in which all similar segmentations are mapped to one symbol sequence through clustering methods. The first step is to approximately represent the segmentations by extracting the shape features of each segmentation. The second step is to divide the segmentations into different sets by the clustering method and then represent them as symbols. The degree of similarity between any two segmentations is measured by the shape feature distances. This strategy can reduce the dimensions and eliminate the influence of noise.
To describe the trend of a time series, the five shape features of a segmentation include the mean value, average change rate, standard deviation, first point of segment and last point of segment in our work. Given a segmentation
In addition to the five features above, there are two points of any segmentation
Before clustering, the feature values should be normalized to eliminate their difference, as the five features have different units. In this paper, the z-score is used for normalization, which normalizes the value into numbers between
where
Given a segmentation set
The shape features of the segmentations
An example of the symbolization of segments.
After normalizing the feature set, we use BIRCH (balanced iterative reducing and clustering using hierarchies) [16] to divide the
As there are five features used in a segmentation in
Taking the segmentations as an example, the time series is split into six segmentations, from
If the identified events have different durations, this is so-called multigranularity events. For example, symbol sequences (
Multigranularity events.
In the symbol sequence representing the time series, each symbol corresponds to a set of segmentations with similar shapes. This correspondence means that each symbol has a high degree of support. If the degree of support is greater than minFreq, the symbol is an event we want to detect. In fact, the events in a time series may not be mutually exclusive and often show mutually inclusive relationships. For example, event (
In our method, a tree structure called the suffix tree is adopted. In a suffix tree, the root node only serves to index the child nodes, while the nonroot node stores two kinds of information: a subsequence whose frequency is greater than minFreq and the set of starting positions of the subsequence in the symbol sequence. For the general definitions, a node in the tree is represented by
Given a symbol sequence
To build the suffix tree, nine steps should be followed.
Scan symbol sequence Initialize suffix tree Take the new leaf nodes as the iterative node set INS; Extract a leaf node Filter the sequences with frequencies less than minFreq in the suffix sequence set; if the suffix sequence set is empty, go to step 7; Generate new nodes according to the suffix sequence set and insert them into Remove Judge if there are new nodes generated in this iteration, go back to step 3; Output the root node of the suffix tree.
The pseudocode of the building suffix tree is shown in Algorithm 2.
For example, given a symbol sequence (
Experiment on public transport data
To evaluate the performance of SSED, we take the public transport system in Xiamen, China, as a case study to analyze our method. In this system, the speed value of buses is normally recorded every 15 seconds via GPS (global position system). Therefore, a speed time series is formed for each running bus. However, a large number of time series with unequal time intervals is found. Actually, a wide range of time intervals, from 1 to 32 seconds, is found in the original data because of the latency of the software platform or other reasons. This situation cannot be ignored.
We extract a part of the dataset including the speed time series of 29 buses belonging to line 953 during July 2018. The total length of the time series is 200657, and the sampling intervals vary from 1 second to 32 seconds. The format of the data is described in Table 3.
Data format of the speed time series
Data format of the speed time series
An example of a suffix tree.
The experiment has three steps: 1) dividing the time series into unfixed-granularity segmentations by a self-adaptive segmenting algorithm; 2) clustering the segmentations using BIRCH and symbolizing each cluster; and 3) detecting multigranularity events in the symbol sequence of the time series.
Step 1) Divide the time series into unfixed-granularity segmentations by a self-adaptive segmenting algorithm.
To use the self-adaptive segmenting algorithm, three parameters, window size
A part of the segmentations with edge points in speed time series.
Step 2) Clustering the segmentations using BIRCH and symbolizing each cluster.
To approximate the segmentations, we extract and normalize the shape feature. BIRCH clustering is used in our method. To obtain better clustering effect, the analysis of Calinski-Harabasz [45] is used to try the values of parameter
Cluster set
Step 3) Detecting the multigranularity events in the symbol sequence of the time series.
Algorithm 2 is used to detect the multigranularity events in the symbol sequence. In this experiment, considering the fact that buses generally run on a regular schedule in most cases, setting the minimum frequency of events to 10 can obtain a moderate degree of support. Of course, threshold minFreq can be set higher as well so that events with higher degree of support can be obtained. Then, the symbol sequence is mapped back to the segmentations of the original time series as the final event set. In the final event set, there are 3072 events together with a range of granularities from 15 seconds to 342 seconds. The four events with instances, each of which is composed of the segmentations of the time series, are provided, as shown in Fig. 9.
Four events detected by SSED in speed time series.
To evaluate the results on the Xiamen bus data, the experiment is designed as follows. First, a typical event is detected from the time series on one bus stop. Then, the typical event, as a standard, is used to measure the percentage by which it is detected from the time series on another bus stops. The percentage
Considering that velocity time series during bus arrival and departure shows similar trends, two kinds of events are selected as the standards in this experiment. For example, event
We applied the SSED on another 10 public datasets from the UCR Time Series Archive [46] to evaluate the scalability of SSED on a range of domains. The UCR Time Series Archive is a popular time series library that collects numerous datasets used in time series analysis research. As all required parameters cannot be directly learnt, the method of grid search is chosen to determine the most parameters as it is widely used for parameter optimization. The characteristics of the test datasets and recommended parameter settings are described in Table 5. In the experiment, we set
Characteristics of the selected datasets and recommended parameters of SSED
Characteristics of the selected datasets and recommended parameters of SSED
Two events detected on the Xiamen bus data.
Events detected by the SSED on the ten datasets.
Due to space limitations, we show only two events resolved on each dataset, as shown in Fig. 11. For example, Fig. 11(1) shows two movement events of cardiac action potential during each cardiac cycle; Fig. 11(2) shows two waves of spectral change that are used to distinguish between Robusta and Aribica coffee beans; Fig. 11(3)–(5) show that two kinds of accelerometer trends in three dimensions, X-, Y- and Z-axis directions, taken from actors performing cricket gestures; The events in the rest of figures are related to location changes or image pixel changes.
For quantitative analysis, two methods, the GrammarViz method proposed in [47] and the STOMP method proposed in [30] are chosen for comparison.
The GrammarViz method is chosen for comparison, as GrammarViz and SSED are both approximate methods. GrammarViz has four steps: 1) dividing a raw time series into several overlapping segmentations with equal length by a sliding window; 2) symbolizing each segmentation by using SAX; 3) using the sequitur grammar inference algorithm to detect the multigranularity events in the symbol sequence; and 4) removing the trivial match in the candidate events.
The comparison is designed as follows. The ten public datasets from Section 6.2 are used. On the parameter settings, for the SSED, the parameters are the same as those given in Table 5. For GrammarViz, three parameters are required: sliding window
The STOMP method is an advanced exact method, which is based on a fast algorithm for similarity search to get the matrix profile of the segmentations in time series and then to match the segmentations. As an exact method, two parameters are required: sliding window (
Parameter settings of GrammarViz and STOMP
Parameter settings of GrammarViz and STOMP
The running environment of the computer includes an Intel(R) Core(TM) i5-6500 with 3.20 GHz, and 16 GB of RAM. The SSED is implemented in Python using PyCharm 2018.2.3.
We compare the accuracy and the efficiency of our method with the GrammarViz method and the STOMP method. The concrete indexes for accuracy and efficiency will be introduced in the following subsections.
Error analysis, as an accuracy analysis, is often used to evaluate the performance of a method. Though the STOMP method is an exact method, errors can still exist as fixed time windows is used at the stage of segmenting. Therefore, the STOMP method cannot be used as a criterion.
To measure the accuracy of detecting events, we define error as the average similarity distance between the segmentations in the same clusters. The value of error can be calculated using the following equation.
where
Through calculating the events in the ten datasets using STOMP and SSED, it is observed that SSED has slightly higher error on eight datasets, as shown in Table 7. This is because STOMP is an exact algorithm that can obtain all similar subsequences. Despite this, SSED still has the same or lowest error on two datasets. This is because the five shape features of a segmentation, including the mean value, average change rate, standard deviation, first point of segment and last point of segment adopted by SSED can preserve the important properties of a time series as much as possible. This is an advantage of SSED.
A comparison of the error and runtime (in seconds) of STOMP, GrammarViz and SSED on different data sets (The execution times is given in brackets)
Through calculating the events in the ten datasets using GrammarViz and SSED, it is observed that GrammarViz has the highest error on all datasets, as shown in Table 7, because GrammarViz adopts SAX to symbolize the segmentations. SAX only uses the mean value to represent the segmentation and ignores the trend of the time series. It can result in the increasing possibility of producing false positives in detecting events. SSED achieves the lowest errors on all datasets, as the five shape features of a segmentation are considered to describe the trend of the time series. Therefore, we can conclude that the SSED can obtain higher accuracy than the other approximate methods.
The efficiency represents the execution time of a method. To measure the efficiency of detecting events, we compare the execution time with two other methods, GrammarViz and STOMP.
For STOMP, two parameters are required, sliding window
The efficiency of detecting the events in ten datasets is shown in Table 7. It is observed that the SSED has the highest efficiency compared to the other two methods because SSED detects the events in the symbol sequence rather than the original time series. This detection can dramatically decrease the computational cost. STOMP takes the longest time to enumerate the events in the ten datasets, although the number of iterations is reduced by enumerating within some ranges. In the largest dataset ShapesAll, STOMP requires nearly 10 hours to process the time series with a length of 307800, because more running time is needed for exact methods. The execution time for SSED is just 2.924 seconds.
GrammarViz is an approximate method and uses an inverted index to detect the events in a symbol sequence. Although the execution time of GrammarViz is faster than that of STOMP, it is still longer than that of SSED because GrammarViz obtains all the overlapping segmentations by using a sliding window in the discretization step to avoid losing segmentations. As a result, the dimensionality is almost unchanged after the discretization process. This outcome significantly increases the time consumption in symbolizing segmentations and detecting events. Moreover, GrammarViz requires extra time to eliminate the trivial match from the candidate sets. Therefore, we can conclude that the SSED can obtain higher efficiency than the approximate method and exact method.
Precision and recall
We use precision and recall to evaluate the performance of SSED. Precision is the fraction of identified events that are correct events. Assume that we have identified
The experiment is designed as follows. We create three events the trends of which are simple, medium and complex respectively, as shown in Fig. 12. Each event contains 100 instances. Three synthetic datasets in each of which there is a random walk time series with length 10000 are generated, denoted as dataset1, dataset2, and dataset3. The characteristics of synthetic datasets are shown in Table 8. The 100 event instances of Event
Characteristic of synthetic datasets
Three types of events used in synthetic experiments.
We compare the precision, recall and
Parameter setting of GrammarViz and SSED on synthetic datasets
Comparison of precision and recall between GrammarViz and SSED on synthetic datasets
SSED calculates the edge amplitude of all points in the time series with length
Conclusion
In this paper, a novel method, SSED, is proposed to more efficiently detect the multigranularity events in a time series with unequal time intervals. In order to solve the noise problem caused by unequal interval time series, the five shape features extracted from the trend of a time series are used. Unlike the fixed-granularity detection methods, SSED can detect multigranularity events within a linear time complexity. Additionally, a novel symbolic representation based on the trend information is proposed that can detect events with a wider range of granularity with a linear time complexity. The SSED is of higher precision, efficiency and stronger anti-noise-interference ability.
In future work, considering that real-time data analysis instead of historical data is needed in some domains, we plan to extend the SSED as an incremental method to detect events from the data stream in real time.
Footnotes
Acknowledgments
The project is supported by the Fujian Province Science and Technology Plan, China (grant number 2019H0017); the Quanzhou City Science & Technology Plan, China (grant number 2018C110R). Thanks to Jiang Peizhou from the Xiamen GNSS Development & Application Co. Ltd.
