Abstract
A family of algorithms for time series classification (TSC) involve running a sliding window across each series, discretising the window to form a word, forming a histogram of word counts over the dictionary, then constructing a classifier on the histograms. A recent evaluation of two of this type of algorithm, Bag of Patterns (BOP) and Bag of Symbolic Fourier Approximation Symbols (BOSS) found a significant difference in accuracy between these seemingly similar algorithms. We investigate this phenomenon by deconstructing the classifiers and measuring the relative importance of the four key components between BOP and BOSS. We find that whilst ensembling is a key component for both algorithms, the effect of the other components is mixed and more complex. We conclude that BOSS represents the state of the art for dictionary-based TSC. Both BOP and BOSS can be classed as bag of words approaches. These are particularly popular in Computer Vision for tasks such as image classification. We adapt three techniques used in Computer Vision for TSC: Scale Invariant Feature Transform; Spatial Pyramids; and Histogram Intersection. We find that using Spatial Pyramids in conjunction with BOSS (SP) produces a significantly more accurate classifier. SP is significantly more accurate than standard benchmarks and the original BOSS algorithm. It is not significantly worse than the best shapelet-based or deep learning approaches, and is only outperformed by an ensemble that includes BOSS as a constituent module.
Introduction
A family of algorithms for time series classification (TSC) involve constructing a dictionary of words from the set of time series then forming a bag of words over that dictionary for each of the time series. More specifically, they run a sliding window across each series, discretise the window to form a word, form a histogram of word counts over the dictionary, then constructing a classifier on the histograms. A recent evaluation of two of this type of algorithm, Bag of Patterns (BOP) and Bag of Symbolic Fourier Approximation Symbols (BOSS), found a significant difference in accuracy between these seemingly similar algorithms. We investigate this phenomena by deconstructing the classifiers and measuring the relative importance of the four key differences between BOP and BOSS. We find that ensembling makes both approaches significantly more accurate, but the effect of the other three components is more complex.
Both BOP and BOSS can be classed as bag of words approaches. These are particularly popular in Computer Vision for tasks such as image classification. Converting approaches for 2-D image classification to 1-D series classification from a range of domains requires careful engineering. We adapt three techniques used in Computer Vision for TSC: Scale Invariant Feature Transform; Spatial Pyramids; and Histrogram Intersection. We find that using Spatial Pyramids in conjunction with BOSS (SP) produces a significantly more accurate classifier. SP is significantly more accurate than standard benchmarks and the original BOSS algorithm. It is not significantly worse than the best shapelet-based approach or a residual deep learning network, and is only outperformed by HIVE-COTE, an ensemble that includes BOSS as a constituent module.
The rest of this document is structured as follows. Section 2 provides an overview of related work, from a broad range of TSC algorithms in Section 2.1, to dictionary-based approaches in particular in Section 2.2. We provide an overview of the Computer Vision framework for bag of words classification in Section 3. Section 4 presents the results of our deconstruction of BOP and BOSS and Section 5 describes our evaluation of enhancements to BOSS. We conclude with Section 6.
Related work
TSC background
A recent experimental study [2] compared and evaluated a diverse set of eighteen TSC algorithms that have been published in leading journals and conferences in the last five years. They proposed the following taxonomy of algorithms.
Algorithms based on raw series
Techniques based on raw series compare two series either as a vector (as with traditional classification) or by a distance measure that uses all data points. In the latter case, measures are typically combined with one-nearest-neighbour (1-NN) classifiers and the simplest variant is to compare series using Euclidean Distance. However, this baseline is easily beaten in practice, and most research effort has been directed toward finding techniques that can compensate for small misalignments between series using specialised elastic distance measures. The almost universal benchmark for whole series measures is Dynamic Time Warping (DTW) but numerous alternatives have been proposed. The most accurate whole series approach (according to the bakeoff comparison [2]) is the Elastic Ensemble (EE) [21], an ensemble of 1-NN classifiers using various elastic measures, including DTW, combined through a proportional voting scheme.
Interval-based algorithms
Rather than use the raw series, the interval class of algorithm select one or more phase-dependent intervals of the series. At its simplest, this involves a feature selection of a contiguous subset of attributes. However, the three most effective techniques generate multiple intervals, each of which is processed and forms the basis of a member of an ensemble classifier [11, 6, 5]. There is no significant difference in accuracy between these approaches, and the simplest is the Time Series Forest (TSF) [11].
Shapelet-based algorithms
Shapelet approaches are a family of algorithms that focus on finding short patterns that define a class and can appear anywhere in the series. A class is distinguished by the presence or absence of one or more shapelets somewhere in the whole series. Shapelets were first introduced in [29]. The two leading ways of finding shapelets are through enumerating the candidate shapelets in the training set [22, 16] or searching the space of all possible shapelets with a form of gradient descent [15]. The bakeoff found that the shapelet transform algorithm used in conjunction with a heterogeneous classifier ensemble (ST-HESCA) is the most accurate approach on average.
Dictionary-based algorithms
Shapelet algorithms look for subseries patterns that identify a class through presence or absence. However, if a class is defined by the relative frequency of a pattern, shapelet approaches will be poor. Dictionary approaches address this by forming frequency counts of repeated patterns. They approximate and reduce the dimensionality of series by transforming into representative words, then compute similarity by comparing the distribution of words. Three of the approaches that have been published in the data mining literature are: Bag of Patterns (BOP) [20]; the Symbolic Aggregate Approximation Vector Space Model (SAXVSM) [28]; and the Bag of Symbolic Fourier Approximation Symbols (BOSS) [26]. We provide an overview of these algorithms in Section 2.2.
Spectral-based algorithms
The frequency domain will often contain discriminatory information that is hard to detect in the time domain. Methods include constructing an autoregressive model [9, 1] or combinations of autocorrelation, partial autocorrelation and autoregressive features [3]. An interval-based spectral ensemble called Random Interval Spectral Ensemble (RISE) was proposed in [23] and shown to be more accurate on average than whole series spectral approaches.
Combinations
Two or more of the above approaches can be combined into a single classifier. For example, an approach that concatenates different feature spaces is described in [17], forward selection of features for a linear classifier is the method adopted in [13]) and transformation into a feature space that represents each group and ensembling classifiers together formed the basis of the Flat-COTE classifier [3]. A modular meta-ensemble of classifiers from each class of algorithms (EE, TSF, BOSS, ST-HESCA and RISE) called HIVE-COTE is currently the state of the art classifier for TSC when evaluated on the UCR/UEA data and simulated problems [23]. However, on individual problems, there is a wide variation between the classifiers, and the ensemble is not always the best approach. The nature of the discriminatory features will dictate the best class of algorithm.
Our basic assumption is that dictionary classifiers will be best for problems where classes are defined by the frequency of occurrence of a shape in each series rather than its binary presence or absence. For example, suppose data contains short sine waves that repeat at random intervals. In one class there are many repeating patterns, in another class there are few.
A whole series and an interval approach will fail because the positioning of the repeating patterns is random. Shapelets will not detect this phenomena because they look for the presence or absence of a pattern. Spectral approaches may do better, but not if there are large intervals between the signals. A dictionary approach should be able to detect that one pattern occurs more frequently in one class than the other. Our objective here is to develop the best dictionary-based TSC algorithm.
We describe the state of the art by summarising previously published, freely available and reproducible results.1
see

To compare multiple classifiers on multiple problems, following the recommendation of Demšar [10], we use the Friedmann test to determine if there were any statistically significant differences in the rankings of the classifiers. However, following recent recommendations in [7, 14], we have abandoned the Nemenyi post-hoc test originally used by [10] to form cliques (groups of classifiers within which there is no significant difference in ranks). Instead, we compare all classifiers with pairwise Wilcoxon signed rank tests, and form cliques using the Holm correction (which adjusts family-wise error less conservatively than a Bonferroni adjustment).
HIVE-COTE is the most accurate algorithm over all, but features of these results excited our interest about dictionary classifiers. Firstly, BOP and SAXVSM performed very poorly. Neither is significantly better than 1-NN Euclidean distance and neither could beat the benchmark classifiers rotation forest and DTW. In stark contrast, BOSS is one of the best performers. It is not significantly worse than the ST-HESCA and only beaten by the two meta ensembles Flat-COTE and HIVE-COTE. HIVE-COTE contains BOSS whereas Flat-COTE does not, and the fact that HIVE-COTE is significantly better than Flat-COTE is further evidence in support of BOSS. On a head to head comparison, BOSS beats BOP on 80 of the 85 datasets. The mean difference in accuracy is over 8%. These algorithms are seemingly similar, so why is BOSS so much better than BOP? Answering this question requires a more in depth understanding of how these algorithms work.
Dictionary-based algorithms share the same basic structure. In summary, a window of length
There are four key stages at which major differences between dictionary based algorithms may arise:
the compression method to get from the discretisation technique used to convert the the methods of representing the collections of transformed subseries; and the distance measure used to compare histograms and/or the classification algorithm used to classify new cases.
BOP (described in Algorithm 2.2.1) is a dictionary classifier built on the Symbolic Aggregate Approximation (SAX) [19] algorithm. SAX reduces
[!ht] buildClassifierBOP (A list of
The Symbolic Aggregate Approximation – Vector Space Model (SAXVSM) [28] combines the SAX representation used in BOP with the Vector Space Model commonly used in Information Retrieval. The key differences between BOP and SAXVSM is that SAXVSM forms word distributions over classes rather than series and weights these by the term frequency/inverse document frequency (
where
Bag of Symbolic Fourier Approximation Symbols (BOSS)
BOSS also uses windows to form words over series, and represents them in a simple histogram format, but it has several major differences to BOP and SAXVSM. BOSS uses a truncated Discrete Fourier Transform (DFT) instead of a PAA on each window. Another difference is that the truncated series is discretised through a technique called Multiple Coefficient Binning (MCB), rather than using fixed intervals. MCB finds the discretising break points as a preprocessing step by estimating the distribution of the Fourier coefficients. This is performed by segmenting the series into disjoint windows, performing a DFT, then finding breakpoints for each coefficient such that each bin contains the same number of elements. The whole process of forming words is called Symbolic Fourier Approximation (SFA). BOSS then involves similar stages to BOP; it windows each series to form the term frequency through the application of DFT and discretisation by MCB, performs numerosity reduction, and forms histograms of the words in each series. A bespoke distance function is used for nearest neighbour classification. This non symmetrical function only includes distances between frequencies of words that actually occur within the first histogram passed as an argument, which refers to the test case.
Another major difference is that BOSS forms an ensemble by retaining all classifiers with training accuracy within 92% of the best during the parameter search of window sizes. New instances are classified by a majority vote of the resulting ensemble. Algorithm 2.2.2 details the construction of histograms for a given parameter set.
[!ht] buildClassifierBOSS (A list of
In a manner reminiscent of the way SAXVSM adapts BOP, BOSS-Vector Space (BOSS-VS) [27] modifies BOSS to form class histograms rather than instance histograms. Switching to class histograms massively reduces the memory requirements and speeds up classification. However, it has no significant effect on accuracy, unless it is to reduce it (see the results in [27]). In this work we are concerned with classification accuracy. The questions we address are, firstly, why is BOSS so much better than BOP (see Section 4) and secondly, can we refine BOSS to make it more accurate (see Section 5).
Computer Vision bag of words framework
The histogram approach used by dictionary classifiers has similarities to many approaches used in the field of Computer Vision. A typical Computer Vision bag of words framework involves the following stages:
extraction of keypoints; description of keypoints; bag forming; and classification based on bags.
BOP and BOSS extract keypoints through sliding a window over the whole series, reducing the size of the number of keypoints through numerosity reduction and a restriction of window sizes. However, approaches for dictionary-based TSC more in line with the Computer Vision approach have been proposed. [4] describes an approach for using Scale Invariant Feature Transform (SIFT) [25] features for use in TSC with dictionary classifiers. We describe this approach in detail in Section 3.1 and have implemented a version in the WEKA based TSC codebase.2
We also consider a common technique in Computer Vision called Spatial Pyramids, proposed in [18] and described in Section 3.2. We try incorporating this as a wrapper for BOSS. It could equally be applied to other dictionary approaches.
A more complex Computer Vision approach applied to TSC is proposed in [30]. This involves a combined approach of peak finding and hybrid sampling to extract keypoints. It uese Histogram of Oriented Gradients (HOG-1D) and Dynamic Time Warping-Multidimensional Scaling (DTW-MDS) to form features describing the keypoints. It then clusters them with a
The Bag of Temporal SIFT Words (BOTSW) algorithm [4] adopts a version of the Computer Vision bag of words framework that is easier to reproduce than that described in [30], not least because the C
Each block is described by two values: the sum of positive gradients in the block and the sum of negative gradients. A feature vector that describes a keypoint at a particular scale is a 2
In [4], a Support Vector Machine was used to classify feature vectors. However, our objective is to assess the utility of the SIFT features in relation to the BOSS features. Hence, to minimize the differences between BOSS and BOP we use 1NN classification. The parameters
The essence of dictionary classifiers is to ignore temporal information through consideration of the recurrence of short subseries. Whilst this will lead to good results in problem domains with repeated discriminatory features, the disadvantage is that in some domains the location in time of a pattern is as important as the pattern itself. Spatial pyramids [18] are a method commonly used in Computer Vision, which will allow us combine temporal and phase independent features. When applied to time series, using a spatial pyramid involves recursively segmenting each series and constructing histograms on the segments.
Starting from the initial histogram across the whole series, histograms on subsections are formed by repeatedly dividing the series
Because of the weighting, similarity between features found at smaller divisions on the series have a more significant effect than those found on a more global scale, as their temporal location becomes increasingly dissimilar. It is also worth noting that a pyramid with one level is equivalent to the basic bag of words, as no division has occurred.
Since BOSS ensembles over different window sizes so that discriminatory patterns of different lengths can all be accounted for, we search for
A series from the BeetleFly dataset, being divided at successive levels with Bags of SFA words being formed for each subsection. H1…7 are combined to form the final feature vector.
[!ht] buildBOSSEnsembleSP (A list of
k in wordLengths() featureSet
L in 2,3 featureSet
featureSets.add (bestWindowFeatureSet, bestWindowAcc) maxWindowAcc
While computing the pyramids is very fast relative to the original production of the SFA words, the additional space complexity is a concern for large datasets as the final elongated histograms will be
A core task in any bag of words/dictionary based technique is to compare the differences between the resulting histograms in order to define class membership. BOP uses Euclidean Distance, SAX-VSM uses Cosine Similarity, and BOSS its own measure which is a slight alteration to Euclidean. We also test the Histogram Intersection similarity measure described in [18] which is used in many different applications involving histograms. For a dictionary and resulting histogram size of
From BOP to BOSS
We perform all experiments using 77 of the datasets at the University of California, Riverside/University of East Anglia (UCR/UEA) time series classification repository [2].4
UCR/UEA TSC Repository:
Both BOP and BOSS tune their parameters through a leave one out cross validation on the train data for a predefined parameter space. The results for BOSS and BOP presented in [2] were obtained using the parameter space defined in the original papers, and these parameter spaces are different. To alleviate this possible source of bias we have altered the BOP search space to match that of BOSS (see Table 1). In a pairwise comparison between the BOP on the old and new parameter space, the latter had higher average accuracy on 44 datasets and worse on 33. There is no significant difference between the old and new versions, and we conclude that we cannot explain the difference between BOP and BOSS on this factor.
Parameter search spaces for BOP and BOSS
BOP and BOSS are identical except for four features.
The window approximation method. Each window of length
The discretisation method. Each value in the approximate series of length
The distance measure. BOP uses 1-NN with Euclidean distance whereas BOSS uses a 1-NN with a bespoke, non-symmetric distance function that ignores zero entries in the test histogram.
The classifier. BOSS is an ensemble of multiple transforms with different parameters, whereas BOP is a single classifier.
To quantify the source of the difference in BOP and BOSS, we assess the relative importance of each of these components by adding each of the four BOSS features into BOP. We then measure their importance to BOSS by in turn replacing each BOSS feature with that used in BOP. This gives us the 10 BOP/BOSS variants listed in Table 2.
BOP and BOSS variants, with the switching of BOSS and BOP features. For clarity, we indicate the variant by the addition or removal of the boss feature. e.g. BOP
Figure 3 shows the average ranks of the 10 variants of BOP and BOSS we have evaluated. Table 3 shows the mean difference in accuracy between the variants. The mean difference between the best and the worst variant is nearly 9%. To be clear, this is the absolute pairwise difference in accuracy; the worst algorithm, BOP
The mean difference in accuracy between BOP and BOSS variants over 77 datasets
Average ranks and cliques of 10 BOP/BOSS classifiers on 25 resamples of 77 data sets.
We can make the following observations from these results. For BOP, using DFT with fixed boundaries, or PAA with MCB discretisation, makes no difference to using SAX. This suggests the benefit of using spectral features only comes when using bespoke bins to discretise. Using BOSS distance for BOP histograms actually makes BOP significantly worse. The only component of BOSS that significantly improves BOP is ensembling. This actually makes BOP significantly better than a single version of BOSS (the mean difference 1.32%). However, BOP ensemble is still significantly worse than the BOSS ensemble. Hence we cannot attribute the difference to the ensembling method alone: removing any one of the four components of BOSS makes it significantly worse. The worst change to make is to switch from DFT features to PAA features. This surprised us, as we had assumed removing ensembling would cause the most harm. Clearly the four features of BOSS that differentiate it from BOP are all required and all interact to produce a better classifier. This highlights that the engineering of algorithms is often not linear, and components can interact in ways that may not be intuitively obvious. This is most clearly observable when comparing the effect of using the BOSS Distance (BD): BD makes BOP significantly worse but BOSS significantly better.
There is some difference in the structures of the resulting histograms of each transform which BOSS Distance is able to leverage over Euclidean Distance in the case of BOSS, but not BOP. Considering the actual bagging process is essentially the same between the two – both extract words from a massively sparse space, and both use numerosity reduction – such an apparently clear, or rather ‘usable’, distinction in the final histograms is striking. BOSS Distance ignores words that do not appear in the test histogram, so for BOP these missing words are seemingly informative. However, for BOSS they are noise and removing them is beneficial. Exactly why this is so is unclear. We suspect this difference is due to the action of MCB, which creates a data driven discretisation. If the underlying distribution diverges significantly from normality, MCB will create a more accurate representation and will hence capture an underlying pattern more accurately. This could lead to the truly informative words being separated from the uninformative noise more successfully, and so the words that are ignored are more likely to be noise. This is just conjecture. Further work and analysis of the resulting histograms would need to be performed to fully understand the mechanisms at work here, however this is beyond the scope of this work.
From these experiments we conclude that BOSS, as described in [26], represents the state of the art for dictionary classifiers that were first introduced in [20]. Our next question is, can we improve on the state of the art?
In Section 3 we described two approaches from Computer Vision that may improve dictionary classifiers: SIFT features [25], adapted for time series as described in [4] and Spatial Pyramids [18] that have not formerly been used in this context. We also described the Histogram Intersection as an alternative distance measure between histograms. We wish to assess whether adding these alternative structures to the state of the art dictionary classifier gives an overall improvement in accuracy. To try and isolate the causes of any observed differences we start with BOSS as the base classifier and make the minimum adjustments.
SP
Average ranks and cliques for five variants of dictionary classifiers.
Scatter plot of accuracies for (a) SP
A comparison of SP-HI to three alternative TSC approaches that are not dictionary-based.
SP
Our main interest here is in understanding and improving dictionary classifiers as a particular representation within TSC. However, for context, we compare SP-HI to three state of the art algorithms from alternative domains: a Residual Network (ResNet) with results taken from [12]; a shapelet transform with a heterogenous ensemble (ST-HESCA) [8]; and an ensemble of classifiers from the different transform domains (HIVE-COTE) [24]. We compare against ST-HESCA and HIVE-COTE as they have been shown to be significantly better than competing methods [2]. The results, shown in Fig. 6, demonstrate that SP-HI is not significantly worse than the competing individual representations. It is significantly worse than HIVE-COTE, but this is to be expected over so many data sets because HIVE-COTE ensembles over different representations, and contains BOSS as one of its components. On average, SP-HI is competitive with state of the art classifiers from different problem domains.
Dictionary classifiers are an important class of TSC algorithm that explicitly use the frequency of occurrence of repeating patterns as classification features. A previous study observed a huge difference in accuracy between two prominent approaches, BOP [20] and BOSS [26]. In order to investigate why this is so, we identify the four key differences between the two algorithms and assess their importance to both BOP and BOSS. We find that only one of these features, ensembling over different parameter values, is beneficial to both BOSS and BOP. Ensembling has proven successful in other domains for TSC [3], and it carries no train time overhead if a parameter search is being conducted. BOP with an ensemble is significantly better than a single BOSS classifier. Hence, we would recommend anyone assessing a new TSC algorithm attempt to ensemble, not least to make a better comparison to the state of the art. For example, it is quite possible that HOG-1D
However, there is more to BOSS than the ensemble. The three other distinguishing features: the use of the Fourier transform, data driven discretisation and bespoke distance measure, all have a significant effect overall. This demonstrates that algorithm design is not always a linear process; algorithm components interact in surprising ways. This is most clearly illustrated with distance measures. The BOSS distance makes BOSS significantly more accurate, but it makes BOP significantly worse. The importance of the distance function is further demonstrated with our experiments involving histogram interaction (HI) distance and two alternative dictionary classifiers, bag of temporal SIFT features (BOTSW) and BOSS with Spatial Pyramid (SP). Using HI made BOTSW significantly worse, but it improved SP (albeit not significantly).
Ensembling does have a memory overhead, as each base classifier must be stored. This is particularly memory intensive for histogram-based nearest neighbour classifiers such as BOP and BOSS, and it would be useful to have an algorithm that did not require storing all the histograms in the ensemble. We have experimented with using alternative less memory intensive base classifiers such as C4.5, but this significantly reduced accuracy and massively increased the time to build the classifier. A SVM approach may yield a better classifier with lower memory overhead, but our preliminary experiments showed that the extra training time made this infeasible for a large number of problems. However, it is possible to pursue this further, perhaps through using a condensed data set and/or a proxy classifier for parameter search.
The new approach to dictionary classifiers that combine temporal and dictionary features by using Spatial Pyramids in conjunction with BOSS and HI is significantly more accurate than the standard BOSS ensemble. However, the improvement is small and mostly on problems where BOSS is not the best algorithm, so it is debatable whether the extra memory overhead required by the spatial pyramid is worth the small improvement. We believe the SP approach will be best when discriminatory shape frequency features are embedded in confounding noise. In this situation, the pyramid will facilitate higher pattern resolution in certain areas of the data. Other techniques may also improve classification for certain data, although this has yet to be conclusively shown.
The challenge for dictionary-based classifiers is to form a qualitative understanding of the type of problems that best suit this approach and to back this understanding with experimental evidence. For example, we could argue that dictionary classifiers will be a good choice of algorithm for classifying long EEG series. This seems reasonable, given BOSS is based on frequency of repetition patterns, but we have no evidence that this is actually the case. It will then be much easier to quantify whether further possible refinements based on techniques used in other fields actually improve accuracy on data for which it is sensible to use a dictionary classifier.
Footnotes
Acknowledgments
This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) [grant number EP/M015807/1] and the Biotechnology and Biological Sciences Research Council (BBSRC) Norwich Research Park Biosciences Doctoral Training Partnership [grant number BB/ M011216/1]. The experiments were carried out on the High Performance Computing Cluster supported by the Research and Specialist Computing Support service at the University of East Anglia and using a Titan X Pascal donated by the NVIDIA Corporation.
