Abstract
Daily human activity recognition using sensor data is a fundamental task for many real-world applications, such as home monitoring and assisted living. One of the challenges in human activity recognition is to distinguish activities that occur infrequently and have less distinctive patterns. We propose a hierarchical classifier that performs two-phase learning. In the first phase, the classifier learns general features to recognise majority classes; in the second phase, it collects minority and subtle classes to identify fine differences between them. We compare our proposal with a collection of state-of-the-art classification techniques on four real-world third-party datasets, which involve different types of object sensors and are collected in different environments and on different subjects, and on six imbalanced datasets from the UCI-Irvine Machine Learning repository. Our results demonstrate that our hierarchical classifier approach performs better than state-of-the-art techniques, including both structure- and feature-based learning techniques. The key novelty of our approach is that we reduce the bias of the ensemble classifier by training it on a subspace of data, which allows identification of activities with subtle differences, and thus provides well-discriminating features.
Introduction
The European Commission has predicted that by 2025, the United Kingdom alone will see a rise of

An illustration of the problem and challenge of recognising minority and subtle activities.
Ambient sensors, including positioning or pressure sensors and RFID sensors, can be deployed to detect the whereabouts of older adults and their interactions with everyday objects. With the support of intelligent algorithms, we aim to infer their current activities (e.g., making a meal or performing personal hygiene) and also detect changes in their health conditions over time. This would enable timely intervention by sending alerts to users, family members, and/or caregivers [9]. There is evidence that some adverse health conditions can be prevented and controlled if detected in a timely manner [1]. This situation reinforces the motivation for developing preventive methods to guarantee a better quality of life.
Recent advances in data mining, machine learning, and deep learning have made it possible to learn complex correlations between low-level sensor data and high-level activities, but it remains challenging to distinguish activities that have both subtle differences and imbalanced distributions, and these can have significant implications in health-related applications. For example, life-threatening situations such as falls, strokes or heart attacks are infrequent and may have subtle differences when compared to the sensor data for other daily activities. Effective recognition of these incidents is central to the robustness of any activity recognition system.
To illustrate the challenge of recognising minority and subtle activities, we use the following example. Figure 1(a) presents the distribution of a set of concurrent activities from two users recorded in a smart home setting [5] and the activity recognition accuracies (measured in F1 scores) from a Support Vector Machine (SVM) with an RBF kernel. This example demonstrates that the SVM RBF can reliably recognise majority activities such as “Sleep” and “Work” and activities with distinct patterns such as “Bed_Toilet_Transition”. However, it performs poorly on (i) distinguishing activities of the same user that occur in the same area (for example, is a user wandering or working in the bedroom?), and (ii) differentiating between users for the same type of activity performed in a public area (for example, is user R1 or R2 leaving the house?). There are two reasons. First, some activities do not occur often; in particular, the leaving and entering home activities account for only 1% of R1's activities and 0.05% of R2's. Hence there are too few samples to train a reliable classifier, and learning their discriminative features can be overwhelmed by the majority classes. Second, these activities can have less discriminative patterns than their majority counterparts; that is, they might activate the same set of sensors but with little difference in distributions. Figure 1(b) visualises the sensor features for these four activities on a 2D plot using t-distributed Stochastic Neighbor Embedding (t-SNE). As we can see, these activities are mixed together and there is no clear boundary between them.
In this paper, we hypothesise that learning good feature representations can help recognise minority and subtle activities. In particular, we address two research questions: what constitutes a good feature representation, and how can it be learnt? To address these, we explore recent representation learning techniques and focus on Dissimilarity Representation (DR), which has achieved promising results in structural pattern recognition in computer vision [7]. We propose a Dissimilarity Representation based Hierarchical Classifier (DRHC) with the aim of learning discriminative features in order to better distinguish minority activities with less distinctive patterns. We have evaluated our technique on third-party datasets and have demonstrated its effectiveness by comparison with (i) state-of-the-art classification techniques, (ii) resampling techniques that target imbalanced datasets, and (iii) other representation learning techniques.
The rest of the paper is organised as follows. Section 2 introduces the existing literature in activity recognition. Section 3 introduces dissimilarity representation and proposes DRHC. Section 4 describes the evaluation methodology and Section 5 presents the evaluation results and discusses the performance of DRHC over the other state-of-the-art classifiers. The paper concludes in Section 6.
Human activity recognition aims to develop methods to understand human behaviour from a series of observations derived from motion, location, physiological signals and environmental information. A general process in human activity recognition is to collect and integrate data from various sensors, extract features, and apply a learning technique to infer activities from the features.
State of the art of sensor-based human activity recognition
Various data- and knowledge-driven techniques have been applied to human activity recognition, including ontological reasoning, Naive Bayes, Decision Trees, Hidden Markov Models (HMM), Conditional Random Fields (CRF), Neural Networks, and Support Vector Machines (SVM) [4,38].
Ye et al. [39] have applied ontologies to support automatic segmentation of sensor data for multi-user concurrent activities. They have employed the Pyramid Match Kernel to separate the activities with similar patterns to a certain degree. This is achieved by calculating the difference of sensor feature distributions in a hierarchical manner. However they still cannot distinguish the users for the same activities; for example, identifying which user is cooking.
van Kasteren et al. [34] apply HMM to model sequential relationships of sensor data and activities. The HMM is trained to obtain three probability parameters, where the prior probability of an activity represents the likelihood of the user starting from this activity; the state transition probabilities represent the likelihood of the user changing from one activity to another; and the observation emission probabilities represent the likelihood of the occurrence of a sensor observation when the user is conducting a certain activity. Although the HMM models the temporal probabilistic relationships between activities and sensor observations well and thus recognises activities successfully, it achieves low accuracies on minority activities [33].
More recently, convolutional neural networks (CNNs) have become a popular approach to automatically extract features from low-level sensor data. Morales et al. [22] introduce transfer learning in wearable activity recognition using CNNs. Kernels in CNNs are supervised neural networks that can act as feature extractors by stacking several convolutional operators to create a hierarchy of progressively more abstract features. CNNs learn a hierarchical representation of the data and have the ability to identify salient patterns in the signal with deeper layers. Morales et al. apply a transfer learning technique to learn model parameters for a classification task by incorporating data from a different but related classification task. They analyse kernel transfer between users in the same application domain (i.e., the same dataset), between application domains, between sensor locations and between sensor modalities.
These techniques have demonstrated promising results in learning complex correlations between human activities and sensor features. However, few of these existing techniques have focused on learning good representations of sensor data so as to further distinguish activities that have subtle differences.
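The role of the three HMM parameters can be made concrete with a small decoding example. The following is a minimal Viterbi sketch, not the implementation of van Kasteren et al.; the activity names, sensor names and all probability values are hypothetical and chosen only for illustration.

```python
def viterbi(observations, states, prior, transition, emission):
    """Return the most likely activity sequence for a sequence of sensor observations."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (prior[s] * emission[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor state r that maximises the path probability into s.
            prob, path = max(
                ((best[r][0] * transition[r][s], best[r][1]) for r in states),
                key=lambda t: t[0],
            )
            step[s] = (prob * emission[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda t: t[0])[1]


# Hypothetical two-activity model (all numbers are assumptions).
states = ["sleep", "cook"]
prior = {"sleep": 0.6, "cook": 0.4}                      # likelihood of starting in an activity
transition = {"sleep": {"sleep": 0.7, "cook": 0.3},      # likelihood of changing activity
              "cook": {"sleep": 0.4, "cook": 0.6}}
emission = {"sleep": {"bed": 0.9, "stove": 0.1},         # likelihood of a sensor firing
            "cook": {"bed": 0.2, "stove": 0.8}}

decoded = viterbi(["bed", "bed", "stove"], states, prior, transition, emission)
```

For long sequences a practical implementation would work with log-probabilities to avoid numerical underflow; this sketch keeps raw probabilities for readability.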
Representation learning
Representation learning has become a crucial task in machine learning. It can be either linear or nonlinear, either supervised (i.e., features are learned using labelled input data), or unsupervised (i.e., features are learned with unlabelled input data). Traditional feature learning aims to learn transformations of the data that make it easier to extract useful information when building a classifier [41]. Within this group, the most popular feature learning is Principal Component Analysis (PCA). This linear unsupervised algorithm transforms feature variables into a smaller number of uncorrelated variables called principal components. Another well-known linear supervised algorithm is linear discriminant analysis (LDA), which finds a linear combination of features that separates two or more classes of objects. It has been successfully used in face recognition [41]. Unlike these approaches, manifold learning is a nonlinear method that learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications [2].
However, little research has addressed the problem of representation learning for human activity recognition. Plotz et al. highlight the idea of feature learning, focusing on two learning techniques: Principal Component Analysis (PCA) and Autoencoder [28]. In the context of activity recognition, PCA can perform poorly because it can miss important nonlinear structures of the data. To tackle this problem, they propose an alternative raw data representation based on the empirical cumulative distribution function of the sample data. Furthermore, Mannini et al. propose the Pudil algorithm based on a sequential forward-backward floating search [21], which is a feature selection method to detect and discard the features that are shown to make minimal contribution to a correct response from the classifier.
Feature selection
The goal of feature selection methods is to find a subset of optimal features that yields high classification accuracy on lower-dimensional data. Feature selection algorithms not only help find a good data representation to improve the recognition of minor activities with less distinctive patterns, but also reduce the computational costs by removing redundant information.
David et al. [15] propose a novel hybrid feature selection algorithm based on the niche overlapping coefficient (FeSNOC). Their proposal combines the k-Nearest Neighbour algorithm and the overlapping coefficient to find the best features. They use the niche overlapping coefficient to estimate the overlap between classes and take it as a measure of similarity between two activity classes. FeSNOC consists of four steps. In the first step, FeSNOC uses the k-Nearest Neighbour algorithm to calculate the classification accuracy using each feature. Then, it computes the overlapping coefficient for each feature and fits a linear relationship between the accuracy and the overlapping coefficient. Based on the fit function obtained in this step, FeSNOC ranks and selects the features. David et al. [15] have shown that their proposal outperforms other filters and hybrid approaches such as Information Gain, One R, Gain Ratio, Chi-Square, and Relief-F.
Wang et al. [35] introduce a discrimination index based on neighborhood cardinality rather than neighborhood similarity classes to measure the uncertainty quantity of the distinguishing ability of a feature subset. This index has similar properties to Shannon entropy. They define joint discrimination index, conditional discrimination index, and mutual discrimination index. These measures are used to calculate the change of distinguishing information caused by the combination of multiple feature subsets. The conditional discrimination index is used to characterise the ability of a subset of features to distinguish samples with different decisions. The idea behind this index is that the smaller the conditional discrimination index, the greater the distinguishing ability of the feature subset. Guo et al. [11] propose to concatenate frequency-based sensor features with TF-IDF features, which demonstrates an improvement over frequency-based features alone.
Xiang et al. [37] use least squares regression (LSR) to enlarge the distance between different classes. To achieve this, they introduce a new technique called ϵ-dragging, which forces the regression targets of different classes to move in opposite directions.
In recent years, many new machine learning techniques have been developed that have proven successful in human activity recognition in smart home environments. Oukrich et al. [25] introduce a Multilayer Perceptron model made up of three layers that uses the back propagation algorithm to recognise activities of daily living in a smart home. They then use the minimal-redundancy-maximal-relevance criterion for feature selection on the observed motion and door sensors.
So far, we have discussed feature selection, which removes redundant features while achieving good classification performance. Another way to reduce dimensionality is feature transformation. In contrast with feature selection, feature transformation methods transform the original features to a new feature subspace. Tao et al. propose a feature selection method that combines LDA and sparsity regularization. They proved that extending the
Imbalanced class distribution
Activity data collected in real-world environments, as presented in Fig. 1(a), can often have imbalanced distributions. This issue occurs when the number of instances of one activity is much lower than that of the other activities. This problem has attracted considerable attention from researchers.
Due to the importance of this issue, a large number of techniques have been developed to address the problem. Feuz et al. [8] propose an intra-class clustering (ICC) technique to learn from imbalanced classes without changing the data distribution. ICC decomposes a large majority class into smaller sub-classes by clustering, which leads to a more balanced distribution. This technique is applied before training the classifier. Each class is individually decomposed into sub-classes, and each instance is assigned a new class label accordingly. This new set of training data is then used to build a classification model. They have designed different strategies for selecting the number of clusters and determining labels for decomposed classes. Their evaluation has demonstrated that creating a more balanced class distribution leads to improved classifier performance. Adding new classes creates new decision boundaries, which improves the performance of high-bias classifiers like Naive Bayes. This work is most similar to ours in terms of dealing with skewed class distributions. The main differences are that we focus on minority and subtle classes and that, instead of separating the classes into more balanced sub-classes, we apply a hierarchical approach to deal with majority and minority classes at different levels.
The performance of classification algorithms can be greatly affected when the dataset is highly imbalanced. A large number of different resampling techniques have been proposed in the literature to deal with the class imbalance problem. Galar et al. [10] have categorised resampling techniques into three groups: undersampling methods, which create a subset of the original dataset by eliminating instances; oversampling methods, which create a superset of the original dataset by creating new instances from existing ones; and hybrid methods, which combine both sampling methods. More et al. [23] have compared different techniques with respect to their effect on the recall on the minority class and the precision on the majority class. They have used a synthetic dataset with 1000 instances and two classes. They conclude that combining SMOTE and Edited Nearest Neighbours (ENN) with a logistic regression classifier, and BalanceCascade, give the best performance. Guo et al. [12] have presented an improved SMOTE algorithm to address the imbalance problem, which uses the Euclidean distance of each minority class to adjust the distribution of all the classes, and generates new synthetic minority-class examples in the neighbourhood of the remaining minority-class examples.
Hierarchical classifiers
Ensembles and hierarchical classifiers are often used to recognise complex activities. Ensemble classifiers are known to increase the accuracy of single classifiers by combining several of them. Galar et al. [10] review the state of the art on ensemble techniques in the framework of imbalanced datasets, with a particular focus on the binary classification problem. They propose a taxonomy that categorises ensemble-based methods according to how they handle cost sensitivity and data preprocessing before training the classifier.
Nguyen et al. have designed a hierarchical HMM (HHMM) to recognise primitive and complex behaviours of multiple people [24]. They construct a unified graphical model composed of a set of HHMMs with data association. Banos et al. [1] present a fusion classification approach called hierarchical-weighted classification (HWC). This model combines the hierarchical decision (HD) technique and majority voting (MV). In HD, the classifiers’ decisions are made in strict order of classification capability, which gives more importance to those classifiers that generally perform better. MV is a democracy-based model in which all the classifiers have the same opportunity to take a decision. HWC is composed of three classification levels. Each classifier has the same opportunity to contribute to the final decision, but the relative importance of each one is ranked through weights based on its individual performance. Their model outperforms other multiclass approaches and improves scalability and robustness with respect to other traditional fusion techniques.
Minority and subtle activity recognition
Problem statement
Recognising everyday routine activities can be challenging, as it involves understanding human behaviour from complex interactions between diverse sensor signals. Similar to [8], we list a formal definition of the terms in Table 1, based on which we define the problem of interest – recognising minority and subtle activities.
Symbols and annotations
Let
For example, in Fig. 1(a), if we set the threshold θ to 0.5, then the activity ‘R1 leave home’ is considered a minor class, as the ratio of its instances to the average number of instances over all the classes is 0.2, while the activity ‘R1 Sleep’ is not considered a minor class, as its ratio is 5.04. A subtle activity is an activity that has a similar pattern to some other activities. There are different ways of characterising pattern representations and assessing the similarity between them. For example, in Table 2, if we take an intuitive approach – calculating the Euclidean distance between the centre points of two activity classes – and set the threshold δ to 0.1, then we can consider the four activities of leaving and entering home of both users as subtle, as their distances are only about 0.001. The thresholds can be configured differently to suit the characteristics of datasets and the requirements of the applications.
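The two thresholds can be illustrated with a short sketch. It assumes that a class is minor when its ratio to the mean class size is strictly below θ, and that two classes are subtle when the Euclidean distance between their centre points is strictly below δ; the class names, counts and centre coordinates below are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def minority_classes(counts, theta=0.5):
    """Classes whose instance count divided by the mean count over all classes is below theta."""
    mean_count = sum(counts.values()) / len(counts)
    return {c for c, n in counts.items() if n / mean_count < theta}


def subtle_pairs(centres, delta=0.1):
    """Pairs of classes whose centre points lie closer together than delta."""
    names = sorted(centres)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if dist(centres[a], centres[b]) < delta]


# Hypothetical instance counts and 2D class centres (assumptions for illustration).
counts = {"sleep": 500, "work": 300, "leave_home": 20, "enter_home": 20}
centres = {"sleep": (0.9, 0.1), "work": (0.1, 0.8),
           "leave_home": (0.5, 0.5), "enter_home": (0.5005, 0.5005)}

minor = minority_classes(counts, theta=0.5)
subtle = subtle_pairs(centres, delta=0.1)
```

In this toy data the two home-transition classes are both minor and mutually subtle, which is the situation the problem statement targets.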
Distance matrix between subtle activities
Dissimilarity representation (DR) represents data as the difference between two objects. It is proposed as a more flexible representation than feature representation, with the purpose of capturing more information about the structure of the objects. The idea behind DR is that objects are given the same class label if their difference is sufficiently small. Hence, it should be easier for the classifiers to discriminate between them. A more formal definition is given as follows [7]:
Given a representation or prototype set
The prototype set R is generally a subset of the training set T. The key idea of prototype selection is to find representative instances from training set. The most common approaches are clustering techniques and learning vector quantisation (LVQ) algorithm [16]. After prototype selection, the original feature space will be mapped to a dissimilarity space where each object is represented as a dissimilarity vector

DRHC workflow.
We can train a classifier on the converted dissimilarity representations, which is dedicated to learning the differences that separate objects in different classes. This is different from feature representation based classification, which aims to learn the correlations between features and classes. We hypothesise that learning the difference between classes can better characterise distinctive patterns of activities and thus achieve higher recognition accuracies.
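The mapping from feature space to dissimilarity space can be sketched as follows: each object is replaced by the vector of its distances to the prototypes. This is a minimal illustration under assumptions, not the authors' code; the function names are hypothetical, cosine distance is used only as an example metric, and feature vectors are assumed to be non-zero.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def to_dissimilarity_space(objects, prototypes, distance=cosine_distance):
    """Represent every object by its distances to each prototype, in prototype order."""
    return [[distance(x, p) for p in prototypes] for x in objects]


# Hypothetical sensor feature vectors and prototypes.
prototypes = [(1.0, 0.0), (0.0, 1.0)]
objects = [(1.0, 0.0), (1.0, 1.0)]
dissimilarities = to_dissimilarity_space(objects, prototypes)
```

The resulting rows have one entry per prototype, so the dimensionality of the new space equals the size of the prototype set rather than the number of sensor features.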
As introduced above, dissimilarity representation can help learn discriminative feature representations, but the problem of imbalanced class distributions remains: the prototypes selected might represent varied patterns for majority classes, while the minority classes might either sit at the boundary of prototype clusters and be considered as noise, or be identified into a small number of prototypes. Either way, existing classifier techniques cannot learn them effectively. To address this problem, we introduce a hierarchical classifier that performs two-phase learning, where the first phase is to learn general features in order to recognise majority classes, and the second phase is to collect minority and subtle classes to identify any fine differences between them. Figure 2 illustrates the main workflow of our approach.
For the prototype selection, we apply a clustering algorithm to each activity separately, and select the centre of each cluster as a prototype. Once the prototypes are selected, we compute the pairwise dissimilarity matrix
Finally on each cluster

Dissimilarity hierarchical classifier algorithm
For classification, given an unlabelled sensor feature vector, DRHC will first calculate the class probability distribution using the classifier in the first phase. We choose the class with the highest confidence score. Then, the unlabelled sensor feature vector will go through the second classification phase, where it is evaluated by each classifier
We hypothesise that the DRHC algorithm can significantly improve the accuracies of recognising minority activities with less distinctive patterns by learning good representations of sensor data. More specifically, we are mainly interested in the following three questions:
Does DRHC outperform the state-of-the-art classifiers in recognising minority and subtle activities?
Does DRHC outperform the existing sampling techniques at targeting minority activities?
Does DRHC outperform the existing representation learning techniques in learning features?
Selection of datasets
We test our algorithm on two types of data taken from collections available to the entire research community: smart home and general machine learning datasets. To evaluate its performance as a classification method and demonstrate its generality, we use six imbalanced datasets from the UCI-Irvine Machine learning repository.
We additionally examine the performance of our algorithm using the Interleaved ADL dataset from the CASAS smart home project from Washington State University [5], referred to as WS. This dataset was collected from a student apartment testbed during the 2009–2010 academic year. The apartment was instrumented with various types of sensors to detect user movements, interaction with selected items, the states of doors and lights, consumption of water and electrical energy, and temperature [5]. This dataset recorded 13 activities performed by 2 individuals. We use a semantic approach to separate sensor data for concurrent activities [39]. There are two main goals of our algorithm on this dataset: (1) distinguishing two users for the same type of activities performed in a public area; for example, whether user R1 or R2 is watching TV, and (2) distinguishing one user’s activities performed in the same area; for example, is the user sleeping or wandering in a room. These two types have been demonstrated as a challenging problem in multi-user concurrent activity recognition [39].
Given that all the datasets have imbalanced distributions of activities, class-based precision, recall and F1 score are taken as indicators of the performance of an algorithm [8]. For comparative purposes, the F1 score represents the trade-off between recall and precision.
The F1 score is the harmonic mean of precision and recall: F1 = 2 · Precision · Recall / (Precision + Recall).
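Class-based precision, recall and F1 can be computed directly from the true and predicted labels. The following is a minimal sketch using the standard definitions; the function name, the label values, and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
def class_f1_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one class at a time."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores


# Hypothetical labels: one majority class and one minority class.
y_true = ["sleep", "sleep", "sleep", "leave"]
y_pred = ["sleep", "sleep", "leave", "leave"]
scores = class_f1_scores(y_true, y_pred)
# Unweighted mean over classes, so each minority class counts as much as a majority class.
class_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
```

Averaging per class rather than per instance is what keeps a frequent activity such as sleeping from dominating the reported score.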
Technique and parameter setup of DRHC
DRHC can be configured with any appropriate distance metric, clustering technique, and ensemble classifier. There is no generally agreed distance metric. Some dissimilarity metrics are more discriminative than others, depending on the complexity of the dataset, and finding a well-discriminating dissimilarity measure is difficult.
There might exist plausible dissimilarities which are defined on different representations and reflect different aspects of the data [36]. We are interested in finding out whether using multiple distance metrics can help capture subtle differences between activities. We hypothesise that training the classifiers in different dissimilarity representations and using an ensemble approach to take the final decision will improve activity recognition accuracy over the basic DR-based classifier. We consider the following commonly used distance measures from the literature: Kullback–Leibler (KL) divergence, Mahalanobis, Cosine, Euclidean, and Bray–Curtis [27].
Let u and v be n-dimension vectors.
KL divergence. Kullback–Leibler divergence is defined as
Mahalanobis. The Mahalanobis distance between two points is
Cosine. The cosine measures the cosine angle between two data points. Cosine similarity is particularly used in the positive space, bounded in
Bray–Curtis. The Bray–Curtis distance computes the compositional dissimilarity between two different sites based on the counts of items in each site, which is commonly used in biology [3]. Here we use it to quantify the difference of activation frequencies of sensors on two activity classes. It is defined as
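The distance measures listed above can be sketched with their standard textbook definitions. The exact forms used in the paper are not reproduced here, so the following should be read as an assumption: KL divergence is taken over probability vectors, Mahalanobis takes a precomputed inverse covariance matrix, and cosine is expressed as a distance (one minus the similarity).

```python
import math


def kl_divergence(u, v):
    """sum_i u_i * log(u_i / v_i); u and v are probability vectors with v_i > 0 wherever u_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)


def mahalanobis(u, v, inv_cov):
    """sqrt((u - v)^T S^-1 (u - v)), where inv_cov is the inverse covariance matrix S^-1."""
    d = [a - b for a, b in zip(u, v)]
    n = len(d)
    return math.sqrt(sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n)))


def cosine_distance(u, v):
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def euclidean(u, v):
    """Straight-line distance between u and v."""
    return math.dist(u, v)


def bray_curtis(u, v):
    """sum_i |u_i - v_i| / sum_i |u_i + v_i|; assumes the denominator is non-zero."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(abs(a + b) for a, b in zip(u, v))
```

With the identity matrix as `inv_cov`, the Mahalanobis distance reduces to the Euclidean distance, which is a convenient sanity check when wiring a new metric into the dissimilarity mapping.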
We have experimented with all the above metrics and with several prototype generation algorithms, including the traditional LVQ, KMeans and DBSCAN clustering algorithms. Among them, the best results are achieved with cosine and DBSCAN, which are reported in the following section.
We have also experimented with different techniques as the base classifier, including SVMs with the linear and RBF kernels, Naive Bayes (NB), K Nearest Neighbour (KNN), Decision Tree (DT), and Random Forest (RF). Each of these techniques has demonstrated promising results in activity recognition [38]. In our experiments, the SVM with the RBF kernel and Random Forest have performed the best. For the sake of computation performance, we select SVM RBF as the base classifier for DRHC and for all the others.
We choose a combined resampling technique – SMOTE (Synthetic Minority Over-sampling TEchnique) followed by Tomek link [23]. That is, we first over-sample minority class instances by creating synthetic examples close to their nearest neighbours, and then remove the majority class instances that are part of a Tomek link. A pair of instances is called a Tomek link if they are each other’s nearest neighbours but belong to different classes. We have compared the performance of this combined sampling technique with other techniques including SMOTE, Edited Nearest Neighbour (ENN) [13], and Repeated Edited Nearest Neighbour (RENN) [23], in which the ENN algorithm is applied successively until it can remove no further points. This combined technique has achieved the best performance.
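The two building blocks of this combined technique can be sketched in a few lines. The code below is a simplified illustration, not the resampling implementation used in the experiments: it uses brute-force Euclidean nearest neighbours, generates a synthetic point by interpolating between a minority instance and one given neighbour, and the helper names and toy data are hypothetical.

```python
import math
import random


def nearest_neighbour(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) that are each other's nearest neighbours but carry different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbour(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbour(j, points) == i:
            links.append((i, j))
    return links


def smote_sample(x, neighbour, rng):
    """Synthetic instance at a random position on the segment between x and a minority neighbour."""
    gap = rng.random()
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbour))


# Hypothetical one-dimensional data: indices 1 and 2 form a Tomek link.
points = [(0.0,), (1.0,), (1.2,), (5.0,)]
labels = ["a", "a", "b", "b"]
links = tomek_links(points, labels)
synthetic = smote_sample((0.0, 0.0), (2.0, 4.0), random.Random(0))
```

After detecting the links, the cleaning step described above would drop only the majority-class member of each pair.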
Comparison process
We performed four stages of comparative evaluation. The first stage compares DRHC with four alternatives that we considered before proposing our final algorithm.
Without Subspace – where instead of clustering misclassified instances, we apply a hierarchical ensemble on the whole set of misclassified instances to assess whether gathering similar instances together will help differentiate them. The assumption is that applying a classifier on each group of misclassified instances will allow for better discrimination of features to separate them.
Without Dissimilarity Representation – where instead of generating the dissimilarity representation, we train the classifier on the sensor feature representation to assess whether the dissimilarity representations approach is necessary.
Without Resampling – where instead of resampling the misclassified instances, we apply the classifier on each cluster to assess whether resampling is necessary to improve the performance of the classifier.
Stage two involves comparison of DRHC against baseline classifiers, with and without Dissimilarity Representation incorporated into the underlying data. To achieve this, we use the same DBSCAN algorithm to generate prototypes and then employ the above mentioned baseline techniques to train on the converted dissimilarity representations. In particular, we compare DRHC to SVM RBF with the “class weight” parameter set to “balanced”. This automatically adjusts weights to be inversely proportional to class size, and has demonstrated promising results in both the handling of imbalanced classification problems and in activity recognition in previous studies [8].
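Weights that are inversely proportional to class size can be computed as below. The sketch uses the common heuristic n_samples / (n_classes × class_count), which is the formula documented for scikit-learn's `class_weight="balanced"` option; whether the experiments used exactly this library setting is an assumption, and the label values are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


# Hypothetical 90/10 split between a majority and a minority activity.
weights = balanced_class_weights(["sleep"] * 90 + ["leave"] * 10)
```

Here an error on the minority activity costs nine times as much as an error on the majority activity, which offsets the 9:1 imbalance in the training data.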
In the third stage, DRHC performance is compared with techniques that focus on imbalanced class distributions. These include the resampling techniques (1) SMOTE, (2) Edited Nearest Neighbour (ENN) [6], (3) Repeated Edited Nearest Neighbour (RENN), in which the ENN algorithm is applied successively until it can remove no further points, and (4) a more recent technique called intra-class clustering (ICC), where instances are clustered and candidate labels are generated to enforce a balanced class distribution within each cluster [8]. For each of these options we take training data with sensor dissimilarity representations and then train an ensemble SVM RBF balanced model. This controls for variability in performance due to the specific classifier used.
DRHC compared to variants of our algorithm
The first three stages are designed to assess the DRHC algorithm against (i) plausible variants of itself, (ii) commonly-used classifiers (with limited or no emphasis on the classification of minor and subtle activities), and (iii) more advanced and specialised classification techniques that might be considered suitable for our purposes. The fourth and final stage is to assess DRHC performance against current state-of-the-art representation learning techniques [2,41]. As in stage 3, we control for classifier variability by using SVM RBF balanced as the underlying classifier. A key aspect of any activity recognition task is an appropriate feature representation of the sensor data, and the design of suitable classifiers [16]. We consider and assess three types of unsupervised feature learning techniques:
PCA, Principal Component Analysis [17], one of the most popular techniques in learning correlations of features. It is a linear unsupervised algorithm that transforms feature variables into a smaller number of uncorrelated variables called principal components.
t-SNE, t-distributed Stochastic Neighbour Embedding [32], is a nonlinear technique for dimensionality reduction. It constructs a probability distribution over pairs of high-dimensional data points in such a way that similar points have a higher probability of being picked than dissimilar points. Then, it defines a similar probability distribution over the data points in the low-dimensional map, and minimises the Kullback–Leibler divergence between the two distributions with respect to the locations of the data points in the map.
Autoencoder, an unsupervised representation learning technique based on neural networks, which has attracted increasing attention in the deep learning [20] community. This technique learns a function
NOC, a recent feature selection technique based on Niche Overlapping Coefficient [26], which has achieved promising results in learning discriminative sensor features [15].
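For reference, the t-SNE objective summarised in the list above can be written out as follows. This is the standard formulation of the technique rather than anything specific to this paper; the symbols (high-dimensional points x_i, their low-dimensional images y_i, the number of points n, and a per-point bandwidth \sigma_i determined by the chosen perplexity) are introduced here for illustration.

```latex
\[
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
\]
\[
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
\]
\[
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]
```

The cost C is minimised with respect to the map locations y_i. Since no class labels appear in it, any class separation in the map arises only from similarity in the input space.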
At each stage, 100 iterations of 5-fold cross-validation are performed on each dataset. For each iteration, we calculate the mean class- and instance-F1 scores for DRHC and the comparison techniques. We test whether DRHC produces higher scores than each alternative (the null hypothesis being that it does not) using one-sided (greater) paired Welch’s t-tests on the class-F1 scores across iterations, with the significance level for p-values set at 5%. All calculations were performed in R version 3.3.2 [14].
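The testing procedure above can be sketched as follows. This is an illustration only: it computes a plain paired t statistic on the per-iteration score differences and approximates the one-sided p-value with a normal distribution, which is reasonable for 100 iterations but is not the exact test run in R. The function name and the synthetic scores are hypothetical and do not reproduce any result from the paper.

```python
import math
import statistics


def paired_one_sided_t(scores_a, scores_b):
    """Paired t statistic for H1: mean(scores_a - scores_b) > 0.

    Returns (t statistic, degrees of freedom, approximate p-value).
    The p-value uses a large-sample normal approximation (an assumption
    made here to stay within the standard library).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation
    t_stat = mean_d / (sd_d / math.sqrt(n))
    p_approx = 1.0 - statistics.NormalDist().cdf(t_stat)
    return t_stat, n - 1, p_approx


# Synthetic class-F1 scores over 100 iterations (illustrative values only)
drhc_f1 = [0.80 + 0.01 * (i % 5) for i in range(100)]
baseline_f1 = [0.75 + 0.01 * ((3 * i) % 7) for i in range(100)]

t_stat, df, p_approx = paired_one_sided_t(drhc_f1, baseline_f1)
significant = p_approx < 0.05  # 5% significance level, as in the text
print(f"t = {t_stat:.2f}, df = {df}, significant = {significant}")
```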
Results and discussion
In this section we discuss the evaluation results and validate that our proposed algorithm finds a suitable data representation that can improve activity recognition accuracies. In each results table, entries are averaged class- and instance-F1 scores over 100 iterations plus or minus standard deviation, with the best performance for a given dataset shown in bold. Starred entries denote statistically significantly inferior performance compared to DRHC. WS is the CASAS home sensor data; HA, HB and HC are the Amsterdam data; the remaining rows are unbalanced classification calibration datasets from the UCI machine learning repository.
We summarise the main findings in terms of the three questions proposed in Section 4:
Does DRHC outperform the state-of-the-art classifiers in recognising minority and subtle activities? Existing classifiers can learn common patterns from majority classes well. All of them, even less sophisticated classifiers like Naive Bayes, achieve good classification accuracies for well-represented activities, but they all tend to misclassify minority and less distinctive activities. DRHC leverages dissimilarity representation and highlights fine and discriminative features between these activities, leading to significantly improved accuracies.
Does DRHC outperform the existing sampling techniques in targeting minority classes? Sampling techniques can help to recognise minority classes, as they improve the base classifier’s performance to a certain degree. However, for minority and subtle activities, sampling techniques alone are not sufficient. DRHC combines sampling techniques with dissimilarity representations, which demonstrates a better capability in learning discriminative features between these activities.
Does DRHC outperform the existing representation learning techniques in learning features? DRHC outperforms the state-of-the-art representation learning techniques. Even though some techniques like t-SNE clearly find a good representation of the data, they are not able to recognise subtle differences, and the classifier is biased by the majority classes.
Variations of DRHC
The results for the first stage of empirical evaluation are given in Table 3. DRHC is compared to three candidate methods that were considered during algorithm development, resulting in significantly improved F1 score performance in 23 of 30 comparisons (

An example of highly similar prototypes in HB.
The difference between DRHC and ‘No DR’ is not significant, which mainly reflects how effective the identified prototypes are. The problem remains of how best to separate prototypes when we have activities that activate the same set of sensors, with the difference in their sensor distribution being almost undetectably small. Such subtle differences can challenge prototype selection. Future investigations include the design and evaluation of prototype selection and distance metrics to better represent and separate subtle differences in these distributions.
A DR-based approach can work effectively if there exists well-defined and separated prototypes for each class, but this is not the case for the problem that we aim to address. We illustrate the challenge through an example in Fig. 3. Here we list two prototypes
Sensor noise, like
DRHC compared to benchmark classifiers on original features (class-F1 vs. instance-F1 scores)
We also compare the current design of DRHC with our previous work [30]: the current design shows an improvement on 7 of the 10 datasets and comparable performance on the remaining 3. We believe that applying a classifier to each cluster helps learn the small differences between activities that activate similar sensors, which are more difficult to learn when trying to separate many different activities at once.
As we can see, clustering the misclassified instances and assigning a classifier to each cluster achieves higher accuracy than using only a set of one-class classifiers. The key reason is that clustering gathers instances that come from different classes but have similar patterns (i.e., a small distance between their feature vectors), and thus the classifier on a cluster is guided to look for differences in features that separate the instances. In comparison, the one-class classifiers still focus on learning the boundary of each class, but for activities that have similar patterns these boundaries can overlap, which does not help to distinguish them.
In the following, we will compare DRHC with the state-of-the-art classifiers, and in each result table we list both the averaged class- (top) and instance-F1 scores (bottom). Because we deal with imbalanced datasets, we focus our discussion on class-F1 scores.
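The reason for focusing on class-F1 scores can be seen in a small worked example. The sketch below assumes that class-F1 is the unweighted mean of per-class F1 scores (macro average) and that instance-F1 is computed over all instances (micro average); these definitions are assumptions made for illustration, as are the function names and the toy labels.

```python
def per_class_f1(y_true, y_pred):
    """F1 score of each class, treating it as the positive class in turn."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def class_f1(y_true, y_pred):
    """Assumed definition: unweighted mean of per-class F1 (macro)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def instance_f1(y_true, y_pred):
    """Assumed definition: micro-averaged F1, which equals accuracy
    for single-label multi-class predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# A classifier that ignores the minority class entirely
y_true = ["Sleep"] * 95 + ["Wander"] * 5
y_pred = ["Sleep"] * 100

macro = class_f1(y_true, y_pred)
micro = instance_f1(y_true, y_pred)
print(f"class-F1 = {macro:.3f}, instance-F1 = {micro:.3f}")
```

Under these assumed definitions, a classifier that never predicts the minority class still scores highly per instance, while its class-level score drops sharply.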
Table 4 contains results for the initial phase of the second stage of empirical evaluation, in which DRHC is compared to the commonly-used classifier techniques including KNN, NB, SVM RBF, and SVM RBFB, without taking dissimilarity representation into account. DRHC exhibits significantly superior performance in 31 of 40 comparisons (
Predicting human activity of WS
Table 5 compares the averaged F1 scores of DRHC and SVM RBF B on each activity in the WS dataset. Even with the ‘balanced’ option, SVM RBF still performs worse at recognising minority classes such as ‘Watch TV’, ‘Eat’, ‘Bath’, ‘Leave Home’ and ‘Enter Home’, and at distinguishing subtle patterns between imbalanced activities like ‘R1 Sleep’, ‘R1 Work’ and ‘R1 Wander in Room’, all of which occur in the same room and often activate the same set of sensors. Figure 4 shows the sensor feature distribution for these three activities. The imbalanced distribution makes learning the difference more challenging: ‘R1 Wander in Room’ accounts for only 0.05% of the whole dataset, while the other two activities are far more frequent, at 24% and 12%. In comparison, DRHC outperforms most of the state-of-the-art techniques and leads to significantly improved overall F1 scores. With two-phase learning, and especially the second phase, DRHC can identify discriminative features that separate ‘R1 Wander in Room’ well from the other two activities.

Sensor feature distribution on the activities ‘R1 Sleep’, ‘R1 Work’, and ‘R1 Wander in Room’.
DRHC compared to benchmark classifiers on DR augmented data (class-F1 vs. instance-F1 scores)
Comparing the results of the state-of-the-art techniques with and without dissimilarity representations (Tables 4 and 6), performance varies between datasets. On the smart home datasets, sensor and dissimilarity representations achieve comparable F1 scores, within 5% deviation. On the six machine learning datasets, the difference is more noticeable, especially on the last three datasets, where DR can improve F1 scores by over 20% (e.g., on Nursery). Again, this is due to the nature of the datasets, namely whether it is possible to identify effective prototypes.
DRHC compared to resampling techniques (class-F1 vs. instance-F1 scores)
Table 7 summarises our findings when comparing the DRHC data to classifiers based on the resampling techniques SMOTE, RENN and ENN, and ICC in Section 4.4. The same ensemble classifier, SVM RBF balanced, is used in each comparison. DRHC exhibits significantly superior performance in 36 of 40 comparisons (
Comparing Table 7 and the last column in Table 4, we can see that SMOTE, RENN and ENN improve the F1 scores on most of the smart home datasets and some of the machine learning datasets. In particular, SMOTE consistently outperforms the other sampling techniques. The reason is that ENN and RENN undersample the majority classes by removing data points. The more imbalanced the dataset is, the more samples are discarded by these techniques, throwing away potentially useful information.
Still, DRHC produces higher recognition accuracies than SMOTE. One reason might be that oversampling the minority class can generate potentially misleading information [8]. SMOTE might introduce instances that do not add any information about the minority classes, and which can be considered noisy instances rather than true representations of them.
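The point about oversampling can be made concrete with a minimal sketch of the interpolation step at the core of SMOTE: each synthetic instance is placed on the line segment between a minority instance and one of its k nearest minority neighbours. The function name, parameters, seed and toy points are illustrative assumptions, and the sketch omits details of full implementations.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (needs >= 2 points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [minority[j] for j in range(len(minority)) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nn)))
    return synthetic


# Three hypothetical minority instances in a 2-D feature space
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
synthetic = smote_sketch(minority, n_new=5)
for point in synthetic:
    print(tuple(round(c, 3) for c in point))
```

Every synthetic point is a convex combination of two existing minority instances, so it cannot introduce feature values outside the range already observed. This is one way in which oversampling may add instances without adding information.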
It is worth noting that ICC outperforms DRHC on the HA and HB datasets but produces much lower F1 scores on WS, HC, and the machine learning datasets. As mentioned in Section 2.4, ICC decomposes majority classes into smaller sub-classes before training the classifier. This process creates more decision boundaries to separate the classes, which increases classification performance but may eventually lead to over-fitting. Table 8 compares the F1 scores of DRHC and the sampling techniques on the HA dataset, where the sub-class boundaries created by the clustering process improve classification performance.
Table 9 summarises our investigations into whether the representation learning techniques PCA, t-SNE, Autoencoder, and NOC can be used to improve activity recognition when compared to our DRHC algorithm. DRHC exhibits significantly improved predictive performance in 39 of 40 comparisons (
PCA performs the worst; its poor performance might indicate that compressing the data loses meaningful class information, leading to very low F1 scores. In addition, we need to retain 99% of the variability to obtain a good representation, which means preserving almost as many components as there are original features for the classifier to distinguish between activities. This poor performance is consistent with the literature [19], which suggests that PCA misses important nonlinear structures in the data.
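The link between the 99% variance requirement and the number of retained components can be illustrated with a short sketch. Given the variance explained by each principal component (the eigenvalues of the covariance matrix), it returns the smallest number of components whose cumulative share reaches a threshold. The function name and both eigenvalue spectra are hypothetical and are not taken from the datasets used in the paper.

```python
def components_for_variance(eigenvalues, threshold=0.99):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, ev in enumerate(ordered, start=1):
        cumulative += ev
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical spectra: variance spread evenly vs. concentrated in a few axes
flat_spectrum = [1.0] * 10
steep_spectrum = [90.0, 9.5, 0.3, 0.1, 0.1]

k_flat = components_for_variance(flat_spectrum)
k_steep = components_for_variance(steep_spectrum)
print(k_flat, k_steep)
```

When variance is spread evenly across features, as in the flat spectrum, nearly every component must be kept to reach 99%, so PCA offers little compression.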
t-SNE transforms the input feature vectors into 2 or 3 dimensions and has been widely used for visualising high-dimensional data. The features learnt by t-SNE separate some classes well, but not classes with little difference between them. Moreover, the classifier is still biased by the majority classes in each cluster, which results in poor recognition accuracy on the minority classes. For example, Fig. 1(b) plots the most commonly misclassified activities of the WS dataset in a two-dimensional scatter plot, including ‘R1 Leave Home’, ‘R1 Enter Home’, ‘R2 Enter Home’, and ‘R2 Leave Home’.
Autoencoders have been widely used in speech recognition, image classification, and face recognition [20], where they have achieved promising results in compressing data by learning linear and nonlinear relationships between features. However, they are not able to differentiate activities with less distinctive patterns. We have tried to configure the autoencoder with different parameters, such as different numbers of layers, different numbers of neurons, and various optimisation functions. No set of parameters significantly improves the classification accuracy, indicating that the autoencoder fails to represent noisy data with few sparse feature vectors. However, we have only experimented with a standard sparse autoencoder; more sophisticated autoencoders, such as the variational autoencoder [29], might improve performance. This is beyond the scope of this paper, and we will look into it in the future.
Discussion
The specific number and sequence of sub-techniques used in DRHC is justified by empirical evaluation (Table 3). Failure to implement resampling at the final stage of the algorithm leads to statistically significantly inferior performance in all cases. The situation is less clear-cut at the initial prototyping stage of the algorithm. The replacement of clustering by hierarchical sampling (both based on misclassification of instances for data that have not yet been augmented with dissimilarity representations) led to improved F1 score performance for two of our ten datasets, and omitting the dissimilarity representation stage also led to an increased average F1 score for two of ten datasets. These results may be due to statistical fluctuation in the underlying data, but practitioners seeking to identify subtle and minority activities from sensor data may wish to consider these modifications to the algorithm as described in Algorithm 1 in the event of unsatisfactory initial performance.
When compared to commonly-used classifiers, DRHC is a significant improvement, irrespective of the type of data (sensor-based or unbalanced benchmark) and whether or not dissimilarity representation is performed as a pre-process (Table 4 and Table 6). These results (71 of 80 comparisons a significant improvement in predictive ability) provide empirical validation for DRHC, demonstrating that the new algorithm achieves its intended purpose.
After selection of the balanced SVM with radial basis as the underlying ensemble classifier, we also report strong empirical evidence that DRHC outperforms implementations that employ alternative methods of resampling and relabelling.
Predicting human activity of House A
DRHC compared to representation learning techniques (class-F1 vs. instance-F1 scores)
In our algorithm, dissimilarity representations are derived for use when clustering misclassified instances from a baseline classifier. The effectiveness of dissimilarity representations depends on the selection of prototypes; i.e., whether the prototypes are representative of the datasets. We have employed clustering algorithms to select the centroids of each cluster as prototypes, which can be affected by the amount of training data being used. If only a small amount of training data is available, the prototypes might not be representative, which can affect the dissimilarity representation.
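A minimal sketch of the representation described above follows: cluster centroids serve as prototypes, and each instance is re-expressed as its vector of distances to those prototypes. The use of Euclidean distance, the function names, and the toy clusters are assumptions made for illustration; the clusters stand in for the output of a clustering algorithm.

```python
import math


def centroid(points):
    """Component-wise mean of a cluster of equal-length feature vectors."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def dissimilarity_representation(x, prototypes):
    """Represent x by its distance to each prototype (Euclidean here,
    which is an assumption rather than the paper's stated metric)."""
    return [math.dist(x, p) for p in prototypes]


# Hypothetical clusters, standing in for the output of a clustering step
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(4.0, 0.0), (4.0, 2.0)],
]
prototypes = [centroid(c) for c in clusters]
dr = dissimilarity_representation((0.0, 1.0), prototypes)
print(prototypes, dr)
```

With very few training instances per cluster, a centroid can shift noticeably when a single instance is added or removed, and all dissimilarity vectors shift with it.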
A plausible alternative approach would be to use standard unsupervised representation learning techniques to form the clusters. Our experiments suggest that this is always an inferior alternative, leading to statistically significantly inferior predictive performance in 30 of 30 tests (with the same ensemble classifier being used for each comparison). This (i) indicates that dissimilarity representations are crucial to the performance of predictors of subtle and infrequent activities, and (ii) provides strong evidence of the advantage of DRHC over existing alternatives.
However, the difference between dissimilarity representations and normal feature representations is not significant, and we reject our hypothesis that the dissimilarity representation is more effective in distinguishing subtle differences between classes. This is mainly due to the effectiveness of identified prototypes. The problem remains of how best to separate prototypes when we have activities that activate the same set of sensors, with the difference in their sensor distribution being almost undetectably small. Such subtle differences can challenge prototype selection. Future investigations include the design and evaluation of prototype selection and distance metrics to better represent and separate subtle differences in these distributions.
In this paper, we present a new technique based on dissimilarity representation, which leverages a multi-phase, hierarchical ensemble to recognise minority and subtle activities. A comprehensive sequence of empirical evaluations and comparisons demonstrates that (i) this is a challenging task where existing structure- and feature-based learning techniques do not perform well in general, and (ii) our DRHC algorithm constitutes a significant improvement on existing methods. The key novelty of our approach is that we reduce the bias of the ensemble classifier by training it on a subspace of data with less noise, so that the classifier can learn from minority activities and hence reliably identify well-discriminating features.
The problem remains of how best to separate prototypes when we have activities that fire the same set of sensors, with the difference in their sensor distribution being almost undetectably small. Such subtle differences cannot be captured by the baseline classifiers we have employed to date. Future investigations include the design and evaluation of prototype selection and distance metrics to better represent and separate subtle differences in these distributions. So far, we have only considered static sensor data (e.g., doors opening and closing, motion sensors firing, and lights turning on and off). Recent developments in wearable technologies such as smart watches also allow collection of accelerometer data. These data would also contain subtle and minority activities, and we speculate that dissimilarity-representation-based classification of combined static and mobile data will be useful for accurately detecting important events in the ageing population. To demonstrate the generality of DRHC, we will also further evaluate it with other feature extraction techniques [18].
