Abstract
With the availability of reasonably priced sensors and sensor systems, sensor-based Human Activity Recognition (HAR) has attracted considerable attention. The use of smartphones for HAR is an active area of research in which the development of fast and efficient machine learning approaches is essential. In recent years, wireless sensor networks have been deployed in the real world to collect measurement data. However, the major task is to extract high-level knowledge from such raw data. In applications of sensor systems, outlier detection has received increasing attention in recent years. Outlier detection is used to remove noisy data, to discover faulty nodes, and to identify interesting events. Conventional outlier detection methods are not directly applicable to sensor networks because of the dynamic nature of sensor data and the constraints of wireless sensor networks. In this paper, a hybrid outlier detection and removal method is proposed to detect abnormal human activities based on mobile sensor data. An experimental investigation is carried out on datasets gathered under different conditions. The results demonstrate that the proposed method, in combination with standard classifiers, outperforms other anomaly detection methods in terms of several quality measures.
Introduction
Outlier detection, also called anomaly detection, in Wireless Sensor Networks (WSNs) is defined as the process of identifying data instances (outliers) that deviate from the rest of the data pattern based on a specific measure [1, 5, 9, 15, 30, 35]. A WSN comprises multiple small, low-cost sensor nodes that integrate sensing, computational power and short-range wireless communication capabilities. The performance of these sensing devices tends to diminish as their battery power is depleted. Since a WSN may include a huge number of sensors, the likelihood of error is higher than in traditional networks. These issues make WSNs more prone to anomalies. Outliers in sensor data have to be detected and removed because they strongly influence data analysis [2, 4, 9, 15, 41]. The principal task of the proposed outlier detection method is to identify and remove outliers in the sensor data so that high classification performance is achieved. Outlier detection safeguards the quality of the measured data, improves the robustness of data analysis in the presence of noise and defective sensors, and reduces the communication overhead caused by erroneous data. Anomaly detection also provides an efficient way to search for data that do not follow the normal pattern of sensor data in the network.
Human Activity Recognition (HAR) using smartphone sensors has enabled many context-aware applications in fields such as healthcare and security surveillance. Initially, a few wearable sensors were used for human activity recognition [7]. Today, smartphones are equipped with sensors such as accelerometers, gyroscopes, magnetometers, and proximity and light sensors, and many researchers have begun to use mobile phones for activity recognition. Human activity recognition faces a number of unique challenges [8, 10, 11]. The problem is computationally challenging because, unlike normal activities, outlier data are extremely scarce. It is therefore difficult to design an outlier detection model that minimizes both the false positive rate and the false negative rate. Another challenge is to design a model with high classification accuracy. Owing to the specific requirements, dynamic nature and resource constraints of WSNs, existing outlier detection methods cannot be applied directly [10, 11]. The main objective of this work is to propose a Hybrid Outlier Detection (HOD) method for human activity recognition in WSNs. To analyze the effect of the proposed method, activity classification is performed using standard machine learning classifiers. The experimental analysis of outlier removal is carried out on two activity recognition datasets collected in different environments. The experimental results demonstrate that a classification model with HOD as a pre-processing stage achieves better classification results than the other techniques considered.
This paper is structured as follows. Section 2 reviews related work. Section 3 presents the problem design. Section 4 describes the proposed approach and the methodology used. Section 5 gives a detailed description of the data sources used for this analysis. Section 6 explains the experimental evaluation on the datasets. Section 7 discusses the simulation results obtained. Finally, Section 8 concludes the paper.
Related work
This section reviews outlier detection techniques for WSNs, grouped by methodology. In statistical outlier detection methods, a data point is declared an outlier if the probability of the instance being generated by the assumed model is very low [30, 31, 33, 41]. These techniques usually incur low communication overhead and low computational complexity, as they declare the most distant points as outliers based on the data distribution. In nearest-neighbor-based techniques, a data point is identified as an outlier when it lies far from its neighbors. These methods require the Euclidean distance between each pair of data points to be computed, which leads to high computational complexity [41, 43]. In clustering-based algorithms, data points are grouped into clusters and a distance measure is used to identify outliers [29, 32]. In classification-based outlier detection, support vector machine (SVM) and Bayesian network classification models are used [3, 6, 8, 28]. The computational complexity of clustering-based and classification-based techniques is greater than that of statistical techniques. Table 1 summarizes a few recent studies from the literature on human activity recognition using machine learning approaches.
Survey of literature done
Human activity recognition is an emerging and challenging area of research. Many machine learning based studies on activity recognition use accelerometers [25, 26, 27, 28, 29, 30, 31]. Most studies in activity recognition have focused on major activities such as sleeping, sitting, standing, walking, running and jumping [9, 15, 35]. In some works, jumping [18], running [36], standing [34] and sleeping [22] were excluded from the experiments. In the HAR literature, studies have used both a single sensor and combinations of multiple sensors. Gao et al. showed that multiple sensors do not significantly improve recognition of typical physical activities [13]. A few works proposed physical activity monitoring systems based on multiple sensors. These HAR systems can recognize body poses such as standing, sitting and lying, as well as periods of walking, with the objective of monitoring elderly people in their daily lives. Most existing HAR systems focus on detecting human activities without considering data outliers. Although many machine learning based outlier detection techniques have been proposed for various WSN applications, studies on outlier detection for human activity recognition in WSNs are rare [12, 14, 19, 20, 21, 39]. Based on the literature reviewed, this work focuses on outlier detection in wireless sensor networks for human activity recognition. An outlier detection method is proposed and evaluated on two datasets collected in different environments. The effect of outlier detection is empirically evaluated using standard classification methods.
The increased availability of sensors in WSNs has raised interest in the development of human activity recognition techniques for real-world applications such as home monitoring, smart hospitals, and physical and sports activities. In particular, the prevalence of mobile devices such as sensor-equipped smartphones offers advanced capabilities for recognizing human activity in wireless sensor networks. However, the raw data collected from WSNs are often unreliable and inaccurate because of the limited energy, memory, computational power and bandwidth of sensor nodes, the dynamic nature of the network, and the harshness of the deployment environment. These inaccuracies are generically referred to as outliers. One way to ensure the quality of sensor data is to detect outliers. Outlier detection is an important task in various applications of wireless sensor networks [19, 21, 26]. In recent years, various outlier detection methods have been proposed for WSNs. Traditional approaches to outlier detection in WSNs suffer from a high false positive rate, especially when the collected sensor data are biased toward normal data and anomalous events are rare. Outlier detection techniques designed to run on WSN nodes should have a high detection rate and a low false alarm rate. It is important for decision makers to be able to detect outliers in order to take appropriate action.
Human Activity Recognition (HAR) is one of the application areas in which outliers are present [14, 20, 37, 42]. This motivates us to investigate outlier detection in WSNs for human activity recognition. The present work evaluates individual outlier detection methods on online and offline human activity recognition data over WSNs. Based on the literature, the widely used SVM and ANN methods are employed for classification in this study [14, 37]. Based on the evaluation results obtained from the individual outlier detection methods, a hybrid outlier detection method is proposed for online and offline human activity recognition data over WSNs. The significant contributions of this work are as follows:
We propose a hybrid approach for outlier detection in WSNs for human activity recognition.
We perform activity classification after outlier elimination in order to analyze the effect of outliers.
We conduct an intensive experimental evaluation of the proposed algorithm on two datasets collected in different environments.
We perform a series of empirical studies using support vector machine (SVM) and radial basis function neural network (RBFN) classifiers, comparing the proposed hybrid outlier detection approach with other popular outlier detection methods applied to human activity recognition.
In this section, the proposed approach is introduced in detail. Many recent machine learning based activity recognition studies are concerned only with the activities themselves and do not consider outliers [14, 16, 19, 21, 42]. In this paper, a hybrid outlier detection approach is proposed to detect outliers. The methodology consists of the following steps (Fig. 1):
Problem design.
Data collection: Two different datasets are used for analysis. These datasets differ in properties such as the number of activities, instances and features, and the environment used for data collection.
Feature extraction: Classification algorithms cannot be applied directly to raw time-series accelerometer data. The collected data are subjected to a suitable feature extraction process. The transformed dataset, represented by the extracted features, is used for further processing.
Outlier detection: The density based, distance based and proposed outlier detection methods are applied to both datasets. The identified outliers are removed.
Classification: Human activity classification is performed on the outlier-removed datasets using standard supervised machine learning classification methods. The classifiers use the transformed features to predict the activity labels.
Evaluation: Several evaluation metrics are computed to show the effectiveness of outlier removal in wireless sensor data for activity recognition.
In human activity recognition, outliers are often more interesting than normal instances, since they contain useful information about the underlying abnormal behavior. This section describes the outlier detection and classification methods used in this research.
Outlier detection methods
This section describes the various methods used for outlier detection.
Distance based outlier detection
Detecting outliers by their distance to neighboring examples is a popular approach to finding abnormal cases in a dataset. Nearest neighbor distance-based outlier detection (KNN-DB) identifies anomalies by examining the distance from an example to its nearest neighbors. In this approach, the local neighborhood of a point is characterized by its K closest neighbors. If the neighboring points are relatively near, the data point is considered an inlier, or normal data point. If the neighboring points are far away, the data point is considered an outlier, or anomaly. This approach can be applied to any feature space for which a Euclidean distance measure is defined [40].
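The distance based idea can be illustrated with a minimal sketch. It assumes that the outlier score of a point is the mean Euclidean distance to its k nearest neighbours; this is one common variant, and the function name, the toy data and the choice of k are hypothetical rather than taken from this work.

```python
import math

def knn_distance_scores(points, k):
    """Return, for each point, the mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy 2-D feature vectors (illustrative only): four clustered points and one isolated point.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_distance_scores(samples, k=2)

# The point with the largest score is the strongest outlier candidate.
outlier_index = max(range(len(samples)), key=lambda i: scores[i])
```

The pairwise distance computation makes this sketch quadratic in the number of samples, which matches the computational cost noted for nearest-neighbor-based techniques in the related work.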
Density based outlier detection
Local Outlier Factor (LOF) is a popular density based outlier detection method. The local density of an object is compared to the local densities of its nearest neighbors. The data points that have a significantly lower density than their nearest neighbors are identified and declared as outliers. The k-nearest neighbors are calculated using Euclidean distance. The local reachability density (LRD) for all data points ‘p’ is calculated, based on the set of ‘k’ neighbors
The LRD is inversely proportional to the average reachability distance of the nearest neighbors. The reachability distance reach_dist
If the density of all neighbors of a data instance ‘p’ is higher than the density of the data instance ‘p’ itself, then a LOF will be assigned to the instance indicating a possible outlier. On the other hand, if the densities of the neighbors are approximately as high as in the instance itself, the resulting ratio will be close to one [17].
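For reference, the standard LOF definitions that match the description above can be written as follows. The notation is an assumption and is not taken from this paper: $d(p, o)$ denotes the Euclidean distance, $N_k(p)$ the set of the $k$ nearest neighbours of $p$, and $k\text{-dist}(o)$ the distance from $o$ to its $k$-th nearest neighbour.

```latex
\begin{align}
\mathrm{reach\_dist}_k(p, o) &= \max\bigl\{\, k\text{-dist}(o),\; d(p, o) \,\bigr\} \\
\mathrm{LRD}_k(p) &= \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \mathrm{reach\_dist}_k(p, o) \right)^{-1} \\
\mathrm{LOF}_k(p) &= \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{LRD}_k(o)}{\mathrm{LRD}_k(p)}
\end{align}
```

Under these definitions, a point whose neighbours are denser than the point itself obtains a LOF value greater than one, while a point in a region of uniform density obtains a value close to one.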
Different detection methods rely on different assumptions to model outliers. LOF assumes that the density around a normal data object is similar to the density around its neighbours, whereas the density around an outlier is considerably different from the density around its neighbours. KNN-DB assumes that normal data objects have a dense neighbourhood and that outliers lie far from their neighbours. Different models will therefore produce different results. The proposed hybrid method is intended to overcome the drawbacks of the individual outlier detection methods. It is composed of two phases. In the first phase, the individual outlier detection methods are constructed. The second phase is the combination strategy, in which the outlier scores are combined by averaging. In the proposed hybrid approach, the dataset (D) consists of ‘m’ activity samples which represent the activities performed where ‘
Pseudo code of proposed Hybrid Outlier Detection (HOD) method.
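Since the pseudo code itself is not reproduced here, the two-phase idea (individual scores, then average scoring) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' exact procedure: the min-max normalization of the scores, the threshold, the value of k and all function names are hypothetical, and the sketch does not handle duplicate points, for which the reachability sum would be zero.

```python
import math

def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbours of point i."""
    dists = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    return dists[:k]

def knn_scores(points, k):
    """Distance based score: mean distance to the k nearest neighbours."""
    return [sum(d for d, _ in k_nearest(points, i, k)) / k for i in range(len(points))]

def lof_scores(points, k):
    """Density based score: Local Outlier Factor with k neighbours."""
    n = len(points)
    neigh = [k_nearest(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neigh]  # distance to the k-th neighbour
    lrd = []
    for i in range(n):
        # Reachability distance of point i with respect to each neighbour j.
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    return [sum(lrd[j] for _, j in neigh[i]) / (k * lrd[i]) for i in range(n)]

def min_max(values):
    """Rescale scores to [0, 1] so that the two methods are comparable (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_outlier_detection(points, k, threshold):
    """Phase 1: score with both detectors. Phase 2: average the normalized scores."""
    distance_part = min_max(knn_scores(points, k))
    density_part = min_max(lof_scores(points, k))
    combined = [(a + b) / 2 for a, b in zip(distance_part, density_part)]
    outliers = [i for i, s in enumerate(combined) if s > threshold]
    return combined, outliers

# Toy feature vectors (illustrative only): a tight cluster and one isolated sample.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
combined, outliers = hybrid_outlier_detection(samples, k=2, threshold=0.5)
```

In a pipeline such as the one described in the methodology, the samples flagged in `outliers` would be removed before the classifiers are trained.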
This section describes the classification methods employed after outlier detection for human activity recognition: the support vector machine (SVM) and the radial basis function neural network (RBFN). SVM is a supervised machine learning classification technique that uses a kernel function to map the input feature space into a new space in which the classes are linearly separable. A polynomial kernel is chosen, with default values for parameters such as the cache size and the exponent. In this work, a multi-class SVM is used for classification [23]. The RBFN is a feed-forward network whose architecture consists of three layers: an input layer, a hidden radial basis layer and a linear output layer [38].
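The two building blocks named above can be sketched in a few lines. The parameter names and default values (`degree`, `coef0`, `width`) are assumptions chosen for illustration; the actual kernel settings and network structure used in the experiments are not specified beyond "default values".

```python
import math

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel K(x, y) = (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

def rbfn_forward(x, centers, width, weights, bias):
    """Single-output RBFN: Gaussian hidden units followed by a linear output layer."""
    hidden = [math.exp(-math.dist(x, c) ** 2 / (2 * width ** 2)) for c in centers]
    return sum(w * h for w, h in zip(weights, hidden)) + bias

# Hypothetical inputs, used only to show the calls.
kernel_value = polynomial_kernel((1.0, 2.0), (3.0, 4.0))
network_output = rbfn_forward((0.0, 0.0), centers=[(0.0, 0.0)], width=1.0, weights=[2.0], bias=0.5)
```

The kernel replaces the explicit mapping into the higher-dimensional space, while the RBFN hidden layer responds most strongly to inputs close to its centers.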
Data preparation
The datasets for outlier detection are collected and preprocessed for classification. The steps involved in data preparation are described in this section.
Activity datasets
To establish a comparative baseline for the outlier detection algorithms, two datasets are used. The first is the standard Human Activity Recognition (HAR) dataset, which is available offline and is publicly accessible in the UCI machine learning repository. It is a collection of accelerometer readings from 4 sensors worn by each of 4 subjects. It comprises 6 activities: standing, sitting, laying, walking, walking upstairs and walking downstairs. The HAR dataset contains 5,418 examples and 6 attributes: user ID, activity, timestamp, x-acceleration, y-acceleration and z-acceleration. The second dataset was constructed in an online environment. Tests were performed with five volunteer subjects whose average age is 27. Each subject performed the same predefined activity pattern of walking, sitting and standing. Each experiment lasts 4 minutes, with each activity performed for 60 seconds. The same test scenario was repeated 9 times with different system parameters (window size and sampling rate). All tests were performed on Samsung Galaxy SII Android mobile phones. The sensor data collected include acceleration readings and position values. The acceleration readings comprise x-axis, y-axis and z-axis acceleration. The position readings contain latitude, longitude and speed. For this analysis, only the acceleration readings are considered. The second dataset therefore contains the user ID, x-acceleration, y-acceleration, z-acceleration and the activity label.
Feature extraction
Standard classification algorithms cannot be applied directly to raw time-series accelerometer data. Instead, the raw time-series data are transformed into examples. The data are divided into 10-second segments, and informative features are generated from the 200 raw accelerometer readings contained in each segment, where each reading consists of an x, y and z value corresponding to the three axes. A total of forty-six summary features are generated, although these are all variants of just six basic features. A similar transformation is carried out for the second dataset. The features obtained are described in Table 2.
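The windowing step can be sketched as follows, assuming non-overlapping windows of 200 readings. The specific features computed here (per-axis mean and standard deviation, and the mean resultant acceleration) are illustrative placeholders and are not claimed to be the forty-six features of Table 2.

```python
import math
import statistics

def segment(readings, window_size=200):
    """Split (x, y, z) readings into non-overlapping windows, dropping any remainder."""
    last_start = len(readings) - window_size
    return [readings[i:i + window_size] for i in range(0, last_start + 1, window_size)]

def window_features(window):
    """Per-axis mean and standard deviation, plus the mean resultant acceleration."""
    feats = {}
    for axis, name in enumerate("xyz"):
        values = [reading[axis] for reading in window]
        feats[f"mean_{name}"] = statistics.fmean(values)
        feats[f"std_{name}"] = statistics.pstdev(values)
    feats["mean_resultant"] = statistics.fmean(
        math.sqrt(x * x + y * y + z * z) for x, y, z in window
    )
    return feats

# Synthetic readings (illustrative only): two full windows and an incomplete third one.
readings = [(1.0, 0.0, 0.0)] * 200 + [(0.0, 2.0, 0.0)] * 200 + [(0.0, 0.0, 3.0)] * 50
windows = segment(readings)
features = [window_features(w) for w in windows]
```

Each resulting feature dictionary corresponds to one example that can be passed to the outlier detection and classification stages.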
Feature set of activity datasets
The offline dataset used for this analysis is referred to as dataset-1 (DS1), and the second dataset, collected online in a noisy environment, is referred to as dataset-2 (DS2). The two datasets were obtained under different environmental conditions: DS1 is a standard benchmark collected in a well-controlled experimental environment for the purpose of activity classification, whereas DS2 was constructed in a noisy real-time environment for the purpose of outlier detection. Table 3 describes the datasets used for outlier detection.
Description of datasets
The main objective of this experimental analysis is to compare the performance of the proposed hybrid outlier detection method (HOD) for human activity recognition. Three experimental analyses were conducted with two different datasets. The first analysis aimed at identifying the outliers using two diverse outlier detection methods. The outlier detection methods are distance based KNN-DB outlier detection and density based LOF outlier detection. KNN-DB is a distance based outlier detection method based on the distance of a point from its
Results and discussion
The evaluation of learning algorithms concentrates on comparing algorithms and on their applicability to a specific domain. Most evaluation metrics for a classification task are based on the confusion matrix. To validate the performance of the classification methods with outlier detection, several evaluation metrics are used: misclassification rate, precision (or correctness), recall (or detection rate) and the Receiver Operating Characteristic (ROC) curve. Prediction systems are developed using each of the classification methods discussed in Section 4.2 for the datasets DS1 and DS2. The results are shown as confusion matrices in Tables 4–9. Each confusion matrix contains the number of actual and predicted results for each activity performed. For 10-fold cross validation, the dataset is first partitioned into ten equal-sized sets, and each set is used in turn as the test set while the classifier is trained on the other nine. The predictions are compared with the actual activities and the performance measures are computed.
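The 10-fold partitioning described above can be sketched as an index generator. The sketch assumes contiguous folds without shuffling, which is a simplification; in practice the samples would normally be shuffled or stratified first, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k nearly equal-sized folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Example with 25 samples: every sample appears in exactly one test fold.
folds = list(k_fold_indices(25, k=10))
```

Each classifier would be trained on `train` and evaluated on `test` for every fold, and the per-fold predictions accumulated into one confusion matrix.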
Confusion matrix for dataset-1 (DS1): With outliers
(a). Outlier identification in the dataset-1 (DS1) using Distance based method; (b). Outlier identification in the dataset-1 (DS1) using Density based method; (c). Outlier identification in the dataset-1 (DS1) using proposed HOD; (d). Outlier identification in the dataset-2 (DS2) using proposed HOD.
Misclassification rate is defined as the ratio of the number of wrongly classified activities to the total number of activities classified by the prediction system. Tables 4–7 summarize the misclassification results for dataset-1 (DS1). The overall misclassification rate is given at the bottom of each matrix. The classification results obtained for SVM and RBFN without outlier detection on DS1 are shown in Table 4. The results of SVM and RBFN with outliers (Table 4) show that the misclassification rate for DS1 is comparatively larger than when outlier detection is employed. The misclassification rate is reduced considerably by the distance based and density based outlier detection methods (Tables 5 and 6) for both classifiers, because these methods remove the noisy instances. Among the individual outlier detection methods, the density based method yields the lowest misclassification rate. The overall misclassification rate is lowest for SVM and RBFN with the HOD method: it is reduced to 8.2% for SVM and 17.5% for RBFN, which is lower than with the individual outlier detection methods. The classification results obtained for SVM and RBFN with and without outlier detection on dataset-2 (DS2) are shown in Tables 8 and 9. The results of SVM and RBFN with outliers (Table 8) show that the misclassification rate is comparatively larger for DS2. The misclassification rate is reduced considerably for classifiers with the individual outlier detection methods (Tables 8 and 9), again because the noisy instances are removed. Among the individual methods, the density based method has the lowest misclassification rate. The overall misclassification rate is lowest for SVM and RBFN with the HOD method: it is reduced to 6.3% for SVM and 7.2% for RBFN, which is lower than with the individual outlier detection methods.
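The misclassification rate defined above follows directly from a confusion matrix, as the sketch below shows. The matrix uses made-up counts for three activities, with rows as actual classes and columns as predicted classes; it does not reproduce any of Tables 4–9.

```python
def misclassification_rate(confusion):
    """Share of samples off the main diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total

# Hypothetical counts (rows: actual activity, columns: predicted activity).
confusion = [
    [50, 3, 2],
    [4, 45, 1],
    [1, 2, 42],
]
rate = misclassification_rate(confusion)  # 13 of 150 samples are misclassified
```

The corresponding accuracy is one minus this rate.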
Confusion matrix for dataset-1 (DS1) for distance based outlier detection
Confusion matrix for dataset-1 (DS1) for density based outlier detection
Confusion matrix for dataset-1 (DS1) for proposed hybrid outlier detection
Confusion matrix for dataset-2 (DS2)
Confusion matrix for dataset-2 (DS2)
Correctness is defined as the ratio of the number of activities correctly classified to the total number of activities classified. The results in Table 10 show that the SVM and RBFN methods without outlier removal have low correctness for DS1 and DS2. This implies that a large number of outliers has a negative impact on the correctness of the classifiers. Among the individual outlier detection methods, both classifiers achieve their highest correctness with the density based method, irrespective of the dataset. The correctness is considerably higher for classifiers combined with the hybrid outlier detection method than with the individual methods, for both DS1 and DS2. For DS1, the precision with HOD is 89.4% for SVM and 79.6% for RBFN. For DS2, the precision with HOD is 93.4% for SVM and 92.8% for RBFN. This indicates that classifiers with the hybrid outlier detection method are better suited to activity classification than classifiers with the individual outlier detection methods. In general, classifiers based on hybrid outlier detection predict the activities with high correctness.
Recall (or) detection rate
The recall (or detection rate, or completeness) of the classification models is shown in Table 11. The results in Table 11 show that the SVM and RBFN methods without outlier removal have a low detection rate for DS1 and DS2. This implies that a large number of outliers has a negative impact on the detection rate of the classifiers. Among the individual outlier detection methods, both classifiers achieve their highest detection rate with the density based method, irrespective of the dataset. Table 11 also shows that classifiers with the hybrid outlier detection method detect the largest share of the activities present compared with the other methods on both datasets. Among the three outlier detection models, HOD achieves the highest completeness for DS1: 87.9% for SVM and 75.3% for RBFN. For DS2, the highest detection rate is also observed for the hybrid outlier detection method: 91.7% for SVM and 88.8% for RBFN.
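Per-class precision and recall can be computed from a confusion matrix as sketched below. The sketch assumes the usual per-class definitions (precision relative to the samples predicted as a class, recall relative to the samples actually belonging to it) and uses made-up counts with rows as actual classes and columns as predicted classes; how the per-class values are averaged in Tables 10 and 11 is not specified here.

```python
def precision_recall(confusion):
    """Return a (precision, recall) pair for each class of a square confusion matrix."""
    n = len(confusion)
    result = []
    for c in range(n):
        true_positive = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        result.append((precision, recall))
    return result

# Hypothetical counts (rows: actual activity, columns: predicted activity).
confusion = [
    [50, 3, 2],
    [4, 45, 1],
    [1, 2, 42],
]
per_class = precision_recall(confusion)
```

An overall figure can then be obtained by averaging the per-class values, optionally weighted by class size.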
Visual evaluation
Receiver operating characteristic (ROC) curves are used to compare the performance of the classifiers visually. A ROC curve is plotted with the false alarm rate on the X-axis and the detection rate on the Y-axis. The misclassification rate, precision and recall values in Tables 4–11 show that SVM performs better than RBFN for all outlier detection methods and datasets. In addition to the quality metrics evaluated so far, the performance of the SVM classifier is therefore also evaluated visually using ROC curves. From the ROC curves for DS1 and DS2 in Figs 4 and 5, it can be observed that the hybrid outlier detection method combined with the SVM classifier outperforms the single distance based and density based approaches.
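The construction of a ROC curve can be sketched for the binary case, in which each sample has a decision score and a 0/1 label. This is a simplification of the multi-class setting studied here (where a one-versus-rest treatment would be one option), and the scores and labels below are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (false_alarm_rate, detection_rate) points, sweeping the threshold downwards."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only after all samples sharing the same score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((false_pos / negatives, true_pos / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical decision scores and ground-truth labels (1 = target class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

A curve that lies above another over the whole range of false alarm rates is said to dominate it, which is the sense in which the ROC curves are compared in the discussion.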
Precision of datasets
Recall of datasets
ROC curves of SVM with outlier detection methods for DS1.
ROC curves of SVM with outlier detection methods for DS2.
Furthermore, for both datasets the hybrid outlier method has a very similar ROC curve, which dominates those of the individual detection methods. It can also be observed that the distance based approach applied with the SVM classifier has a much worse ROC curve than SVM with the density based approach, and that the SVM classifier applied without outlier detection has a much worse ROC curve than SVM with any of the outlier detection methods. The density computation in the LOF approach is also significantly influenced by noisy and irrelevant features, so the LOF performance degrades as well. However, when the proposed method for combining outlier detection algorithms is applied to dataset-2, it can be observed that it alleviates the effect of noisy features more effectively.
From the results it is found that, among the three outlier detection methods, prediction models based on the hybrid outlier detection method perform well in all respects. HOD makes the outlier detection process more accurate because it combines two diverse detection methods. Among the classifiers used, SVM performs better than RBFN in terms of misclassification rate, correctness and detection rate. This may be because the generalization performance of neural classifiers depends on the structure size, and the selection of an appropriate structure relies on cross validation. The performance of SVMs depends only on the choice of kernel function type and parameters, and this dependence is weaker. It is also noted that RBFN does not perform well as the size of the dataset increases and is better suited to small datasets. Among the datasets used, a drastic increase in detection rate and precision after outlier detection is observed for DS2. A possible reason is the environment used for data collection and the quality of the sensing device used. Outliers may also occur in an environment with clutter and variable lighting. For DS1, activity recognition was conducted in a well-controlled environment. Having analyzed the proposed method, we compare it with other approaches to activity monitoring. The comparison is based on average accuracy, which is calculated from the misclassification rates reported in Tables 4–9. Based on the literature review (Table 1), an analysis of recent work in the activity recognition field was carried out. The studies shown in Table 1 were chosen from the related work for this analysis; all are recent and highly cited. Table 1 shows the results in terms of accuracy.
Our proposed approach achieves an accuracy of 91.1% for the offline dataset (DS1) and 93.7% for the online dataset (DS2). This shows that, for both datasets, the proposed outlier detection method recognizes activities with higher accuracy than the approaches reported in the literature. The following are some possible threats to the validity of the proposed method. This work was carried out under class imbalance; a suitable sampling method may be adopted to balance the class distribution before outlier detection is applied. For online data collection, various hardware limitations should be taken into account. Although a reasonable number of activities were recorded, the performance of the proposed method should be analyzed with a larger number of activities. The performance of the proposed method should also be analyzed for applications other than monitoring daily human activities.
A hybrid approach for combining outlier detection algorithms was presented. Experiments on two datasets indicate that the proposed hybrid outlier detection method can achieve much better detection performance than the single outlier detection algorithms. The proposed combination method successfully exploits the benefits of combining multiple outputs and diversifying individual predictions. The datasets used in our experiments contained different percentages of outliers, different sizes and different numbers of features, thus providing a diverse test bed and showing the wide capabilities of the proposed framework. The general nature of the proposed framework allows the combining scheme to be applied to a number of combinations of outlier detection algorithms. Although the experiments have provided evidence that the proposed method can be very successful in the outlier detection task, future work is needed to characterize it fully, especially on very large and high-dimensional databases, where new algorithms for combining outputs from multiple outlier detection algorithms are worth considering.
