Abstract
Online smartphone-based human activity recognition (HAR) has a variety of applications, such as fitness tracking and healthcare. Signals generated by smartphone-embedded sensors are now widely used in HAR systems because they provide an unobtrusive platform. In this paper, we propose a deep convolutional neural network (CNN) model that provides an effective and efficient smartphone-based HAR system. We use the CNN to extract local features automatically from the raw time-series data, and complement them with simple time-domain statistical features to obtain a more distinguishable representation. Furthermore, we explore the impact of a novel data augmentation method on the recognition accuracy of the proposed model. The performance of the proposed method is evaluated on two public data sets (UCI and WISDM) that were collected using smartphones. Experimentally, we show that the proposed model achieves state-of-the-art performance on these data sets. Finally, to demonstrate the applicability of the proposed model to online smartphone-based HAR, we evaluate its computational cost.
Introduction
Human activity recognition (HAR) supports a diversity of applications such as fitness tracking, healthcare and entertainment [1]. Traditionally, wearable sensors were commonly used for HAR. They are designed to be small enough to be worn on the human body, and they can capture physical states such as the orientation and speed of body movement. Well-known examples of these sensors are the accelerometer and the gyroscope. The essential limitation of wearable sensors is that they form an obtrusive platform, because they must be worn, often uncomfortably, on a part of the human body (waist, wrist, chest, legs or head). In addition, they need to be recharged regularly to provide a long-term solution for HAR.
To overcome the limitations of traditional wearable sensors, researchers have recently employed a variety of sensors embedded in modern smartphones, such as accelerometers and gyroscopes, for HAR [1]. Smartphones have several advantages as wearable sensors: they are cost-effective, since the user already maintains and recharges them regularly, and they provide an unobtrusive platform for HAR. In addition, with the continuous enhancement of their computational capabilities, smartphones provide a more adequate environment for online HAR [1], where the activity recognition task is performed in real time.
Online smartphone-based HAR is treated as a machine learning task, more specifically a time-series classification problem. This problem can be solved in three main steps: windowing, feature extraction and classification. The most critical step is feature extraction, since it dominates the overall performance of the system. The main feature extraction approaches used in the literature are shallow learning and deep learning. In shallow learning methods, domain experts extract handcrafted features explicitly, whereas in deep learning, a more recent approach, the model implicitly extracts the features from the raw data.
Online smartphone-based HAR requires high recognition accuracy at low computational cost. Initially, shallow learning approaches such as the support vector machine (SVM), k-nearest neighbor (KNN) and hidden Markov model (HMM) were used. They have been shown to be effective for many well-constrained problems, but they are less suited to modeling more complicated real-world applications such as computer vision and speech recognition.
More recently, deep learning methods have gained intensive attention because they provide better performance in many fields. One such field is speech recognition, which operates on time-series data as our problem does; for this reason, deep learning methods have recently been used for online smartphone-based HAR [12–22].
In this research work, we propose a CNN architecture as an automatic feature extractor and classifier for online smartphone-based HAR. We chose the CNN over other deep learning models because of two main properties: local connectivity and parameter sharing. These properties reduce the number of connections and parameters and therefore lead to a lower computational cost than other deep learning models, which in turn enables us to build an online (real-time) HAR system.
A weakness of deep learning models is that they over-fit when the amount of training data is small. To reduce over-fitting of a deep model, data augmentation is the most widely used approach in the recent literature [28]. To the best of our knowledge, this is the first work that utilizes data augmentation to improve the performance of online smartphone-based HAR.
The main contributions of this research work are as follows. First, we propose a deep learning architecture that uses a CNN together with time-domain statistical features to represent the raw time-series data of smartphone-based HAR efficiently. Second, we propose a new data augmentation approach for smartphone-based HAR that enhances the accuracy of the proposed model by reducing overfitting. Third, we show that the proposed model outperforms the best current solutions and establishes state-of-the-art results on two benchmark datasets (UCI and WISDM). Fourth, to validate the applicability of the proposed model to online smartphone-based HAR, we evaluate its computational cost on a modern smartphone.
The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 gives a detailed description of the proposed model. Section 4 presents the experimental results of the proposed method. Conclusions and future work are given in Section 5.
Related works
Initially, many shallow learning approaches were applied to online smartphone-based HAR. Shallow learning methods usually proceed in two fundamental steps: handcrafted feature extraction and classification. The handcrafted features are generated from the raw data (time-series data in our case) by domain experts. Their limitations are that they are problem-specific and that HAR research lacks a systematic method for extracting them. Features are extracted in either the time domain or the frequency domain. Time-domain features alone are usually not sufficiently discriminative; consequently, researchers have combined frequency-domain features with time-domain features. Frequency-domain features are extracted using common signal transformations such as the discrete Fourier transform (DFT), typically computed with the fast Fourier transform (FFT).
After the handcrafted features are extracted, they are fed to the second step, the classifier. For online smartphone-based HAR, a variety of methods have been implemented using different machine learning approaches, such as Naive Bayes [2], K-Nearest Neighbor (KNN) [2–4], Decision Tree [5], Support Vector Machine (SVM) [6–8], Neural Networks [9, 10], and Boosting [11]. They have been shown to be effective for many well-constrained problems, but they are less suited to modeling more complicated cases such as time-series data (as in our problem), which call for approaches such as deep learning.
Recently, deep learning methods have gained intensive attention because they provide better performance on many forms of complex data, such as image and time-series data. The main advantage of deep learning over shallow learning is that it can automatically extract discriminative features for the task at hand. In addition, deep learning methods can take advantage of unlabeled data, which is abundant and easy to obtain, to learn more abstract features. Moreover, several layers of feature representations can be stacked to create deep networks, which are better able to model the complex structure of the data.
More recently, researchers have made intensive use of deep learning methods for online smartphone-based HAR [12–22]. To avoid a high computational cost during the test (prediction) phase, researchers do not use very deep architectures (only 2–4 layers). Training a deep learning model usually takes a very long time (several hours or a few days) and is computationally expensive. However, this is not an issue in our case, as the training phase is done offline on powerful computers.
For online smartphone-based HAR, a variety of deep learning models have been used in the literature. For example, a CNN model is used in [12–18], a Stacked Autoencoder (SAE) model in [19, 20], a Deep Belief Network (DBN) model in [21], and a Recurrent Neural Network (RNN) in [22]. The majority of the related work uses a CNN model, for two fundamental reasons. First, the translation invariance property of the CNN allows it to extract features that account for the different forms and styles in which people perform the same activity. Second, the CNN has a lower computational cost than other deep learning models as a result of its weight sharing and local connectivity.
The mainstream of the related deep learning works [12–16, 22] uses deep architectures, which are theoretically able to learn more abstract features. However, this often leads to overfitting, since the training data is small compared to the number of model parameters. To overcome overfitting, the study in [17] uses a shallow CNN architecture, but that architecture was not able to learn more abstract features. None of the related works uses data augmentation to improve the performance of online smartphone-based HAR. Based on these observations, we use a fairly deep model to learn more abstract features, and we reduce overfitting with a novel data augmentation method for smartphone-based HAR.
To increase the performance of the CNN model, the work in [13] feeds temporal fast Fourier transforms together with the raw signal into the model. The limitation of this method is its high computational cost. In other works, simple statistical features are combined with the features learned automatically by the CNN [14, 17]: 102 statistical features are used in [14], whereas 40 are used in [17]. In this work, we use fewer time-domain statistical features, which lead to better performance.
The proposed model
In this section, we first recall the general architecture of the CNN model in Section 3.1. Then, the architecture of the proposed model is presented in Section 3.2. In Section 3.3, the proposed data augmentation method is explained.
Convolutional neural network (CNN)
A traditional neural network (NN) receives its input as a 1D vector. The input is transformed through a series of hidden layers that end in an output layer. Each hidden layer consists of a set of neurons, where each neuron is fully connected to all neurons in the previous layer. The connections of each neuron are independent of those of all other neurons in the same layer.
On the other hand, CNNs make the explicit assumption that the input is a 3D volume (width, height and depth). A series of convolution and pooling operations is applied to the input. After that, a regular neural network is stacked on top of this series to learn the classification weights. This kind of architecture makes the forward function more efficient to implement and significantly reduces the number of parameters. The CNN structure is described as follows:
Convolutional layer: In this layer, a convolution operation is computed between two vectors: the input signal and a learned filter (kernel).
Activation function: This function is applied to each output of the convolution layer to introduce non-linearity, which allows the model to learn non-linear decision boundaries. Three activation functions are commonly used in this step: the sigmoid, the hyperbolic tangent and the ReLU, as shown in Equations (1–3), respectively.
$$\sigma(x) = \frac{1}{1 + e^{-x}} \quad (1)$$
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad (2)$$
$$\mathrm{ReLU}(x) = \max(0, x) \quad (3)$$
Pooling layer: This layer reduces the dimension of the feature representation produced by the convolution layer. The two common approaches are to take the maximum or the average over a pre-defined block (e.g. 1 × n) of the input data.
Fully connected layer: In this layer, the output of the series of convolution and pooling operations is flattened into a 1D vector. The flattened vector is used to learn the classification weights from the labeled data. One or more hidden layers can be added on top of the flattened layer to learn more complex representations.
Output layer: This is the last layer of the CNN, located on top of the last fully connected layer. Its aim is to compute the probability distribution over the predicted labels.
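The layer operations listed above can be illustrated on a toy signal. The following is a minimal NumPy sketch, not the implementation used in this work: the function names, the toy signal and the kernel are all hypothetical, and the convolution is written as an unpadded sliding dot product (cross-correlation), which is how most CNN libraries define it.

```python
import numpy as np


def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding (sliding dot product)."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])


def relu(x):
    """ReLU activation: max(0, x) applied element-wise."""
    return np.maximum(0.0, x)


def max_pool1d(x, size):
    """Non-overlapping max-pooling; trailing samples that do not fill a block are dropped."""
    usable = (len(x) // size) * size
    return x[:usable].reshape(-1, size).max(axis=1)


def softmax(z):
    """Numerically stable soft-max over a 1D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


# Hypothetical toy input: one channel, eight samples.
signal = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0])
kernel = np.array([1.0, 0.0, -1.0])

feature_map = conv1d_valid(signal, kernel)   # length 8 - 3 + 1 = 6
activated = relu(feature_map)                # negative responses set to zero
pooled = max_pool1d(activated, size=2)       # length 6 // 2 = 3
probs = softmax(np.array([1.0, 2.0, 3.0]))   # probability distribution over 3 labels
```

In a real CNN the kernel values are learned and many kernels are applied in parallel, each producing one feature map.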
In this research work, we propose a CNN architecture for online smartphone-based HAR, as shown in Fig. 1. The input of the model is the pre-processed raw data of the inertial sensors (accelerometer and/or gyroscope). We use mean-centering to transform the raw data into a form that is suitable for learning the optimal weights of the proposed model.

The proposed model architecture.
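The mean-centering step mentioned above can be sketched as follows. The text does not state whether the mean is computed per window or over the whole training set; this sketch assumes a per-channel mean estimated on the training set and reused at test time. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_channel_means(train_windows):
    """Per-channel mean over all training windows and time steps.

    `train_windows` has shape (n_windows, window_length, n_channels).
    """
    return train_windows.mean(axis=(0, 1))


def mean_center(windows, channel_means):
    """Subtract the training-set channel means (broadcast over windows and time)."""
    return windows - channel_means


# Synthetic stand-in for sensor windows: 20 windows, 128 samples, 3 axes,
# with a constant offset on the third axis (an assumption for illustration).
rng = np.random.default_rng(0)
train = rng.normal(size=(20, 128, 3)) + np.array([0.0, 0.0, 9.8])

means = fit_channel_means(train)
centered = mean_center(train, means)
```

Estimating the means on the training set only keeps test data out of the pre-processing statistics.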
In the first convolution layer, the model learns 100 convolution filters that transform the input data into a more detailed representation. The second convolution layer then learns 200 filters that transform the feature maps of the first layer into richer representations. The filters in both convolution layers have size 1 × 15, and the ReLU activation function is applied in both layers. We use the ReLU activation function for two reasons: experimentally it produces a higher accuracy, and it is computationally efficient compared to other activation functions. After that, the dimensionality of the output of the second convolution layer is reduced by a factor of four using a max-pooling layer of size 1 × 4.
The reduced representation produced by the max-pooling layer is then flattened into a 1D vector. The flattened vector is extended with the time-domain statistical features listed in Table 1. The extended vector is then connected to a fully connected layer of 512 neurons. The last layer of the architecture is a soft-max (output) layer, which calculates the probability distribution over the activity labels.
Time-domain statistical features
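The layer sizes given above determine the length of the vector that reaches the fully connected layer. The arithmetic below is a sketch under several assumptions that the text does not state: unpadded stride-1 convolutions, sensor axes treated as input depth, a window length of 128 samples, and a placeholder count of 10 statistical features. Only the filter counts, the filter size, the pooling size and the 512 neurons come from the description.

```python
KERNEL = 15         # filter size 1 x 15 (from the text)
FILTERS_1 = 100     # first convolution layer (from the text)
FILTERS_2 = 200     # second convolution layer (from the text)
POOL = 4            # max-pooling size 1 x 4 (from the text)
DENSE_UNITS = 512   # fully connected layer (from the text)

WINDOW_LENGTH = 128    # hypothetical: window length is not given here
N_STAT_FEATURES = 10   # hypothetical placeholder for the size of Table 1


def conv_out_len(length, kernel=KERNEL):
    """Output length of an unpadded, stride-1 convolution (assumption)."""
    return length - kernel + 1


len_conv1 = conv_out_len(WINDOW_LENGTH)    # 128 - 15 + 1 = 114
len_conv2 = conv_out_len(len_conv1)        # 114 - 15 + 1 = 100
len_pool = len_conv2 // POOL               # 100 // 4 = 25
flattened = len_pool * FILTERS_2           # 25 * 200 = 5000
extended = flattened + N_STAT_FEATURES     # 5000 + 10 = 5010

# Weights plus biases; conv2 filters span all FILTERS_1 input feature maps.
conv2_params = KERNEL * FILTERS_1 * FILTERS_2 + FILTERS_2   # 300200
dense_params = extended * DENSE_UNITS + DENSE_UNITS         # 2565632
```

Under these assumptions most parameters sit in the fully connected layer, which is why the pooling factor has a direct effect on model size.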
Finally, we feed the training data to the network and optimize the network parameters using Adam, a modified version of stochastic gradient descent, with the back-propagation algorithm. In the following section, we explain the data augmentation method proposed to improve the performance of the model.
In the context of deep learning, it has been shown that model performance improves when large-scale data is available [28]. However, since collecting more labeled data is expensive and time-consuming, data augmentation is commonly used as an alternative [28]. Data augmentation creates new labeled data from the original labeled data and uses the merged data set to train the model. New data is created by applying a label-preserving transformation to the raw data. Finding a label-preserving transformation is domain-specific. In image processing, for example, scaling, cropping and rotation are considered label-preserving transformations, as they are expected to occur in real-world applications.
For accelerometer signals in HAR, finding label-preserving transformations is more challenging than for images, since it is non-trivial to identify the activity label of a signal by the naked eye. Because the accelerometer signal is quasi-periodic, it can be manipulated without losing its general features. In this paper, we propose two label-preserving transformations for online smartphone-based HAR, as detailed below.
Circular shifting: In this transformation, each instance of the training data is circularly shifted by an integer value in [8, 24,... WindowLength – 8]. We chose this transformation because, after visualizing the data, we observed that a given part of the signal (event) occurs at different locations within the window for different instances of the same activity. The circular-shift transformation therefore reduces the dependency on the event location. Figures 2 and 3 show the effect of the circular-shift transformation on the walking and sitting activities, respectively.

The effect of circular shifting on walking activity.

The effect of circular shifting on sitting activity.
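The circular-shift transformation can be sketched with `numpy.roll`. This is an illustrative sketch rather than the implementation used in this work: the window length of 128 and the step of 16 between shift values are assumptions (the step is inferred from the first two listed values, 8 and 24), and the function name is hypothetical.

```python
import numpy as np


def circular_shift_augment(window, shifts):
    """Return one circularly shifted copy of `window` per shift value.

    `window` has shape (window_length, n_channels); all channels are shifted
    together along the time axis, so their alignment is preserved.
    """
    return [np.roll(window, shift, axis=0) for shift in shifts]


WINDOW_LENGTH = 128                            # hypothetical window length
shifts = range(8, WINDOW_LENGTH - 8 + 1, 16)   # 8, 24, ..., WINDOW_LENGTH - 8 (assumed step 16)

# Toy three-channel window whose values encode their own position.
window = np.arange(WINDOW_LENGTH * 3, dtype=float).reshape(WINDOW_LENGTH, 3)
augmented = circular_shift_augment(window, shifts)   # each copy keeps the original label
```

Samples that leave the end of the window re-enter at the start, so every shifted copy contains exactly the same samples as the original.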
Start-point shifting: This transformation is applied during windowing. For a run of successive windows of the same activity, we shift the starting point of the first window by n, where n < WindowLength. The first newly generated window consists of the last part of the first original window and the beginning of the second window, and so on for the other new windows. Fig. 4 shows how the start-point shifting transformation is implemented. This transformation is label-preserving because the newly generated instances are made up of parts of other instances of the same activity.

Generating new instances from successive walking upstairs activities.
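Start-point shifting can be sketched as segmenting the same continuous recording twice with different offsets. The sketch assumes that the original windows do not overlap and that `signal` is one uninterrupted run of a single activity; neither detail is confirmed by the text, and the function names and toy sizes are hypothetical.

```python
import numpy as np


def segment(signal, window_length, start=0):
    """Cut `signal` (n_samples, n_channels) into consecutive non-overlapping windows."""
    n_windows = (len(signal) - start) // window_length
    end = start + n_windows * window_length
    return signal[start:end].reshape(n_windows, window_length, signal.shape[1])


def start_point_shift_augment(signal, window_length, n):
    """Original windows plus the windows obtained by starting n samples later."""
    if not 0 < n < window_length:
        raise ValueError("n must satisfy 0 < n < window_length")
    original = segment(signal, window_length, start=0)
    shifted = segment(signal, window_length, start=n)
    return np.concatenate([original, shifted], axis=0)


# Toy run of one activity: 32 samples, 1 channel, windows of 8 samples, shift n = 4.
signal = np.arange(32, dtype=float).reshape(32, 1)
windows = start_point_shift_augment(signal, window_length=8, n=4)
```

The first shifted window here covers samples 4 to 11, that is, the second half of the first original window followed by the first half of the second one.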
This section is organized as follows. Section 4.1 gives a detailed description of the datasets used. Section 4.2 provides an in-depth analysis of the performance of the proposed model. Finally, in Section 4.3, we compare our model with state-of-the-art methods.
Datasets description
We used two publicly available datasets (UCI [23] and WISDM [24]) to train and evaluate the proposed model. We selected these datasets because they were collected from Android smartphones, which matches the smartphone-based setting of the proposed method. Another benefit of these datasets is that they have already been used in several recent related works [12–22]. The datasets are described in the following sections.
UCI dataset
This dataset is publicly available in two formats: raw data and handcrafted features. In this work, the raw data format is used. The data set was built with a group of 30 volunteers aged between 19 and 48 years. Each person wore a smartphone (Samsung Galaxy S II) on the waist and performed six different activities (walking, upstairs, downstairs, sitting, standing and lying). The accelerometer and gyroscope signals were collected at a sampling rate of 50 Hz. The dataset was divided into two parts: 70% of the volunteers were selected for the training set, while the other 30% were used for the test set. Specifically, the training set contains 7406 instances, while the test set contains 2993 instances. All instances are manually labeled with one of the six activities. The distributions of the training and test sets are listed in Tables 2 and 3.
UCI training data distribution
UCI test data distribution
The WISDM data set was built with a group of 36 volunteers. Each person carried an Android smartphone in a front leg pocket and performed six different activities (walking, jogging, walking upstairs, walking downstairs, sitting and standing). The triaxial linear acceleration signals were collected from the smartphone accelerometer at a sampling rate of 20 Hz. The dataset was not originally divided into training and test sets. However, the author of [17] splits the dataset in the way that yields the highest test error, in order to see to what extent the proposed work improves the performance. The volunteers with user IDs 1 to 26 were selected for the training set, while the other 10 users were used for the test set. Specifically, the training set contains 7367 instances, while the test set contains 3026 instances. All instances are manually labeled with one of the six activities. The distributions of the training and test sets are given in Tables 4 and 5.
WISDM training data distribution
WISDM test data distribution
In this section, we analyze the proposed model from three standpoints. First, we discuss how the parameters of the proposed model were selected (Section 4.2.1). Then, in Section 4.2.2, we discuss the performance of the proposed model in terms of recognition accuracy, together with the impact of the statistical features and data augmentation. Finally, in Section 4.2.3, the computational cost of the proposed model is analyzed.
Parameters selection
The proposed CNN model contains many hyper-parameters that should be chosen carefully, since they control the performance of the model. We use grid search to find the best combination of these parameters. The main parameters that we investigate are the pooling size, the dropout rate, the convolutional filter size, and the number of filters (feature maps) in each convolutional layer.
Pooling size: Fig. 5 shows the influence of the pooling size on the accuracy of the model. The maximum accuracy was obtained with a max-pooling size of four.

Dependency between max-pooling size and accuracy.
Convolution filter size: The effect of the filter size on the recognition accuracy of the model is shown in Fig. 6. In the initial search shown in Fig. 6, the maximum lies at a filter size of 16. We then searched additional points just before and after 16 to obtain better results, and found that 15 is the optimal filter size.

Dependency between convolution filter size and accuracy.
The number of convolution feature maps: Here, the number of feature maps in each convolutional layer was varied. The influence of this variation on the performance of the model is shown in Fig. 7. The maximum accuracy is achieved with 100 and 200 feature maps in the first and second convolution layers, respectively.

Dependency between the number of convolution feature maps and accuracy.
Dropout: The impact of different dropout rates on accuracy is depicted in Fig. 8. We found that the accuracy decreases continuously beyond a dropout rate of 0.1, so we investigated rates between 0 and 0.1 only.

Dependency between dropout rate and accuracy.
Regarding the number of convolution layers, the accuracy of the model decreases significantly when only one convolution layer is used, and decreases slightly when three convolutional layers are used. Therefore, we selected two convolution layers.
In this section, we evaluate the performance of the proposed model. In addition, we demonstrate the impact of the selected time-domain statistical features and the proposed data augmentation method on the recognition accuracy.
For the UCI dataset, we use the average accuracy to measure performance, since the test set of this dataset is balanced. Table 6 shows the accuracy of the proposed model under different settings. The statistical features improve the accuracy by 2.01% and 2.34% on the original and augmented data, respectively. Data augmentation improves the accuracy by 0.47% without statistical features and by 0.8% with them. Overall, the statistical features and the data augmentation method together increase the accuracy of the proposed CNN model from 95.72% to 98.53%. To our knowledge, this is the highest accuracy achieved for HAR on the UCI data set.
The impact of various techniques on the proposed model using UCI dataset
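The gains quoted above for the UCI dataset can be cross-checked with simple arithmetic. The sketch below treats each quoted improvement as an absolute difference in percentage points, which is an interpretation rather than something the text states. The two intermediate accuracies are derived here from the quoted numbers and are not copied from Table 6; the variable names are hypothetical.

```python
baseline = 95.72                 # CNN only, original data (quoted)
gain_stats_original = 2.01       # statistical features, original data (quoted)
gain_stats_augmented = 2.34      # statistical features, augmented data (quoted)
gain_aug_without_stats = 0.47    # augmentation, no statistical features (quoted)
gain_aug_with_stats = 0.80       # augmentation, with statistical features (quoted)

# Derived intermediate settings (not quoted in the text).
cnn_stats = round(baseline + gain_stats_original, 2)      # CNN + statistical features
cnn_aug = round(baseline + gain_aug_without_stats, 2)     # CNN + augmentation

# Both paths should arrive at the quoted final accuracy of 98.53.
via_stats_then_aug = round(cnn_stats + gain_aug_with_stats, 2)
via_aug_then_stats = round(cnn_aug + gain_stats_augmented, 2)
```

Both paths give 98.53, so the quoted gains are mutually consistent under this reading.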
In addition, we perform 3-fold user-based cross-validation to assess the generality of the result achieved on the UCI dataset. To do so, we select 9 users for testing and the other 21 users for training. This procedure was performed three times using three disjoint test sets, as listed in Table 7, which also shows the corresponding accuracy for each experiment.
3-fold user based cross validation result using UCI dataset
Based on the above table, the proposed model generalizes well, since the accuracy of each single experiment is close to the average accuracy.
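The user-based cross-validation procedure described above can be sketched as follows. The actual assignment of users to test sets is the one listed in Table 7; the random assignment, the function name and the synthetic user labels here are assumptions for illustration only.

```python
import numpy as np


def user_based_folds(user_ids, n_folds, test_users_per_fold, seed=0):
    """Build (train_idx, test_idx) pairs whose test users are disjoint across folds."""
    user_ids = np.asarray(user_ids)
    users = np.unique(user_ids)
    if n_folds * test_users_per_fold > len(users):
        raise ValueError("not enough users for disjoint test sets")
    users = np.random.default_rng(seed).permutation(users)
    folds = []
    for k in range(n_folds):
        test_users = users[k * test_users_per_fold:(k + 1) * test_users_per_fold]
        test_mask = np.isin(user_ids, test_users)
        folds.append((np.flatnonzero(~test_mask), np.flatnonzero(test_mask)))
    return folds


# Synthetic labels: 30 users with 5 instances each (hypothetical counts).
user_ids = np.repeat(np.arange(1, 31), 5)
folds = user_based_folds(user_ids, n_folds=3, test_users_per_fold=9)
```

With 30 users and three disjoint test sets of 9, three users never appear in a test set and are used for training in every fold.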
For the WISDM dataset, we use the F1-score to measure performance, since the test set of this dataset is unbalanced. Table 8 shows the F1-score of the proposed model under different settings. It is clear from Table 8 that the statistical features do not improve the performance of the proposed model as significantly as on the UCI dataset. The reason is the presence of noise, which reduces the quality of the statistical features. To illustrate the noise in the WISDM dataset compared to UCI, Fig. 9 depicts the walking downstairs activity in both datasets. Data augmentation improves the performance of the proposed model on WISDM by 0.85% without statistical features and by 0.66% with them. Overall, the statistical features and the data augmentation method improve the performance of the proposed CNN model from 93.85% to 95.13%. To our knowledge, this is the highest F1-score attained for HAR on the WISDM data set.
The implication of various techniques on the proposed model using WISDM dataset

Walking downstairs (a) in UCI. (b) in WISDM.
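The F1-score used for the unbalanced WISDM test set can be computed per activity from the predicted and true labels. The sketch below uses the standard definition F1 = 2TP / (2TP + FP + FN). How the per-class scores are averaged in this work is not stated, so the macro average shown is an assumption; the labels are a made-up toy example.

```python
import numpy as np


def per_class_f1(y_true, y_pred, n_classes):
    """F1 = 2*TP / (2*TP + FP + FN) for each class; 0 when the class never occurs."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom > 0 else 0.0
    return scores


# Toy unbalanced example: class 0 dominates, class 2 is rare and always missed.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]

scores = per_class_f1(y_true, y_pred, n_classes=3)
macro_f1 = scores.mean()                                    # assumed averaging scheme
accuracy = np.mean(np.asarray(y_true) == np.asarray(y_pred))
```

In this toy case the accuracy (5/7) hides the fact that the rare class is never recognized, whereas the macro F1-score is pulled down by it.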
To validate the proposed data augmentation method, we report the performance of the model for each activity on the original and augmented data in Tables 9 and 10 for the UCI and WISDM datasets, respectively. With the augmented data, the performance for each activity is at least similar to that with the original data. We conclude that the proposed data augmentation method is valid and suitable for smartphone-based HAR.
The impact of the proposed data augmentation method for each activity using UCI dataset
The impact of the proposed data augmentation method for each activity using WISDM dataset
From Tables 9 and 10, it is evident that the proposed model significantly enhances the accuracy of the static activities (sitting and standing). The majority of the related works fail to achieve high performance on the static activities, since these activities produce similar signals. This supports the robustness of our model.
In conclusion, the proposed model performs better on the UCI dataset than on WISDM for two reasons. First, in the UCI dataset the activity prediction is based on two sensors (accelerometer and gyroscope), while only the accelerometer is used in WISDM. Second, the WISDM dataset contains significant noise, since the smartphone was located in a front leg pocket of the person during data collection.
In this section, we confirm the applicability of the proposed model to online smartphone-based HAR, where the response time should be very small (i.e. real time). According to [17], it is enough to predict 1–5 activities per second. Therefore, we measure how many activities the proposed model can predict per second.
The proposed model was implemented using the TensorFlow Python library, which is also available for smartphones. We therefore deployed the proposed model on a Nexus 5X Android smartphone to make the validation more realistic.
To measure how many activities can be predicted per second, we use the proposed CNN model and a light CNN, a variation of the proposed model that adds a max-pooling layer between the convolutional layers. As shown in Table 11, the proposed model can predict six activities per second, which is appropriate for online smartphone-based HAR. For applications that require a higher prediction rate, the light CNN is recommended: it can predict 22 activities/s with a negligible drop in accuracy (less than 0.2), as shown in Table 11.
Computational cost of the proposed model for UCI dataset
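A prediction rate of this kind can be measured by timing repeated single-window predictions. The sketch below is a generic timing harness and not the measurement protocol used on the Nexus 5X: `dummy_predict` is a hypothetical stand-in for the deployed model, and the number of runs is an arbitrary choice.

```python
import time


def predictions_per_second(predict, window, n_runs=50):
    """Time `n_runs` single-window predictions and return the average rate."""
    predict(window)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(window)
    elapsed = time.perf_counter() - start
    return n_runs / elapsed if elapsed > 0 else float("inf")


def dummy_predict(window):
    """Hypothetical stand-in for a model's forward pass on one window."""
    return sum(window) > 0


window = [0.1] * 128   # hypothetical window of 128 samples
rate = predictions_per_second(dummy_predict, window)
```

On a real device, `predict` would wrap the model's inference call, and the resulting rate depends on the hardware and the runtime.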
In this section, we compare the performance of the proposed method with the related state-of-the-art methods. In Section 4.3.1, we explain the comparison setting in order to make the comparison more precise.
The numerical results of the comparison are given in Section 4.3.2. We compare the proposed model with a variety of machine learning approaches.
Comparison setting
We compare the proposed model against works that build a user-independent model. User independence means dividing the dataset into two disjoint sets (A and B), where A contains the activities of some users for training and B contains the activities of the other users for testing.
For the UCI dataset, the dataset originator already splits the data into training and test sets. The training set is made up of twenty-one users, whereas the test set is made up of the other nine users. Many works have used the UCI dataset to evaluate their models. Most of them use the average accuracy to evaluate performance, since the test set is balanced (see Table 3).
For the WISDM dataset, only two works [10, 17] developed user-independent models. In [10], leave-one-out validation is used to evaluate the model. Since it is computationally expensive to evaluate the CNN model using the leave-one-out approach, we use the split proposed in [17]. In [17], the authors select the 10 out of 36 users who yield the highest test error, in order to see to what extent the proposed work improves the recognition accuracy. The F1-score is used to evaluate performance, since the test set of this dataset is unbalanced (see Table 5).
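The user-independent split described in this section can be sketched as a mask over the user ID of each instance. The example follows the WISDM description given earlier (user IDs 1 to 26 for training, the remaining 10 users for testing); the function name and the number of instances per user are hypothetical.

```python
import numpy as np


def user_independent_split(user_ids, train_users):
    """Return (train_idx, test_idx) so that no user appears in both sets."""
    user_ids = np.asarray(user_ids)
    train_mask = np.isin(user_ids, list(train_users))
    return np.flatnonzero(train_mask), np.flatnonzero(~train_mask)


# Synthetic labels: 36 users with 3 instances each (hypothetical counts).
user_ids = np.repeat(np.arange(1, 37), 3)
train_idx, test_idx = user_independent_split(user_ids, train_users=range(1, 27))

train_user_set = set(user_ids[train_idx].tolist())
test_user_set = set(user_ids[test_idx].tolist())
```

Splitting by user rather than by instance prevents windows from the same person from appearing on both sides, which would otherwise inflate the measured performance.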
Result
In this section, we evaluate the performance of the proposed model against related works that use the UCI and WISDM datasets. To make the evaluation more comprehensive, we compare our method with approaches that use either shallow or deep learning. Table 12 shows that the proposed model outperforms the best state-of-the-art work [17] by 1.81% on the WISDM dataset. In addition, the proposed method outperforms the best reported work [17] by 0.90% on the UCI dataset, as shown in Table 13.
F1-score of the proposed smartphone-based HAR approaches using WISDM dataset
Recognition accuracy of the proposed smartphone-based HAR approaches using UCI dataset
To compare the performance of the proposed model in more detail, we compare it with the work in [17], which provides the highest previously reported recognition accuracy. Figures 10 and 11 show the per-activity F1-score of the proposed model compared to [17] for the UCI and WISDM datasets, respectively. On the UCI dataset, the proposed model outperforms the work in [17] in all activities. On the WISDM dataset, we significantly outperform the work in [17] in three activities (upstairs, downstairs and sitting) by 9% on average, whereas in the other three activities [17] is slightly better than our model, by 1% on average.

F1-score per activity for UCI dataset.

F1-score per activity for WISDM dataset.
In this paper, a deep convolutional neural network (CNN) model that provides robust smartphone-based HAR is proposed. We use the CNN model for automatic feature extraction from the raw time-series data, and simple statistical features to obtain more discriminative features. Furthermore, we proposed a novel data augmentation method and investigated its impact on the recognition accuracy. The performance of the proposed model is evaluated on two publicly available data sets (UCI and WISDM) that were collected using smartphones. The evaluation results demonstrate that the proposed model achieves state-of-the-art results on these datasets. In addition, the computational cost of the model is evaluated in order to demonstrate the applicability of the proposed model to online smartphone-based HAR.
In future work, we will explore new label-preserving transformations that could increase the robustness of smartphone-based HAR. In addition, we will try to improve the performance of the proposed model by placing a more sophisticated classifier on top of the CNN model instead of the simple soft-max classifier.
Acknowledgments
This work was supported by the Research Center of College of Computer and Information Sciences, King Saud University. The authors are grateful for this support.
