Abstract
Human activity recognition (HAR) is a crucial area of research in human-computer interaction. Despite previous efforts in this field, there is still a need for more accurate and robust methods that can handle time-series data from different sensors. In this study, we propose a novel method that generates an image using wavelet transform to extract time-frequency features of the recorded signal. Our method employs convolutional neural networks (CNNs) for feature extraction and activity recognition, and a new loss function that produces denser representations for samples, improving the model’s generalization on unseen samples. To evaluate the effectiveness of our proposed method, we conducted experiments on multiple publicly available data sets. Our results demonstrate that our method outperforms previous methods in terms of activity classification accuracy. Specifically, our method achieves higher accuracy rates and demonstrates improved robustness in real-world settings. Overall, our proposed method addresses the research gap of accurate and robust activity recognition from time-series data recorded from different sensors. Our findings have the potential to improve the accuracy and robustness of human activity recognition systems in real-world applications.
Introduction
Human activity recognition is a significant domain in ubiquitous computing and human-computer interaction. Researchers in this field have employed various machine learning methods to classify human activities, such as walking, running, and cooking [2,24]. Recognizing daily activities can lead to better habits. The methods proposed thus far can be categorized based on the type of sensors utilized, the field of application, and the employed algorithms. With the advancement of technology, especially in the manufacturing of smartphones and wearable sensors, data required for activity recognition can now be obtained from these sensors located on different parts of the body. Wireless technologies can also be utilized to transmit signals.
Due to the significant advancements in sensor manufacturing technology in recent years, various sensors have been proposed and utilized for activity recognition. Sensors can generally be categorized into three types.
Wearable sensors are the most commonly used sensors for human activity recognition due to their ease of use. These sensors are worn by the user on different parts of the body, such as the wrist, ankle, or neck. Accelerometers, gyroscopes, and magnetometers are often employed in these sensors. For instance, the user’s movement status can be detected based on changes in speed and angular velocity.
Object sensors are designed to detect the movement of objects. For instance, the activity of drinking water can be detected by placing an accelerometer sensor on the glass. Radio frequency identifier tags (RFIDs) are used as object sensors in smart homes and medical applications [17].
Ambient sensors can record the interaction between the user and the environment, and control changes in the environment. This category encompasses a variety of sensors such as radars, light, sound, temperature, humidity, and pressure sensors.
Wearable and mobile sensors are readily available and user-friendly, making it easier to work with their data than with images and videos. While video-based activity recognition requires extracting features such as the histogram of oriented gradients, using smartphones or wearable sensors involves employing less computationally complex statistical and frequency features. The use of surveillance cameras to detect human activity is riddled with limitations, such as privacy concerns and sensitivity to ambient lighting, which have spurred further research on wearable sensors [21].
Smartphones capture valuable data for activity recognition via various sensors, including accelerometers, gyroscopes, magnetometers, heart rate sensors, and body temperature sensors. These sensors can detect the user’s motor activities, as well as whether the user is in a bright or dark environment using proximity and light sensors [12]. Additionally, other sensors such as thermometers, barometers, humidity meters, and pedometers have been applied in some cases to assist the elderly [9].
Human activity recognition (HAR) involves recognizing activities performed by a person using sensors such as accelerometers, gyroscopes, and magnetometers. HAR data is characterized by being multivariate time series data, where each time series represents the sensor data from a specific sensor location [3]. The data is time-ordered and contains local and global patterns. Local patterns refer to significant changes in the data, while global patterns refer to overall trends. HAR data is challenging because it is noisy and contains a large amount of variability due to individual differences in the way people perform activities [22]. Additionally, the data is often unbalanced, meaning that some activities may be underrepresented in the dataset, which can lead to biased models [3]. These challenges make it difficult to develop accurate and robust HAR models.
Human activity recognition involves a pipeline of steps: preprocessing, segmentation, feature extraction, and classification model training. These steps are necessary to process the data collected from sensors. Preprocessing involves removing potential noise with filters, such as high-pass, low-pass, Laplacian, and Gaussian filters, to prepare the raw data for training. To segment the signals, techniques such as sliding windows and event- or energy-based methods can be used. Each technique aims to produce samples for model training. In the sliding window method, the signal is divided into short time windows, which yields a larger number of samples for model training. Furthermore, due to the periodic nature of human activities, a single time window provides sufficient information for activity recognition. Proper feature extraction from segmented signals is crucial for accurate classification. Different methods, including time-domain and frequency-domain features, have been proposed for feature extraction.
After collecting and preprocessing the data, the next step in human activity recognition is feature extraction. Two main approaches used in this stage are manual feature extraction and deep neural networks for feature extraction. Given the signal nature of the recorded data, various methods of time and frequency analysis have been employed for feature extraction [13,16]. The general process of detecting human activity is illustrated in Fig. 1.
Manual feature extraction methods rely on the expertise of the analyst and the features extracted for one problem may not be applicable to other problems [21]. To address these limitations, significant research efforts have focused on automating feature extraction through deep learning techniques [1,18,20].
In this study, we propose a novel method for HAR that combines wavelet transform and CNN-based feature extraction. The proposed method improves upon previous works in several ways. Firstly, we use wavelet transform to generate a time-frequency image of the raw sensor data, which captures the local and global features of the data and reduces the effects of noise. Secondly, we use a CNN-based feature extraction method to extract features automatically from the time-frequency image, which improves the accuracy of the model. Thirdly, we develop a deep network cost function that encourages the model to learn denser feature representations and enhances the model’s generalizability to new data. The contribution of this study is summarized as follows:
Proposing a model for human activity recognition using wearable sensors
Offering a method to generate an image of human activity using the wavelet transform
Combining time-frequency analysis methods and deep learning methods
Developing a deep network cost function to learn denser representations and enhance the model’s generalizability for new pieces of data
The structure of the paper is as follows. Section 2 provides an introduction to the fundamental concepts and presents a literature review. Manual feature extraction methods and deep learning approaches are discussed in detail. In Section 3, we propose our method for human activity recognition using wearable sensors. In Section 4, we present the results of our proposed method and compare them with other existing methods. Finally, Section 5 concludes the paper and provides recommendations for future research.

An illustration of sensor-based activity recognition using conventional pattern recognition approaches [26].
In this section, we first introduce the manual methods of feature extraction in the time and frequency domain and, then, introduce the neural network structures used in human activity recognition.
Manual feature extraction in time and frequency domains
Various statistical features in the time domain are commonly used for human activity recognition, such as mean, median, variance, and other order statistical moments like skewness and kurtosis. In addition to these statistics, other features, including percentiles, auto-correlation coefficient, Pearson’s correlation coefficient, as well as linear and quadratic regression coefficients, have been used in [30]. Furthermore, features such as the average absolute difference, average resultant acceleration, time between peaks, and binned distribution have been employed in [4].
The features mentioned above are dependent on the specific problem and may not be sufficient for detecting more complex activities. To overcome this limitation, researchers have explored features in the frequency domain. The Fourier transform is one of the most widely used transforms in this regard and has been employed in studies [13]. After applying the Fourier transform to the raw signal, features such as transform coefficients, energy, entropy, and DC components are extracted.
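The time-domain and frequency-domain descriptors listed above can be illustrated with a minimal NumPy sketch. The function names, the choice to exclude the DC bin from the energy and entropy, and the normalization are assumptions made for illustration; the cited studies may define these features differently.

```python
import numpy as np

def time_domain_features(x):
    """Basic statistical descriptors of a 1-D signal window."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()
    # Standardized samples; a constant window has no spread, so use zeros.
    z = (x - mu) / sigma if sigma > 0 else np.zeros_like(x)
    return {
        "mean": float(mu),
        "median": float(np.median(x)),
        "variance": float(x.var()),
        "skewness": float(np.mean(z ** 3)),
        "kurtosis": float(np.mean(z ** 4)),
    }

def frequency_domain_features(x):
    """DC component, spectral energy, and spectral entropy of a window."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    power = np.abs(spectrum) ** 2
    ac_power = power[1:]                      # assumption: exclude the DC bin
    total = ac_power.sum()
    p = ac_power / total if total > 0 else np.zeros_like(ac_power)
    p = p[p > 0]                              # avoid log(0)
    return {
        "dc": float(spectrum[0].real / len(x)),   # equals the window mean
        "energy": float(total / len(x)),
        "entropy": float(-(p * np.log(p)).sum()),
    }

# Usage: a 5 Hz sinusoid with an offset of 2, sampled at 100 Hz for 1 s.
t = np.arange(100) / 100.0
window = 2.0 + np.sin(2 * np.pi * 5 * t)
td = time_domain_features(window)
fd = frequency_domain_features(window)
```

For a pure sinusoid, almost all spectral power falls into a single bin, so the spectral entropy is close to zero; noisier or more irregular movements spread power across bins and raise it.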
The Fourier transform can accurately extract the frequency characteristics of a signal but lacks information on when the frequency occurs. To retain information in the time domain, multi-resolution analysis methods such as the short-time Fourier transform and spectrograms are utilized. Laput and Harrison combined time-frequency spectral features with convolutional networks to achieve 95.2% classification accuracy over 25 atomic hand activities of 12 people [16]. Similarly, Fan et al. [7] proposed the development of time-angle spectrum frames to represent spectral power variations along time in different spatial angles of the RFID signals. Zheng et al. [35] proposed varying levels of interference to the spectrum of radio signals, capturing the crucial frequency variation over time. The method utilizes deep learning, employing short-time Fourier transform to transform the signals into the frequency domain. Augmented signals are reconstructed using inverse Fourier transform and used alongside the original signals to train a deep neural network.
Deep learning in human activity recognition
To overcome the limitations of manual methods of feature extraction, the use of deep neural networks for feature extraction has gained considerable attention. Various types of neural network structures have been utilized in human activity recognition, including deep feed-forward, recurrent, convolutional, generative, and hybrid models. These models are capable of automatically learning features that are most relevant to the recognition of human activities, without the need for human experts to manually extract features.
In their study, Hammerla et al. [11] compared the performance of deep feed-forward, convolutional, and recurrent networks in extracting features for HAR. The results indicated that convolutional and recurrent structures outperformed the feed-forward network. This is likely due to the fact that fully-connected networks typically have a large number of parameters and may not be optimal for feature extraction. Instead, they are often used for classification purposes.
Human activity recognition involves processing temporal data, making recurrent networks an attractive option due to their ability to handle sequential data. However, training recurrent networks can be challenging due to the issue of vanishing gradients, and they often require high computational resources. To optimize neural network performance, Edel et al. proposed a model where the weights of all layers were binary (0 or 1), which allowed for significant reductions in computation and memory usage while maintaining accuracy [6]. In addition, Yuan et al. introduced a recurrent neural network with multi-view attention mechanisms for multiple time series in [29], achieving promising results in the human activity recognition problem. Recently, several studies have proposed combinations of recurrent and convolutional networks for human activity recognition [18,20,28]. In Mutegeki et al. [20], raw signals were first filtered by a 1-D convolutional network, and then, the Long Short Term Memory (LSTM) network was used to extract temporal features. Li et al. [18] improved upon this approach by replacing the convolutional block with the residual convolutional block and using bidirectional LSTM cells instead of traditional ones. In contrast to these methods, Xia et al. [28] first extracted temporal features using an LSTM network, and then captured spatial features using a convolutional network.
Convolutional networks have several advantages over fully connected networks, including sparsity of connections, parameter sharing, and space-invariant representation. These networks have been successfully applied in audio and image processing tasks. To utilize convolutional networks in human activity recognition, the data dimensions and parameter sharing need to be considered. Wearable sensor data typically consist of one-dimensional time series data. There are two general approaches to preparing data for a convolutional network: data-driven and model-driven [26].
Data-driven methods consider the data of different sensors in separate time series. After applying one-dimensional convolution to these time series and merging them, classification is performed. An example of this method was examined in [31]. Parameter sharing between different sensors was examined in [10,31]. Chen et al. replaced the one-dimensional convolutional layer with a two-dimensional one and obtained better results compared to manual feature extraction and the use of classifiers such as the support vector machine [5]. However, data-driven methods do not consider the relationship between different sensors, which can impact the model’s performance.
Model-driven methods keep the model fixed and reshape the data to fit it. In [13], different time series and their permutations are stacked line by line to yield a signal image. The Fourier transform is then applied to the signal image to obtain the activity image. Finally, feature extraction and classification are performed with a two-dimensional convolutional network. Although these methods outperform data-driven methods, they require expert knowledge to generate appropriate inputs.
To capture longer-term dependencies and increase the convolution kernel receptive field, dilated convolutional networks have been proposed. Hamad et al. [1] utilized a dilated causal convolutional network with a self-attention mechanism to reduce the number of convolutional network parameters. This approach allowed the model to learn both short-term and long-term temporal features efficiently while avoiding overfitting.
Tang et al. [25] proposed a triplet cross-dimension attention model for HAR that incorporated three attention parts to enable cross-interaction between sensor, temporal, and channel dimensions. The model was tested with different backbone structures, including simple and residual convolutional networks, and showed promising results. However, one of the main challenges of these methods is the complexity of the attention operation, which increases exponentially with larger input sizes. Zhang et al. [32] developed a system that combined CNNs with an attention mechanism for activity recognition using smartphone data. Here, attention is incorporated into multi-head CNNs to facilitate efficient feature extraction and selection.
Trained HAR models face a significant challenge in that the performance of the classifier is highly sensitive to the context of the sensor and to the engineered features. Rokni et al. [2] proposed personalizing their models with transfer learning. During training, a CNN is first trained with data collected from a few participants (source domain). During the test phase, only the top layers of the CNN are fine-tuned with a small amount of data from the target users (target domain), so annotation for target users is required. To address distribution discrepancies, a generative adversarial network (GAN) can also be used. Sah and Ghasemzadeh [23] demonstrated that activity recognition models are highly vulnerable to adversarial attacks and proposed a robust adversarial training model. Similarly, Zheng et al. proposed DL-PR (Deep Learning with Priori Regularization), a deep learning method for automatic modulation classification (AMC) in cognitive radios. DL-PR incorporates a priori regularization that guides loss optimization during model training by increasing inter-class distance and reducing intra-class distance. This regularization factor utilizes inter-class confrontation, global divergence, and dimensional divergence [34].
Wavelet transform
Like other transforms, the wavelet transform is defined as continuous and discrete. The continuous wavelet transform is calculated using Eq. (1):
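For reference, the standard textbook form of the continuous wavelet transform is given below. The symbols are assumptions and may differ from the authors' notation: x(t) is the signal, ψ is the mother wavelet (with ψ* its complex conjugate), a is the scale, and b is the translation.

```latex
W_x(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt
```

Small scales compress the wavelet and respond to high-frequency content, while large scales stretch it and respond to slow trends.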
The proposed method
In the proposed approach, the first step is to preprocess the data, followed by generating an activity image. Then, a convolutional network with a modified cost function is used to extract the features. This section presents an introduction to the proposed method and its key components.
Preprocessing and sample generation
The first step of the proposed approach is to generate samples from the raw signal. The raw signal is divided into segments by a sliding window to form the set of samples. Depending on the evaluation method, these samples are then split into training and testing sets. Sliding windows can be either full-non-overlapping or semi-non-overlapping. Since the semi-non-overlapping window generates more samples, this type of segmentation was used. Considering the repetitive nature of the activities and based on the recommendation of [14], the size of the sliding window is set to 5 seconds with a 50% overlap.
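The segmentation step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sliding_windows` and the array layout (samples along the first axis, sensor channels along the second) are assumptions, while the 5-second window and 50% overlap come from the text.

```python
import numpy as np

def sliding_windows(signal, fs, window_s=5.0, overlap=0.5):
    """Cut a (n_samples, n_channels) recording into overlapping windows.

    Returns an array of shape (n_windows, window_length, n_channels).
    """
    signal = np.asarray(signal)
    win = int(round(window_s * fs))                   # samples per window
    step = max(1, int(round(win * (1.0 - overlap))))  # hop between windows
    if len(signal) < win:
        # Recording shorter than one window: no samples can be produced.
        return np.empty((0, win) + signal.shape[1:], dtype=signal.dtype)
    starts = range(0, len(signal) - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

# Usage: 20 s of hypothetical 6-channel data sampled at 100 Hz.
recording = np.arange(2000 * 6, dtype=float).reshape(2000, 6)
windows = sliding_windows(recording, fs=100)
```

With these settings each window holds 500 samples and consecutive windows start 250 samples apart, so a 20-second recording yields 7 windows.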
After generating the samples from the raw signal, the proposed approach generates an activity image by applying the wavelet transform at different scales. The samples generated in the previous step include several time series, each corresponding to an axis of a sensor. The wavelet transform generates an image for each time series, with the horizontal axis representing time and the vertical axis denoting the frequency components. The number of frequency components is equal to the number of wavelet transform scales. The size of the horizontal axis is determined by the sampling frequency of the dataset and the size of the time window.
In this operation, the process is repeated for each time series, and the resulting images are stacked as separate channels to create a single activity image. To ensure consistency across different datasets, the image is converted into a square shape, where the vertical axis representing the wavelet transform scales remains constant, and the horizontal axis (time) is transformed into the number of scales using interpolation. Since the cost function used in the proposed method is dependent on the dimensions of the feature space, it is important to equalize the dimensions of the activity images from different datasets to enable comparison. Additionally, the design of the convolutional network should be adapted to accommodate different data dimensions.
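The image-generation procedure described above can be sketched with NumPy alone. This is an illustrative approximation under stated assumptions: the Mexican hat wavelet is used as the mother function, the scales are taken as the integers 1 to 64, the wavelet is truncated at five scale widths, and the time axis is resized by linear interpolation. All function names are hypothetical, and a dedicated wavelet library may discretize the transform differently.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def cwt_mexican_hat(x, scales):
    """Discretized CWT of a 1-D signal; returns (n_scales, n_samples)."""
    x = np.asarray(x, dtype=float)
    out = np.empty((len(scales), len(x)))
    for i, a in enumerate(scales):
        half = int(np.ceil(5 * a))            # assumption: +/- 5 scale widths
        t = np.arange(-half, half + 1) / a
        kernel = mexican_hat(t) / np.sqrt(a)
        # The wavelet is real and symmetric, so convolution equals correlation.
        full = np.convolve(x, kernel, mode="full")
        out[i] = full[half:half + len(x)]     # keep the centered part
    return out

def resize_time_axis(image, new_width):
    """Linearly interpolate each row of a (rows, time) image to new_width."""
    old = np.linspace(0.0, 1.0, image.shape[1])
    new = np.linspace(0.0, 1.0, new_width)
    return np.stack([np.interp(new, old, row) for row in image])

def activity_image(window, n_scales=64):
    """Map a (n_samples, n_channels) window to (n_scales, n_scales, n_channels)."""
    scales = np.arange(1, n_scales + 1)
    channels = [
        resize_time_axis(cwt_mexican_hat(window[:, c], scales), n_scales)
        for c in range(window.shape[1])
    ]
    return np.stack(channels, axis=-1)

# Usage: one hypothetical 5 s window of 6 time series sampled at 100 Hz.
rng = np.random.default_rng(0)
sample = rng.standard_normal((500, 6))
image = activity_image(sample, n_scales=64)
```

Each time series first produces a 64 × 500 scalogram (scales by time samples); interpolation then shrinks the time axis to 64 columns, and the six scalograms are stacked as channels.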
For instance, the USC-HAD dataset contains 6 time series recorded at a frequency of 100 Hz. By applying the wavelet transform with 64 scales to each series, the resulting image will have dimensions of

Sample activity images generated for walking, standing and running from USC-HAD dataset.
Given that the input is transformed into an image, the convolutional network is a good choice and is, therefore, used in the proposed method. The proposed structure consists of several convolutional blocks which are introduced below.
In each block, the input to the convolutional block is filtered by several 5×5 filters and the ReLU function is applied to them. Finally, a 2×2 maximum pooling layer extracts the most prominent features.
The proposed structure consists of five convolutional blocks, each with the same internal structure. For experiments with an input size of
After the features are extracted by the convolutional blocks and converted into a one-dimensional vector, a 512-dimensional representation vector is created for each activity. A fully connected linear layer is then placed for the final classification. The number of outputs of this linear layer is equal to the number of activity classes, indicating the probability that the sample belongs to each class.
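To make the effect of the five blocks on the feature-map size concrete, the short trace below follows the spatial dimension through the network. It assumes that the 5×5 convolutions are padded so that only the 2×2 pooling changes the size; the padding scheme and the number of filters per block are not stated here, so the helper `trace_feature_map` is purely illustrative.

```python
def trace_feature_map(size, n_blocks=5, pool=2):
    """Spatial size after each block, assuming size-preserving convolutions
    followed by non-overlapping pool x pool max pooling (an assumption)."""
    sizes = [size]
    for _ in range(n_blocks):
        size = size // pool      # each pooling layer halves height and width
        sizes.append(size)
    return sizes

sizes_64 = trace_feature_map(64)
sizes_32 = trace_feature_map(32)
```

Under these assumptions a 64-pixel activity image shrinks to a 2×2 map after five blocks and a 32-pixel image shrinks to 1×1, so the flattened vector length is the final spatial area multiplied by the number of filters in the last block.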

Proposed convolutional network.
The network structure and cost function are two factors that significantly impact the final performance of the model. To optimize the model’s performance, we developed a loss function based on the angular triplet-center loss (ATCL) proposed by Li et al. [19]. This cost function aims to train the model to generate feature representations that are more similar for samples belonging to the same class while keeping the representations of samples from different classes as far apart as possible. The class centers are defined as learnable parameters, enabling the distance between the centers to be maximized by adding a constraint to the cost function. The final cost function is calculated based on Eq. (2):
In the second part of Eq. (2), C is the number of classes. This part is equal to the average distance between the centers; thus, the average distance between the centers will increase as the overall cost function is minimized. Because the centers are themselves learned and therefore change during training, optimizing this cost function alone can become unstable. Therefore, to maintain stability in training, the softmax cross-entropy loss has also been added: it has no learnable parameters of its own and is generally stable.
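One way the pieces described above could fit together is sketched below in NumPy (forward computation only). This is not the authors' Eq. (2): the angular distance is assumed to be arccos of the cosine similarity divided by π, the center-separation term is assumed to enter with a negative sign and unit weight, and the names `sketch_loss`, `margin`, and `lam` are hypothetical.

```python
import numpy as np

def angular_distance(a, b):
    """Pairwise angular distance in [0, 1]: arccos(cosine similarity) / pi."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.arccos(np.clip(a @ b.T, -1.0, 1.0)) / np.pi

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed in a numerically stable way."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def sketch_loss(features, logits, labels, centers, margin=0.5, lam=10.0):
    n = len(labels)
    rows = np.arange(n)
    d = angular_distance(features, centers)      # (n_samples, n_classes)
    d_own = d[rows, labels]                       # distance to own center
    d_other = d.copy()
    d_other[rows, labels] = np.inf                # mask out the own class
    d_nearest = d_other.min(axis=1)               # nearest other center
    # Triplet-center hinge: own center must be closer by at least the margin.
    triplet_center = np.maximum(0.0, d_own + margin - d_nearest).mean()
    # Average pairwise distance between class centers (to be maximized).
    upper = np.triu_indices(len(centers), k=1)
    center_spread = angular_distance(centers, centers)[upper].mean()
    return cross_entropy(logits, labels) + lam * triplet_center - center_spread

# Usage: three classes whose centers are mutually orthogonal.
centers = np.eye(3)
features = np.eye(3)                 # each sample sits exactly on a center
logits = 10.0 * np.eye(3)
good = sketch_loss(features, logits, np.array([0, 1, 2]), centers)
bad = sketch_loss(features, logits, np.array([1, 2, 0]), centers)
```

In the usage example, samples that coincide with their own class center incur no hinge penalty at a margin of 0.5, whereas the same samples with shuffled labels are penalized heavily.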
In each stage of the training, the value of the cost function is first calculated for a batch of samples. Subsequently, the cost function gradient is calculated with respect to all the model parameters, including convolutional network parameters and cost function parameters (centers). Finally, the parameters are updated based on the gradient values and by using the Adam optimizer [15].
The initial value of the learning rate is set to 0.001. To avoid entrapment in local optima, the learning rate is exponentially reduced according to Eq. (3):
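A common form of exponential decay is sketched below as an illustration. The initial learning rate of 0.001 comes from the text; the per-epoch decay factor of 0.95 and the function name `exponential_lr` are assumptions, and the authors' actual schedule may use a different rate or step unit.

```python
def exponential_lr(epoch, initial_lr=0.001, decay_rate=0.95):
    """Learning rate after `epoch` epochs: initial_lr * decay_rate ** epoch."""
    return initial_lr * decay_rate ** epoch

# Learning rate at the start of the first few epochs.
schedule = [exponential_lr(e) for e in range(5)]
```

Because the rate shrinks by a constant factor at every step, early updates are large enough to move away from poor regions while later updates become progressively finer.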
In this section, we discussed the proposed method and its components, which include activity image generation, the convolutional network architecture, and model training.
Experimental results
In this section, we first introduce the evaluation methods and then examine the components of the proposed method. Finally, the proposed method is compared with the baseline methods.
Evaluation method
Evaluation of human activity recognition models is challenging as there are no standard guidelines for this task. Unlike image classification problems, there is no separate evaluation set available for this task. To evaluate and compare different models, various methods have been proposed, including K-fold cross-validation, leave-one-trial-out cross-validation (LOTO), and leave-one-subject-out cross-validation (LOSO). These methods are explained in more detail below.
In K-fold cross-validation, the samples are randomly divided into k subsets, and one subset is used for evaluation while the others are used for training. This process is repeated k times, with each subset used once for evaluation. The evaluation criterion is then averaged across all k folds to estimate the model’s performance. However, due to the overlap of generated samples, there may be data leakage between the training and testing sets, which can lead to overestimating the model’s performance.
To address the issue of data leakage in training, the leave-one-trial-out cross-validation method has been proposed [14]. In this method, the raw data is first divided into trials, where each trial corresponds to a sequence of activities. Then, samples are generated using a sliding window technique, and, as in the K-fold method, one trial is held out as the evaluation set at each iteration. The final evaluation metric is calculated as the average of the performance on all test sets. However, depending on how the trials are divided, samples from the same person may exist in both the training and testing sets, which can result in data leakage.
Another method for evaluating the performance of the model is leave-one-subject-out cross-validation, which is similar to K-fold cross-validation except that the division is based on the subjects. In each iteration, the data of one subject are used as the test data, and the model is trained with the remaining data. This method provides a more realistic estimate of the model’s performance because the model is always evaluated on a person it has not seen during training. Therefore, we chose to use this evaluation method for our study.
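The subject-wise split can be sketched as a small generator. The function name `leave_one_subject_out` and the assumption that every sample carries a subject identifier are illustrative choices, not part of the original description.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Usage: five samples recorded from three hypothetical subjects.
ids = [1, 1, 2, 2, 3]
folds = list(leave_one_subject_out(ids))
```

Because all windows of the held-out subject go to the test set together, overlapping windows from the same recording can never be split between training and testing.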
In classification problems, accuracy is the most commonly used evaluation criterion, which is defined as the ratio of the correctly predicted samples to the total number of predictions. In this research, accuracy is the evaluation criterion, as defined by Eq. (4). Additionally, to evaluate the performance of the cost function, the reduced-dimension representations of the features will be compared visually.
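Written out, the definition of accuracy stated above takes the following form. The symbol names are assumptions chosen for illustration and may differ from the authors' notation.

```latex
\mathrm{Accuracy} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}}
```

Here N_correct is the number of test samples whose predicted activity matches the true label, and N_total is the total number of predictions.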
The tested datasets are MHealth [3], USC-HAD, and WISDM.
The impact of image size on model performance was evaluated by comparing activity images generated at 32 and 64 pixels. Results are presented in Table 1. Smaller images provide lower resolution and less information; increasing the image size therefore yields more distinguishable activity images and can improve model performance. However, larger image sizes result in higher computational complexity, so sizes larger than 64 pixels were not used in this study.
Effect of activity image size on model accuracy
The choice of the wavelet mother function is crucial in extracting time-frequency features. To investigate this effect, three mother functions (Morlet, Gaussian, and Mexican hat) are compared in Table 2. The results indicate that the mother function used to generate the activity image plays a vital role in the final performance of the model. Among the mother functions examined, the Morlet function underperformed, while the Mexican hat and Gaussian functions performed similarly. Consequently, the Mexican hat mother function is adopted for the subsequent experiments.
Effect of wavelet mother function on model accuracy
The proposed cost function uses an angular distance that is bounded between 0 and 1. The margin requires that the distance from a sample to the nearest center of another class exceed the distance from the sample to its own class center by at least the margin value. If the margin value is 0, the cost function can be minimized by a degenerate solution in which all the representations are the same and all distances are zero. Three values of 0.4, 0.5 and 0.6 are investigated for the margin, and the final results are compared in Table 3. The learned representations for different classes of activity are also presented in Fig. 4.
Effect of margin value (m) on model accuracy

Learned representations with the proposed loss function for m = 0.6, 0.5 and 0.4, from left to right respectively.
The qualitative analysis of the representations reveals that the models have relatively similar performance. However, it should be noted that this representation is limited to three dimensions and does not capture the full feature space. Quantitative comparison of the results shows that a margin value of 0.5 provides the best performance.
In the proposed cost function, the weight of the first part relative to the cross-entropy is denoted by λ. Increasing this coefficient means that the model will pay more attention to reducing the first part of the cost function. However, assigning large values to this coefficient can cause instability in the training process, even though the ultimate goal is to reduce the first part of the cost function. In our experiments, the value of the second part of the cost function was nearly 10 times that of the first part. Therefore, we examined three values of λ (10, 20, and 50) and compared the final results in Table 4. The learned representations for different classes of activity are also shown in Fig. 5.
Effect of λ value on model accuracy

Learned representations with the proposed loss function for λ = 50, 20 and 10, from left to right respectively.
To investigate the effect of the proposed cost function, we kept the input and model constant and compared the performance of three different cost functions: cross-entropy, angular triplet center loss (ATCL), and the proposed cost function. The final results of this comparison are presented in Table 5. For a visual comparison, we also display the learned representations for different classes of activity in Fig. 6.
The comparison of the learned representations indicates that the proposed cost function has produced denser representations, where the samples of a class are closer to each other and farther from other classes. This can enhance the model’s generalizability for evaluation data, as it promotes the discrimination between different classes. A comparison of the confusion matrices reveals that utilizing the suggested loss function enabled the network to learn more distinguishable features, resulting in improved classification accuracy (see Tables 6 and 7).
Effect of loss function on model accuracy

Learned representations with: proposed loss function, ATCL and cross entropy from left to right respectively.
USC-HAD dataset confusion matrix using cross entropy
USC-HAD dataset confusion matrix using proposed loss function
Based on the experiments conducted, the parameters of the proposed model are defined as shown in Table 8. The baseline methods compared include studies by Chen [5], Jiang [13], and Ha [10]. The final results are given in Table 9.
In Table 9, the proposed method was tested on the WISDM dataset using an activity image size of 32 due to hardware limitations. Nevertheless, the accuracy of the proposed method outperformed the baselines. Comparing the accuracy of the baselines in Table 9 with that of the proposed method, considering different cost functions in Table 5, indicates that the generation of the activity image in the proposed convolutional structure has enhanced the model’s performance. Moreover, incorporating the developed cost function has further improved the model’s prediction accuracy.
Hyper-parameters of proposed model
Comparison of proposed model and baselines
To exploit the temporal and frequency features of the signals recorded by sensors, the proposed method adopted the wavelet transform, which is a multi-resolution analysis method. After generating the activity image by using the wavelet transform, feature extraction and classification were performed by a convolutional neural network. For the network training, a cost function based on the angular triplet-center loss was presented to learn denser representations. The results showed that the proposed method outperformed the baselines.
The proposed method is divided into three parts, and suggestions are made for improving each part.
In the first part, the wavelet transform is utilized to generate the activity image that is fed into the convolutional network. The selection of the wavelet transform mother function is crucial and depends on the application domain. To further enhance the performance of the model, it is recommended to explore various functions or even introduce a new mother function that can be learned by the network. Moreover, other transforms such as the curvelet transform can also be explored as an alternative to the wavelet transform to improve the quality of the activity image. Because the input is converted into an image, many models proposed in other research areas can be applied to human activity recognition. For example, the wavelet convolutional network [8], which combines multi-resolution frequency analysis with convolutional networks, may be a suitable option for this problem.
In the field of human activity recognition, there are several other challenges that need to be addressed. These include online activity recognition, unsupervised learning, and developing more optimal and smaller models that can be efficiently used in smartphones.
Conflict of interest
None to report.
