Abstract
The healthcare sector aims to treat patients as quickly as possible with the right care. A healthcare monitoring system serves two purposes: tracking the patient’s activities and tracking the patient’s overall health. Prompt treatment, such as giving suitable medication, administering an injection, or providing additional medical help, requires nursing supervision. Wearable sensors attached to the patient’s body can follow the patient’s health, and these IoT medical devices let clinicians diagnose patients and follow their care remotely. However, IoT devices produce far more data than can be handled manually, so a model for automated analysis is required. This study therefore proposes a Convolutional Neural Network with Long Short-Term Memory (CNN-LSTM) as a Hybrid Deep Learning Framework (HDLF) for a Patient Activity Monitoring System (PAMS) that covers healthcare activities and their classes. To involve medical specialists from around the world and improve treatment outcomes, the framework distributes patient activities, health conditions, medications, and other activity data through the cloud. The study also describes an architecture for wearable-sensor-network-based Human Action Recognition that combines Simple Recurrent Units (SRUs) and Gated Recurrent Units (GRUs). Deep SRUs with a variety of internal memory states are used to process the multimodal input sequence, and deep GRUs store and learn the information carried to future states, which addresses accuracy oscillation and instability caused by vanishing gradients. The proposed CNN-LSTM is compared with several currently used algorithms and achieves an accuracy of 99.53%, at least 4.73% higher than the existing results.
Keywords
Introduction
Neural networks are Machine Learning Models (MLM), also called Artificial Neural Networks (ANN) or Simulated Neural Networks (SNN). Deep Learning (DL), which is built on ANNs, is a subset of the Machine Learning (ML) field. The healthcare industry is expected to rely heavily on machine learning algorithms in the near future. Several ML techniques have already been applied to build clinical decision support systems and learning models that better meet application requirements. For instance, Support Vector Machines (SVM) are used to classify different types of breast cancer.
SVM, however, must be combined with ANN models to provide reliable information. In clinical settings, these algorithms are then used to monitor patient behavioural patterns and find anomalies in them. The purpose of this work is therefore to use machine learning methods to track and identify patient activities, as shown in Fig. 1.

Fig. 1. Proposed model for Patient Activity Monitoring.
Machine learning methods are used in a wide range of applications, including natural language processing, computer vision, regression, and time-series forecasting. A neural network loosely mimics the human brain by identifying patterns and correlations in data, which makes it well suited to image processing and classification problems. The TensorFlow framework and the Keras library for Python make building neural network models more straightforward. With these Python packages, the required results can be analyzed efficiently, ANNs can be used to create accurate prediction models, and computation can run in parallel on a GPU. Using a deep learning model (DLM), a CNN (Convolutional Neural Network) can extract features from videos, which are sequences of individual images. Once the data has been processed and classified, a precise prediction analysis can be offered.
In the proposed system (see Fig. 2), patients wear internet-connected wearable sensors for regular monitoring of their activities. The sensors track the patient’s actions and update the data centre server, where physicians can access the information to conduct further analysis and continue the patient’s therapy.

Fig. 2. Components involved in PAM.
Using the relevant videos, the proposed neural network model can identify stretching movements of the human body. The dataset consists of 280 videos with a typical duration of 13–15 seconds, and the frames of each video need to be classified as “stretching” or “no stretching” according to the motion or task performed by the person. To keep the problem easy to grasp, we focused on detecting a single person in each video. Most of the work was done in Python using the Keras library. Several deep learning architectures were tested for processing video data, and the architecture was developed in five stages, from simple to complex. This work is one such test. Various training models are used to identify human activities, after which the classification model can predict the class of individual video frames with high precision. To perform video classification, we extract the frame data from the videos and build a simple deep learning architecture. Figure 3 provides a top-down view of the workflow, which facilitates comprehension of the entire architecture.

Fig. 3. Video Processing and Classification Task.
Before beginning their therapy, patients should stretch while engaging in regular activities. The goal of this research project is to support healthcare monitoring by installing an intelligent camera that records footage and sends it to a system engineer or administrator, who analyzes the video file and predicts the outcome. Because each video contains a large number of frames, memory management must be addressed together with classification. Classifying the activity in the video is the main challenge addressed in this article. The complete procedure is therefore divided into three parts: (i) preparing the data, (ii) learning, interpreting, and classifying the frames, and (iii) LSTM-based classification.
The main objective of this paper is to develop and apply a novel hybrid deep learning system for action recognition in videos. The movement of interest is the stretching of the body. It is therefore necessary to build and deploy a hybrid deep learning classifier that identifies the stretched body.
The major goal of this research is to improve the effectiveness of patient activity classification, so that medical professionals can take prompt action to care for patients based on their activities. This article therefore proposes a CNN-LSTM hybrid model to analyze and classify the images extracted from patient monitoring recordings. As part of the paper’s contribution, the input video file is loaded and read, divided into segments, and then divided into frames. Each segment and each frame is then brought to the same size. Finally, frames containing irrelevant objects are removed as far as possible. The CNN receives the resulting batch of video frames for learning and analysis, and the CNN output is then passed to the LSTM to increase classification accuracy.
The novelty of this research is that it builds a hybrid deep learning model that improves classification accuracy using video frames as input images. Its effectiveness is assessed by comparing the results with those of existing algorithms.
Literature review
To understand the problem of video processing and Patient Activity Recognition (PAR), this work reviewed several prior research techniques. Malik Ali Gul et al. [1] covered a patient monitoring system and abnormal activity identification based on a CNN model. To train the CNN, the authors labelled individual frames in a large dataset of videos of people depending on the layer and the activities of the subjects. The proposed method has an efficiency of 89.2%. Marco Bellantoni et al. [2] concentrated on spatiotemporal feature learning with a CNN for facial image identification, and provided details on how well the CNN performs when detecting anomalies in high-resolution facial images. Jiahui Huang et al. [3] introduced a novel Two-Stage (TS) end-to-end CNN model with high computational efficiency and accuracy for human activity identification.
Ronald Mutageki et al. [4] proposed a CNN-LSTM-based technique for recognizing human activities and compared its results with those of other machine learning techniques. Munzner S. et al. [5] described deep learning, CNN-based sensor fusion approaches for multimodal human activity detection, together with certain normalization strategies. The fusion approaches were used to analyze the datasets used in the training phase. Kun Xia et al. discussed an LSTM-CNN architecture for Human Activity Recognition (HAR) that includes convolution layers with batch normalization. Their findings demonstrated that the suggested model is more accurate than alternative models. Sojeong et al. [7] explored convolutional neural networks with gyroscope and multiple accelerometer sensors for recognizing human activities, and also covered the CNN’s multimodal features and pooling processes. Ming Zeng et al. [8] applied a CNN to mobile sensor data for human activity identification, capturing the local dependence and scale invariance familiar from image recognition.
Heeryan Cho et al. [9] proposed a one-dimensional (1D) CNN model for HAR. To improve activity recognition, test data sharpening is introduced for more precise prediction. Ninad Thorat et al. [10] concentrated on deep learning-based human activity identification and tested performance with CNN and LSTM for improved predictions. S. K. Challa et al. [11] and N. Dua et al. [12] proposed a multi-branch CNN model for HAR using wearable sensors. It uses network models to understand how people behave, with deep learning techniques employed particularly for feature extraction. Xi Ouyang et al. [13] and Samundra Deep et al. [14] explored human activity recognition using a multi-task learning architecture based on CNN and LSTM, in which Multi-Task Learning (MTL) is used to extract the sequential characteristics in the videos. M. N. Dar et al. [15] described LSTM- and CNN-based models for psychological analysis, which are useful for ECG and GSR time-series data in hospitals. Numerous algorithms similar to those above have previously been presented for hospital monitoring, video processing, health condition monitoring, and human activity monitoring, using techniques such as image processing, object detection, and recognition. Following the discussion above, this work integrates object identification and recognition, video processing, Human Activity Recognition (HAR), and surveillance monitoring, and introduces the Patient Activity Monitoring System (PAMS).
S. Wang et al. [24] designed an automated method for segmenting lesions in endoscopy images of the gastrointestinal tract. Their Multi-Scale Context-Guided Deep Network (MCNet) achieved a mean intersection over union of 74% and 85% on two datasets. Zhang et al. [25] performed early clinical diagnosis of atrophic gastritis with an improved DenseNet that reached an accuracy of 98.63%; combined with serological indicators, the accuracy rose to 99.25%. X. Cheng [26] surveyed applications of 5G IoT systems in the context of deep learning technology, covering IoT devices, wearable electronics, and 5G telecommunication solutions. F. Zhang et al. [27] focused on anomaly detection on attributed networks and proposed a Deep Dual Support Vector Data Description-based Autoencoder (Dual-SVDAE) that uses structure and attribute autoencoders to perform anomaly detection. As an extension, the authors fused the autoencoders with embedding concepts to generate node attributes. Abbas et al. [28] performed Patient Activity Recognition using space-time templates, combining Motion-Density Images, Motion-History Images, and Linear Discriminant Analysis to classify activities. Compared with SVM, the proposed system attained a classification accuracy of 97.9%. M. Fasko et al. [29] carried out exploratory work on activity recognition using computer vision, OpenPose, and TensorFlow. W. Duan et al. [30] reviewed emerging technologies for 5G-IoV networks in terms of applications, trends, and future opportunities, with a focus on Software Defined Networking (SDN) and Multi-Access Edge Computing (MAEC).
With ongoing hardware and software development, healthcare monitoring has received considerable attention, and A. Rghioui et al. [31] dealt with connecting sensors to the internet over Wi-Fi. G. Santarsiero et al. [32] presented an application of machine learning to the structural health of columns and joints, where machine learning models reduce the computational effort compared with numerical models. Diabetic patients require systematic monitoring of their health conditions and reporting to doctors, and A. Rghioui et al. [33] introduced a machine learning algorithm using NodeMCU for this purpose. Many educational institutions maintain daily student attendance records, and D. S., V. A. Devi, et al. [34] built such a system based on face detection and recognition algorithms. A common problem with recent technology is managing and evaluating records without human interaction: most systems can predict a cause but wait until the user reports before analyzing the data. H. W. Ahmed [35] and the authors of [36] created an environment focused on pollution issues in the country.
The authors of [37] discussed the transmission welding process, whose main advantage is non-contact, precise energy deposition, and the application of deep learning algorithms to the required system. Traffic is common and accidents are comparatively rare, yet many countries report a large number of accidents; to address this, A. Manikandan et al. [38] introduced a deep learning module that monitors traffic and reports to the relevant departments. Patient monitoring systems are becoming common, and reporting is being extended according to the patient’s disability; in this context, S. S. Sarmah [39] and P. Rajan Jeyaraj et al. [40] addressed heart disease prediction systems. F. Serpush et al. [41] made a detailed review of HAR in healthcare systems along different dimensions, including activity recognition components and the types of architectures proposed for the Smart Health Care System (SHCS). G. Madhu et al. [42] developed a Deep Siamese Capsule Network (D-SCN) to detect and classify malaria in thin blood smears and attained 97.24% classification accuracy using a Lorentz similarity metric. S. Raheja et al. [43] proposed a diffusion prediction model for the coronavirus pandemic. H. Shrestha [44] focused on a Deep Learning-based Convolutional Neural Network (DL-CNN) for analyzing medical images and achieved an accuracy of 98%. M. Thapliyal et al. [45] designed an Intelligent Tutoring System (ITS) for the learning environment and evaluated it using fuzzy sets. E. Abad-Segura et al. [46] discussed healthy nutrition education for women, infants, and children.
Limitation & motivation
The discussion above shows that no single, effective approach for identifying human activity meets the needs of all users and applications. Furthermore, current object identification technologies are inefficient in terms of scalability and ease of deployment. In addition, because of the abundance of available data, current research projects attempt to create and deploy patient activity detection systems that are more efficient than before in order to address a variety of problems and application scenarios. Consequently, this research proposes a hybrid deep learning model (CNN-LSTM) as an efficient and effective model for patient activity recognition.
Proposed method
The first step of the proposed model is the creation of data points: as Fig. 4 shows, data is collected and stored in a database. The data is produced by many sources and sent to the database. In some cases it is held in data repositories; in others it is collected manually or programmatically and then placed in centralized cloud storage. Samples, also known as data points or observations, are the fundamental building blocks of a dataset. Each data point represents one system unit used to build the training data. A data point can identify patient information based on the illnesses found in the data sample. Access to data points, whether labelled or not, has been growing quickly in the healthcare industry. Labelled data includes a pair of attributes, the label and the response/output. The label indicates the data class, such as normal or abnormal, or male or female. In this study, a number of labels are used to categorize the patient activities. Patient Data is denoted PD.

Fig. 4. Patient Monitoring System (PMS) Model.
ML approaches gather data and datasets from numerous sources in order to construct an analytical model. The acquired datasets are initially kept in the centralized cloud. Samples and observations make up the fundamental building blocks of the datasets, and the training datasets are created from the specified system units. A data point represents a sample, an observation, or an input. The complete data needed to create the ML model is shown in Fig. 5. For patients, the centralized cloud maintains data from hospital admission through discharge, along with other medical track records. This includes data from ambulances, hospitals, intensive care units, Electronic Health Records (EHRs), Electronic Medical Records (EMRs), patient data monitoring, additional medication track records, and so on. The terms used in this work are listed in Table 1 for the reader’s convenience.

Fig. 5. General Method of MLA.
Table 1. The terminology used in the study
Many data points are now collected from many healthcare institutions. Data points fall into two categories: labelled and unlabelled. A label may also be called a feature, output, response, or dependent variable, and it may be ordinal or categorical. An ordinal label takes its value from an ordered set of predefined values, whereas a categorical label takes its value from an unordered set of predefined values. Figure 5 shows the six phases required to solve a problem using supervised learning: data collection, data preparation, model design, model training, model testing, and performance evaluation. The data source for this study is video files that were processed and split into video segments. After the segments are divided into frames, the input images are examined to determine the activities. A suitable model is then chosen and trained with the training dataset. The trained model is assessed on the testing images, for which it predicts the action class. Training data is the information that machine learning algorithms use to predict outcomes effectively. Training sets are gathered from human specialists and analyzed. Once the datasets have been acquired, the best algorithms are selected and the outcomes are validated using ML methods. The output is also assessed with ML approaches to obtain better performance than the prior output.
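The distinction between ordinal and categorical labels can be illustrated with a minimal Python sketch. The severity levels are a hypothetical example that does not come from the study; the two activity classes are the ones named in the text. The function names are illustrative only.

```python
def encode_ordinal(value, ordered_levels):
    """Map an ordinal label to its rank in the ordered list of levels."""
    return ordered_levels.index(value)


def encode_one_hot(value, categories):
    """Map a categorical label to a one-hot vector (no order implied)."""
    return [1 if value == c else 0 for c in categories]


severity_levels = ["low", "medium", "high"]          # hypothetical ordinal label
activity_classes = ["stretching", "no stretching"]   # categorical label from the text

ordinal_code = encode_ordinal("medium", severity_levels)
one_hot_code = encode_one_hot("no stretching", activity_classes)
print(ordinal_code)
print(one_hot_code)
```

The integer rank keeps the ordering of an ordinal label, while the one-hot vector avoids suggesting an order between categorical classes.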
Unlike image processing, video processing classifies data using several frames (images) per video, so a single class is assigned to a set of video frames. Several architectures are available for processing many images at once, including Convolutional Neural Networks (CNNs) combined with LSTMs, 3D-CNN layers, and numerous more sophisticated designs. The goal of this study was to create a CNN-LSTM model to classify the provided video for PAR.
A CNN is used for Human Activity Recognition (HAR). It takes a broad view of the wearable sensor data used for PAR, which lets it recognize many patterns. For processing raw signal data and predicting activity, that is, predicting the movement of patients from wearable sensor data, the CNN is the best model to date. It automatically learns features from the wearable’s input sensor data. As shown in Fig. 6, this study employs both CNN and LSTM networks to learn the input data features used to predict the patients’ movements.

Fig. 6. Wearable sensors in Real-Time Monitoring.
In this architecture, the processing units in the lowest layer gather information from the wearable sensors to analyze basic movement features in patient activity. The processing units in the upper layers obtain high-level signals that characterize the patients’ numerous basic motions. Each layer has a number of convolution or pooling operators (mostly with different input and output sizes). A CNN is simple to understand and can extract a large number of features (see Fig. 7). Combining convolution and pooling makes it simple to obtain multiple features from diverse datasets over various time periods, and the pattern of a signal is taken into account regardless of its size or position. Multiple channels of temporal signals are used in PAR. The main obstacles in our work are using the CNN to handle the time dimension and making the CNN as scattered as possible.
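How a convolution operator followed by a pooling operator turns a raw sensor channel into a shorter feature sequence can be shown with a small pure-Python sketch. The signal values and the difference kernel are hypothetical; in a trained CNN the kernel weights are learned. As in most CNN libraries, the kernel is applied without flipping (cross-correlation).

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` (no padding) and return the feature map."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values, size):
    """Keep the maximum of each non-overlapping window of length `size`."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical single sensor channel and a simple difference kernel.
signal = [0, 1, 3, 6, 4, 2, 2, 5]
kernel = [-1, 1]  # responds to increases between neighbouring samples

feature_map = conv1d_valid(signal, kernel)
pooled = max_pool1d(feature_map, size=2)
print(feature_map)
print(pooled)
```

Pooling keeps the strongest response in each neighbourhood, which is why the detected pattern is largely insensitive to its exact position in the window.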

Fig. 7. Feature Mapping Process in CNN.
Here, the temporal signals are divided into short signal segments using the sliding window approach. Each CNN instance is a two-dimensional matrix with r raw samples and D features. The number of rows r is given by the sampling rate, and the kernel size is determined by the experimental setup. A CNN requires large training datasets because it has many parameters to learn while learning the features, so a larger number of training instances should be used.
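The sliding window segmentation described above can be sketched as follows. The window length, step size, and three-channel stream are hypothetical values chosen for illustration; the text does not state the ones used in the study.

```python
def sliding_windows(samples, window, step):
    """Split a multichannel signal into overlapping windows.

    `samples` is a list of rows, each row holding D feature values.
    Each returned window is an r x D matrix with r == window.
    """
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, step)
    ]


# Hypothetical stream: 10 time steps, D = 3 channels.
stream = [[t, t * 10, t * 100] for t in range(10)]
windows = sliding_windows(stream, window=4, step=2)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

A smaller step produces more, more heavily overlapping instances from the same recording, which is one way to enlarge the training set.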
The Ground Truth (GT) label of each matrix is used for training. The features of the input matrix are then compared against the GT label. Feature mapping in the convolutional layer is illustrated in Fig. 7. For illustration, the
In the convolutional layer, the input features are convolved with numerous kernels, whose weights must be learned through training. The variable
The feature mapping model helps reduce the dimensionality of the input data. The pooling layer then performs temporal analysis over neighbouring features using the pooling approach, as in Equation (3).
where,
The activation size of a layer is the product of its height, width, and channel count. The input layer’s shape is shown in Fig. 7 as (32, 32, 3), so its activation size is 32 × 32 × 3 = 3072. The activation sizes of the other layers are determined in the same way.
For CONV2, the activation shape is (10, 10, 16), giving 10 × 10 × 16 = 1600. The input layer only defines the format of the input image and has no learnable parameters. The CONV layer holds the learned weight matrices; its parameter count is obtained by multiplying the filter height, the filter width, and the number of filters in the preceding layer, adding one for the bias, and multiplying the result by the number of filters in the current layer. Pooling layers compute a fixed function and have no learnable parameters. A Fully Connected Layer (FCL) has learnable parameters, and more of them than any other layer: its count is (c × p) + 1 × c, where c is the number of neurons in the current layer and p is the number of neurons in the preceding layer. Table 2 lists, for each layer of the proposed CNN, the layer description, layer size (including kernel size), activation size, and number of parameters.
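The arithmetic above can be checked with a short Python sketch. The two activation shapes are the ones quoted in the text; the 5 × 5 kernel, the 8 filters, and the fully connected layer widths are hypothetical values, since the contents of Table 2 are not reproduced here. The convolution formula is the standard convention (one bias per filter), assumed to match the paper's.

```python
def activation_size(height, width, channels):
    """Number of values in a layer's output volume."""
    return height * width * channels


def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a standard 2D convolution."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(current_neurons, previous_neurons):
    """(c * p) weights plus 1 * c biases for a fully connected layer."""
    return current_neurons * previous_neurons + 1 * current_neurons


input_size = activation_size(32, 32, 3)     # input layer from the text
conv2_size = activation_size(10, 10, 16)    # CONV2 from the text
conv_count = conv_params(5, 5, 3, 8)        # hypothetical 5x5 kernel, 8 filters
dense_count = dense_params(120, 400)        # hypothetical layer widths
print(input_size, conv2_size, conv_count, dense_count)
```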
Table 2. CNN Parameters
The CNN architecture can be modified to suit the needs of the user or the application by adjusting the kernel size in any layer. The amount of data fed into the CNN is another factor. Each video here contains 20 frames, and the convolution operation is applied to all 20 frames simultaneously so that features are learned from every frame in each layer. The Keras library’s Time Distributed Layer (TDL) is used for this: it applies the convolution along the temporal dimension, that is, to each of the N frames. The convolution layer extracts the features from each input image, while the TDL preserves the temporal order of the frames. This means the model can recognize the patient’s body in the first frame and learn how it changes in later frames.
The LSTM design has proven efficient at capturing temporal information for HAR. An LSTM unit can decide whether new data should be added to the current memory or whether the existing memory should be kept. LSTM-RNNs can model long-range dynamic dependencies while avoiding the vanishing or exploding gradient problem during training. For time-series classification, the main constituents of an LSTM network are the sequence input layer, the LSTM layer, the FC layer, the softmax layer, and the classification output layer. More information is available in Rashid and Louis.
Since video data contains spatiotemporal information, Patient Activity Recognition (PAR) using LSTM-CNN is an effective classification method. The temporal characteristics are learned by the LSTM model, which has memory and a recurrent structure. As a type of Recurrent Neural Network (RNN), the LSTM can memorize and learn from a large quantity of data that is continually produced by the input source. The input images are the frames from the PAR video that are sent to the CNN, which is pre-trained on the ImageNet database. The features learned, extracted, and selected by the CNN’s convolution, pooling, and other layers are passed to the FC layer. The FC layer converts them into a linear vector, which becomes the input to the LSTM model, as seen in Fig. 8. Because video data carries this kind of information, the goal of this work was to employ the LSTM model to increase classification accuracy. For the LSTM, the raw datasets are first loaded into memory, followed by the training and test datasets, after which modelling can begin. The LSTM model is built with the Keras deep learning library, with which it can be defined, fitted, and evaluated. The model requires three-dimensional (3D) input of the form [samples, time steps, features]. The model combines two distinct components, the CNN base and the LSTM layers: with the CNN as the base model, the LSTM layer classifies the features. To increase the efficiency of the proposed CNN-LSTM model, the training output of the CNN is fed to the encoder together with the hidden state value of the LSTM. The decoder then produces the final predicted outcome.
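For reference, the gating behaviour described above is commonly written as follows. This is the standard textbook formulation of an LSTM cell, given here as an aid to the reader rather than taken from the paper; the symbols (input x_t, hidden state h_t, cell memory c_t, weight matrices W and U, biases b) are assumptions of this sketch.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate f_t controls how much of the existing memory is kept and the input gate i_t controls how much new information is added. Because the memory update is additive, gradients can flow across many time steps with less attenuation than in a plain RNN.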
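A minimal Keras sketch of such a CNN-LSTM stack is given below, assuming TensorFlow with its bundled Keras is installed. It is not the authors' model: the layer counts, filter numbers, LSTM width, and dropout rate are illustrative assumptions, the CNN is trained from scratch rather than pre-trained on ImageNet, and the encoder and decoder stage is omitted. The frame count, frame size, and two-class output follow values quoted elsewhere in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

FRAMES, HEIGHT, WIDTH, CHANNELS = 20, 250, 250, 1  # values quoted in the text
NUM_CLASSES = 2  # stretching vs. no stretching


def build_cnn_lstm():
    """Assemble a small CNN-LSTM video classifier (illustrative layer sizes)."""
    model = keras.Sequential([
        keras.Input(shape=(FRAMES, HEIGHT, WIDTH, CHANNELS)),
        # TimeDistributed applies the same CNN layer to every frame.
        layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D(2)),
        layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
        layers.TimeDistributed(layers.GlobalAveragePooling2D()),
        # The LSTM reads the per-frame feature vectors in temporal order.
        layers.LSTM(64),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_cnn_lstm()
model.summary()
```

After the time-distributed CNN, each video is a sequence of 20 feature vectors, which is the [samples, time steps, features] layout the LSTM expects. Training would then call `model.fit` on an array of shape [samples, 20, 250, 250, 1] with integer class labels.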

Fig. 8. Architecture of the Proposed CNN-LSTM.
Figure 8 depicts the CNN architecture, which consists of five distinct layers, each with a specific function. As follows:
Videos are difficult for machines to comprehend, and their dynamic nature complicates model construction for data engineers. A video is a set of images arranged in sequence, so it is handled as sets of frames, and a video classification problem is comparable to an image classification problem. In image classification, features are extracted from the images with feature extractors such as a CNN and then used for classification. Video classification adds a single step: extracting the frames from the acquired videos.
In the first step, the video file is converted into frames, and features are extracted from the frames in Python using the procedure below. Video files on disk can be read with the Python OpenCV library in a few lines of code, after which the number of frames per second to keep is configured. A function is created that reads five frames per second and saves them in a dedicated folder so they can be read later for video processing. The key steps for creating the video classification model are as follows:
First, explore and import the data, set up the video capture function with the dataset and folder names, and create the training and validation sets. The training set is used to train the model, and the validation set is used to assess the trained model. Next, frames are extracted from all the videos in the training and validation sets. The extracted training frames are pre-processed and then used to train the model, and the frames from the validation set are used to evaluate it. Once performance on the validation set is satisfactory, the trained model can be used to classify new videos.
These procedures are carried out in Python, which requires additional libraries such as Pandas and Keras. In the implementation, functions are first used to read the names of the video files for the “Train Set” and the “Test Set.” The train set is a data frame that holds the file name and the class label as separate fields. Frames are then read from each video, at a fixed number of frames per second, from the folder where the video is saved. To keep processing simple, 20 frames per second are read in this case. Each video then has its own folder containing 20 images. Next, the training and testing datasets for the deep learning-based CNN architecture are generated, along with a function that reads the input images: as described above, it reads the 20 images stored in the folder created for each video. The data for the CNN is prepared with the NumPy library. The function produces a NumPy array with the following attributes: samples, number of frames, image size, and channels. The images are converted to grayscale, normalized, and turned into tensors, so each image is 250 by 250 pixels with 1 channel. The data is then ready to be fed into the neural network (NN). With this technique, the videos are read and turned into tensors for the NN. For processing long videos and massive datasets, Keras “generators” are employed, which feed video input directly into the model during training and testing.
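The training and validation split in the steps above can be sketched as follows. The file names, the 20% validation fraction, and the fixed seed are hypothetical; the text does not state how the study divided its videos.

```python
import random


def split_videos(video_names, validation_fraction=0.2, seed=0):
    """Shuffle video names reproducibly and split them into train and validation lists."""
    names = list(video_names)
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * validation_fraction)
    return names[n_val:], names[:n_val]


# Hypothetical file names; the study's real file list is not given in the text.
videos = [f"clip_{i:03d}.mp4" for i in range(10)]
train_set, validation_set = split_videos(videos)
print(len(train_set), len(validation_set))
```

Splitting by video before extracting frames keeps frames from the same clip out of both sets, which would otherwise tend to inflate validation accuracy.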
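Two of the pre-processing steps described above, choosing which frames to keep and converting a frame to a normalized single-channel image, can be sketched in plain Python. The clip length, frame rates, and the tiny 2 × 2 frame are hypothetical, and the grayscale weights are the common ITU-R BT.601 luma coefficients, which the text does not specify. In practice the frames would be decoded with OpenCV and stored in NumPy arrays; that part is omitted here.

```python
def frames_to_keep(total_frames, video_fps, target_fps):
    """Return indices of frames to save so that about `target_fps` are kept per second."""
    if video_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    step = max(1, round(video_fps / target_fps))
    return list(range(0, total_frames, step))


def to_normalized_gray(rgb_frame):
    """Convert a frame of (R, G, B) pixels in 0..255 to one gray channel in 0..1."""
    return [
        [[(0.299 * r + 0.587 * g + 0.114 * b) / 255.0] for (r, g, b) in row]
        for row in rgb_frame
    ]


# Hypothetical clip: 2 seconds recorded at 30 frames per second, sampled at 5 per second.
indices = frames_to_keep(total_frames=60, video_fps=30, target_fps=5)

# Hypothetical 2 x 2 frame; the frames in the text are 250 x 250.
frame = [
    [(0, 0, 0), (255, 255, 255)],
    [(255, 0, 0), (0, 0, 255)],
]
gray = to_normalized_gray(frame)
shape = (len(gray), len(gray[0]), len(gray[0][0]))
print(indices)
print(shape)
```

Stacking the processed frames of every video would give an array whose axes are samples, frames, height, width, and channels.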
Analysis
ReLU and normalization are optional in the first, second, third, and fourth CNN layers. Integrating the ReLU and normalization layers improves the classification accuracy. Dropout is also used to reduce overfitting. To regularize the whole process, the other CNN settings are examined as well, even though they leave little room for variation in the results. A CNN on its own does not provide a single function that handles both image and video processing, so the CNN outputs are passed to the LSTM to increase accuracy. All parameters in the CNN, including those that are initially unknown, are represented as weights so that they can be optimized over large amounts of data. This optimization procedure is known as back-propagation. With it, the gradient of the cost with respect to the CNN parameters can be computed. Stochastic Gradient Descent (SGD) is recommended for updating the parameters because it speeds up execution. By comparing the predicted labels with the actual values, the CNN in this study tries to improve classification accuracy. The final CNN output is fed to the LSTM.
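The SGD parameter update can be illustrated on a toy problem. The sketch below fits a line with mini-batch SGD on a mean-squared-error cost; the model, the data, the learning rate, and the batch size are illustrative assumptions and are unrelated to the CNN used in the study. It shows only the general update rule, in which each parameter moves against the gradient of the cost computed on a small random batch.

```python
import random

def sgd_fit(xs, ys, lr=0.05, epochs=300, batch_size=4, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # new random batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the batch mean squared error with respect to w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # SGD update: step against the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data generated from y = 3x + 1 (hypothetical, noise-free).
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_fit(xs, ys)
print(round(w, 3), round(b, 3))
```

Because each update uses only a few samples, one pass over the data performs many cheap updates, which is the source of the speed-up mentioned above.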
The Dataset
The experiment verifies the performance of the proposed CNN-LSTM using data gathered from diverse datasets. Real-time data are gathered by attaching medical IoT devices to a person’s body and monitoring their behaviour. Four separate benchmark datasets are used to test the proposed CNN-LSTM: Active Miles [16], WISDM-V1.1 [17], DAPHNE-FOG [18], and SKODA [19]. Because the video footage varies in length, it is challenging to discern the activity. The footage was acquired by having individuals carry out different actions, such as walking, standing, and sitting. More than 500 video files were gathered from the four datasets. Each video was reviewed manually, and the best ones were chosen. The videos range from 10 to 15 minutes in length. The actor in the videos performs ten stretches over the course of ten separate clips. The videos include both stretching and non-stretching body movements.
Experimental Results and Discussion
Training is stopped through the Keras callback mechanism after 10 epochs with a batch size of 10. If the desired result is not achieved, the number of epochs is increased. Figure 9 shows the accuracy and loss for this design. The second model is then compared with the first. It is somewhat more sophisticated than the earlier model and contains the following layers:
Convolutional layers, 4 max-pooling layers, LSTM layer

Loss & Accuracy Calculation for Configuration-1.
The LSTM layer has 64 units and is followed by 4 dense, or fully connected (FC), layers, the last of which is the output layer. Figure 10 shows the accuracy and loss of the training and testing phases for this configuration.

Loss & Accuracy Calculation for Configuration-2.
For the proposed architecture, 20 epochs are used, with 10 as the best batch size and “Adagrad” as the best-performing optimizer. Accuracy is checked through the callback provided by the Keras library. The accuracy of the training and testing phases for this configuration is shown in Fig. 11. Finally, predictions are produced according to the input time slots of the input videos, and Fig. 12 shows the corresponding result.
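For readers unfamiliar with Adagrad, its update rule can be sketched in a few lines. The function name `adagrad_minimize`, the quadratic test function, and the hyperparameters below are illustrative assumptions; in the study the optimizer is supplied by Keras rather than written by hand. Adagrad accumulates the squared gradient of each parameter and divides the step by the square root of that sum, so parameters with consistently large gradients take smaller steps.

```python
import math

def adagrad_minimize(grad_fn, theta, lr=0.5, steps=500, eps=1e-8):
    """Minimize a function with Adagrad, given a function returning its gradient."""
    theta = list(theta)
    accum = [0.0] * len(theta)                 # running sum of squared gradients
    for _ in range(steps):
        grads = grad_fn(theta)
        for i, g in enumerate(grads):
            accum[i] += g * g
            # Per-parameter step, scaled by the accumulated gradient magnitude.
            theta[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return theta

def grad(theta):
    """Gradient of the toy cost f(x, y) = (x - 2)^2 + 10 * (y + 1)^2."""
    x, y = theta
    return [2 * (x - 2), 20 * (y + 1)]

result = adagrad_minimize(grad, [0.0, 0.0])
print([round(v, 4) for v in result])
```

The two coordinates of the toy cost have very different curvature, which is the situation where a per-parameter step size helps most.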

Loss & Accuracy Calculation for Configuration-3.

Cost Estimation for Prediction.
The proposed HDLF is assessed through experiments on the datasets listed in Table 3. To show the effectiveness of the proposed model, the classification results obtained from the images are compared with other state-of-the-art methods. In total, 2500 usable frames are obtained from the video clips. Although many other types of activity exist, this research compares only six of them with other models. Examples of activities are shown in Fig. 13. The test is performed in Python using a variety of library functions.
Data description

Actions Recognized.
The complete dataset is divided into two parts, one for training the model and one for testing it. The CNN-LSTM model is trained on the training dataset, and the testing procedure is carried out using the classified results. A confusion matrix is created to verify performance and is shown in Table 4. It illustrates CNN-LSTM patient action recognition and contrasts it with alternative approaches. The findings demonstrate that the patient activities are correctly classified.
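How a confusion matrix is built from true and predicted labels can be shown with a minimal sketch. The labels and predictions below are hypothetical and are not taken from Table 4; the function name `confusion_matrix` is likewise chosen only for this example.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix whose rows are true classes and whose columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix

# Hypothetical activity labels and predictions.
labels = ["walking", "standing", "sitting"]
y_true = ["walking", "walking", "standing", "sitting", "sitting", "standing"]
y_pred = ["walking", "standing", "standing", "sitting", "sitting", "standing"]

cm = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, cm):
    print(label, row)
# walking [1, 1, 0]
# standing [0, 2, 0]
# sitting [0, 0, 2]
```

Entries on the diagonal count correctly classified samples, and every off-diagonal entry identifies a specific pair of classes that the model confuses.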
Confusion Matrix
Unlike traditional classifiers, the CNN-LSTM learns and extracts all of the characteristics that are present in the data. Using a larger number of layers, the complete image information, and feature extraction, the CNN and LSTM learn the datasets from various angles. The performance of the other techniques depends on a larger number of features and on wrapper-based feature selection. By contrast, the CNN-LSTM extracts temporal features, as shown in Table 5. The state-of-the-art techniques are compared in terms of feature type, number of features, and classifier type.
Classifier Comparison with the Existing Models
Table 6 compares the accuracy attained by several traditional methods on various datasets, along with the time, measured in seconds, required for training and testing on each dataset. In this comparison, the proposed CNN-LSTM model outperformed the other models in terms of accuracy. Additionally, the CNN-LSTM’s performance is assessed by calculating and comparing the sensitivity, specificity, and accuracy for each of the four datasets separately; the results are given in Table 7. The average CNN-LSTM accuracy is higher on the Daphne-FoG dataset than on the other datasets.
Accuracy Comparison Regarding Dataset
Performance Evaluation Regarding Accuracy
Performance parameters, including sensitivity, specificity, and accuracy, are calculated from the classification results using the TP, TN, FP, and FN values. The dataset-specific findings are compared with those produced by the NN techniques currently in use, as described in [20–23]. Figure 14 compares performance in terms of accuracy and shows that the proposed CNN-LSTM model achieved the highest accuracy of the methods compared. The hybrid deep learning model’s architecture is set up in two different ways to ensure a successful training procedure. Two max-pool layers are added after the LSTM layer in order to learn the data progressively. The output layer is thus the third of the three FC layers. Figure 14 shows the experiment’s results, with an accuracy rate of 99 percent.

Accuracy comparison.
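The performance parameters named above follow directly from the TP, TN, FP, and FN counts. The sketch below uses the standard definitions; the counts passed to the function are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, and accuracy from confusion counts."""
    total = tp + tn + fp + fn
    return {
        # Sensitivity (recall): share of actual positives that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: share of actual negatives that were rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # Accuracy: share of all samples classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
    }

# Hypothetical counts for one activity class.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(metrics)
```

For a multi-class problem such as activity recognition, these counts are typically computed per class in a one-versus-rest fashion and then averaged.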
The highest probability is attained in this model since it can identify moving humans. For the first three test videos, the model determines with 100 percent accuracy when the body is stretching and when it is not. Because of the confusion between stretching and bending, the final two videos reach only 65% precision.
The proposed system achieved an accuracy of 99.53 percent, which is more than 4 percent higher than recent real-time models (see Table 8).
Performance balance with existing model
These findings and the comparison with other approaches demonstrate that the proposed model is more effective than the alternatives. However, more data are needed to distinguish reliably between random movement and stretching, so additional data will be added in the future to increase accuracy.
The objective of this research is to monitor patient activity continuously so that medication can be provided as soon as it is needed. This monitoring is performed with multiple wearable sensors and is especially important when the patient is alone during treatment or the accompanying person is not nearby. This paper aims to provide an automatic healthcare monitoring system for monitoring patients’ health conditions. In the past decade, caregivers such as nurses were appointed to monitor patients, but this was costly and time-consuming. Several earlier methods have been proposed for monitoring patients’ health conditions and other activities. A few of them use medical IoT devices for surveillance monitoring in hospitals. Surveillance monitoring only records the activities happening in a particular location and sends them to the admin system; it cannot send an alert to the responsible person (i.e., the doctor) so that the patient is treated immediately. It is also necessary to understand the actual health problem occurring in the patient’s body. Hence, in recent times, body-worn sensor IoT devices and medical IoT devices have been attached to the patient’s body to report the health condition as numbers and signals. Understanding the problem from the raw data alone is not easy, so the data generated by the medical IoT devices are analyzed using data mining methods. A Convolution Neural Network incorporated with Long-Short Term Memory (CNN-LSTM), an HDLF, was thus proposed in this study for a PAMS that brings together all healthcare activities with their classes. The proposed framework is implemented in Python, and its usefulness in assessing the system is confirmed. When the proposed CNN-LSTM is compared with selected existing algorithms, the new technique is found to have a 99.53 percent accuracy rate. The difference between this accuracy result and the current value is at least 4.73 percent.
Future Work
In the future, the IoT camera will be connected directly to the administrator, and any unusual behaviour will immediately trigger an alarm. Future work will incorporate not only cameras but also various other IoT devices into the experiment, and the practical implications will need to be verified. More deep learning algorithms will be applied to the data generated by the medical IoT devices in order to choose the best-performing one. The experiment also needs to be carried out in real time with an embedded circuit to verify real-time medical data generation and its complications.
