A unified model of video-based human action categorization using Chaotic Quantum Swarm Intelligence on Intuitionistic fuzzy 3D Convolution Neural Network

Abstract

Human action categorization in videos has become a prominent pattern recognition task in contemporary computer vision surveillance systems. Most human actions carry complex temporal information, and it is difficult to recognize diverse human activities precisely under an unpredictable variety of environmental circumstances. A deep learning paradigm can tackle this issue by providing additional capabilities for vision-based human action recognition. However, extracting spatio-temporal features poses further challenges, for instance, the presence of noise in videos and highly vague feature points. This paper proposes a hybrid intelligent Intuitionistic Fuzzy 3D Convolutional Neural Network that uses Chaotic Quantum Swarm Intelligence (CQSI-IFCNN) to optimize video-based human action categorization. Vagueness and ambiguity in the input video frames are handled by the intuitionistic fuzzy representation in terms of membership, hesitation and non-membership components. By applying Chaotic Quantum Swarm Intelligence (CQSI), the learning parameters and error rates of a standard convolutional neural network are considerably reduced. The chaotic searching scheme is applied to overcome premature convergence to local optima in Quantum Swarm Intelligence. The model therefore produces optimized outcomes in the Intuitionistic Fuzzy 3D Convolutional Neural Network, improving the categorization of human actions in videos. The performance of CQSI-IFCNN is assessed using the KTH and UCF Sports Action datasets. The simulation outcomes show that CQSI-IFCNN attains higher action categorization accuracy than standard CNN and PSO-CNN.

Keywords

Human action categorization, deep learning, intuitionistic fuzzy, Chaotic Quantum Swarm Intelligence, 3D Convolutional Neural Network

1. Introduction

Recognition of human action is one of the toughest tasks in computer vision. In video surveillance in particular, recognizing semantic spatio-temporal visual patterns (running, walking, jumping, etc.) is a primary problem. A human action recognition system begins by extracting significant features from a given video; using a pattern recognition approach, the video is then processed and classified into its action category. Feature extraction based on deep learning [1] is essential in several real-world applications such as kinematic analysis, criminal investigation, video retrieval and surveillance. This approach learns features directly from the selected videos. Compared to classification of still images, video-based recognition offers significant clues for detecting actions, because its temporal component provides motion information.

Fusing the Intuitionistic Fuzzy representation and Chaotic Quantum Swarm Intelligence with a 3D convolutional neural network directly obtains spatio-temporal feature information and thus enriches the process of human action categorization in videos beyond what existing approaches achieve.

In video-based human action categorization, the input given to the convolution neural network will be composed of video frames.

1.1 Quantum particle swarm optimization

Quantum Particle Swarm Optimization (QPSO) is a variant of conventional PSO, inspired by the basic principles of PSO [2] and quantum mechanics. Its distinguishing features are fast convergence and good searching capability. QPSO differs from conventional PSO in that it follows the uncertainty principle, under which the position and velocity of a particle cannot both be determined at the same time [3]. In the proposed approach, among $Z$ particles, the position of the $j^{th}$ particle ( $pl_{j}$ ) in M-dimensional space at iteration ( $t+1$ ) is updated as follows:

$\displaystyle\mu best=\frac{1}{Z}\sum^{Z}_{j=1}{Pl_{j}}$ (1)

$\displaystyle Pl_{j}=\tau Pl_{pbest}+(1-\tau)Pl_{gbest}$ (2)

If rnd() $>$ 0.5, then

$\displaystyle Xpos_{id}(t+1)=Pl_{id}-\sigma|\mu best-Xpos(t)|\ln\frac{1}{y}$ (3)

else

$\displaystyle Xpos_{id}(t+1)=Pl_{id}+\sigma|\mu best-Xpos(t)|\ln\frac{1}{y}$ (4)

where $\mu best$ represents the mean best position, i.e., the mean of the personal best positions ( $P_{best}$ ) of all particles in the solution space. The contraction-expansion coefficient is denoted by $\sigma$, and $\tau$ and $y$ are two random numbers. All three values lie in [0, 1]. $P_{best}$ (written $Pl_{pbest}$ in Eq. (2)) is the personal best position of each particle and $G_{best}$ (written $Pl_{gbest}$) is the global best position of the whole swarm.
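The update in Eqs (1)-(4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this work: the function name `qpso_step`, the array shapes and the use of `numpy.random.Generator` are assumptions made for the example, and $y$ is drawn from (0, 1] so that $\ln(1/y)$ stays finite.

```python
import numpy as np

def qpso_step(x, pbest, gbest, sigma, rng):
    """One QPSO position update for a swarm x of shape (Z, M).

    pbest has shape (Z, M), gbest has shape (M,); sigma is the
    contraction-expansion coefficient. All names are illustrative.
    """
    z, m = x.shape
    mbest = pbest.mean(axis=0)                       # Eq. (1): mean best position
    tau = rng.random((z, m))
    attractor = tau * pbest + (1.0 - tau) * gbest    # Eq. (2): local attractor
    y = 1.0 - rng.random((z, m))                     # in (0, 1], avoids ln(1/0)
    step = sigma * np.abs(mbest - x) * np.log(1.0 / y)
    # Eq. (3) when rnd() > 0.5 (minus sign), Eq. (4) otherwise (plus sign)
    sign = np.where(rng.random((z, m)) > 0.5, -1.0, 1.0)
    return attractor + sign * step
```

Note that no velocity term appears: each new position is sampled around the attractor, with a spread governed by the distance to the mean best position.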

Three main aspects make QPSO better than traditional PSO:

It is an uncertain scheme: more diverse particle states and a larger search space can be produced, so that it can generate better offspring.

In conventional PSO, convergence is fast and easily leads to local optima within very few iterations. Once an approach falls into a local optimum that quickly, it fails to obtain the best solution. In QPSO, introducing $\mu\textit{best}$ (mean best) improves the average error rate of convergence, because particles cannot converge quickly without considering their neighbouring particles. Thus, QPSO falls into local optima less frequently than conventional PSO.

Finally, QPSO involves far fewer parameters than conventional PSO, which makes it easier to run and significantly improves its performance.

1.2 Chaos theory

Chaos theory is a field of mathematics with applications in philosophy, economics, physics and biology [4]. A chaotic system is highly sensitive to initial conditions (for example, to rounding errors in numerical computation); this is referred to as the butterfly effect, in which a slight change in one state of a deterministic nonlinear system leads to a large difference in a later state [5]. A chaotic variable can traverse the entire search space of interest. Even though its variation resembles disorder, it follows a delicate inherent rule. Therefore, in the present work, a chaotic Quantum PSO algorithm exploits the irregularity and periodicity of chaotic variables to set the particles and the parameters used by QPSO. Chaotic maps contribute greatly to preventing premature convergence in PSO. In addition, they can improve the local and global search abilities of the system [6].

Chaotic maps in PSO can be implemented by two different approaches. They are:

1.
The performance of PSO depends greatly on its parameters, such as w, c1 and c2, and on its random number generators. In the first approach, chaotic maps take the place of the random number generators and these variables, and the algorithm itself is not fundamentally modified.
2.
In the second approach, chaos is used to explore the solution space in interaction with PSO. Additional evaluations are performed to find the candidate solution with the best fitness value. In each iteration, the boundary of the search space is reduced considerably to make the search more effective. Thus, the particles are kept away from local optima. This kind of search is termed chaos search.

Owing to its effective performance, the proposed work adopts the first approach for optimal human action categorization using Intuitionistic Fuzzy Convolution Neural Networks.

2. Related work

Song et al. [7] developed a CNN model as a smart phone application, to recognize the facial expression. Using this approach, the user could capture the picture and transfer it to the server. The facial expression prediction process was done by the network and the result was passed on to the user. With the help of ImageNet dataset, the model trained the network to recognize facial expression, within a couple of hours.

Bregonzio et al. [8] developed a multiple kernel learning approach that fused features of interest points. It used global distribution evidence of the extracted interest points to recognize actions in videos, and extracted holistic features from temporal features.

Chen and Hauptmann [9] devised an approach known as MoSIFT to find spatial interest points. The SIFT algorithm was adapted to visually distinctive regions, and spatio-temporal interest points were detected under motion constraints, which required a sufficient amount of optical flow around the distinctive points.

Baccouche et al. [10] proposed a fully automated deep learning system, which extended CNN to 3D CNN so that it could learn automatically the spatio-temporal features, to classify human actions. With the learned features, recurrent neural network was applied to perform classification on each video sequence.

Juan et al. [11], in their work, proposed probabilistic latent semantic analysis, which handled noise present in the feature points. Noise arose due to moving cameras and dynamic background of action videos. In this approach, given a video sequence, this algorithm localized the actions of humans and categorized them.

Gilbert et al. [12], in their work, mined dense temporal features, which were grouped in hierarchical manner, to generate a complete feature set. By applying data mining approach, frequent patterns of features were discovered. Using hierarchical classifier, real time video action was recognized.

Haiam et al. [13] developed a trajectory-based representation, which captured spatio and temporal feature points termed as cuboid features, using SIFT descriptors on video frames. Using linking and exploring method, Bag-of-Words was designed. At last, to classify human actions, Support Vector Machine was applied.

Sun et al. [14] in their work, extracted significant trajectories with long duration, using SIFT descriptor and KLT tracker. With image structure, its salient features were sampled in a random manner for tracking actions. They used spatio-temporal information of video frames, to perform activity recognition.

Raptis and Soatto [15] developed a trajectory descriptor, using the statistics of spatio-temporal features of image. The final descriptor was used in video analysis and action recognition.

Figure 1.

Overall Framework of Human action categorization using Chaotic Quantum Swarm Intelligence based Intuitionistic fuzzy 3D – Convolutional Neural Network.

Ignatov [16] presented a deep learning approach for user-independent online human activity classification. Local features were extracted with a CNN and combined with simple statistical features that preserved information about the global form of the time series. The author also investigated the length of the time series (which might influence accuracy) and limited it, which made continuous activity classification possible.

Li et al. [17] applied unsupervised feature extraction using a deep Boltzmann machine. It was not intended to capture local features of the data, but its performance was superior to that of existing hand-crafted solutions.

Kumaravel and Veni [18] developed a Multi-Object Categorization Model in Scenes, using Fuzzy Swarm Intelligence technique, in Deep Belief Neural Network model. The element of Uncertainty was better handled by using the Fuzzy technique.

Kwapisz et al. [19] performed activity recognition using a phone-based accelerometer. They designed a customized smart phone application and acquired data through an accelerometer embedded within it. The system had the ability to recognize various actions.

The aforementioned works on human activity recognition, however, have not established any method to handle vagueness and imprecision in video-based human action categorization. The significance of the proposed approach lies in its use of the optimal depth knowledge of a convolutional neural network on dynamic images for action categorization. Fusing the Intuitionistic Fuzzy representation and Chaotic Quantum Swarm Intelligence with a 3D convolutional neural network directly obtains spatio-temporal feature information, thus enriching the process of human action categorization in videos beyond existing approaches.

3. Methodology

The proposed model, the Chaotic Quantum Swarm Intelligence fused Convolutional Neural Network for human action categorization in videos (CQSI-CNNHAC), is shown in Fig. 1.

This work presents a hybrid intelligent approach to human action categorization in videos, based on a Chaotic Quantum Swarm Intelligence-based Intuitionistic Fuzzy 3D Convolutional Neural Network. Videos are drawn from two widely used datasets, KTH [20] and UCF Sports [21]. After the video data are acquired, the frames are enhanced by representing them in the Intuitionistic Fuzzy domain, where each pixel is represented by three components, namely the degree of membership ( $\mu$ ), hesitation ( $\pi$ ) and non-membership ( $\nu$ ), so that noise in the raw video is suppressed and feature point extraction is improved. Once the video frames are passed into the 3D convolutional neural network, dynamic object features are evaluated over both spatial and temporal extents. A 3D convolution is obtained by applying a three-dimensional kernel to the cube formed by stacking multiple adjacent frames. In the convolution layers, the feature maps are connected to multiple contiguous frames in the preceding layer, thereby capturing information associated with human action.

Figure 2.

Layers in a 3D – CNN.

In a dynamic environment, temporal information is obtained by applying frame differencing to the 3D motion cuboid, which reveals the motion in frames corresponding to moving objects. By subtracting the previous frame $F_{t}$ from the current frame $F_{t+1}$ at the pixel level, the absolute temporal difference between consecutive frames is obtained:

$\displaystyle\text{3D motion detection}=\text{absdiff}(F_{t},F_{t+1})=|F_{t+1}-F_{t}|$ (5)
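As an illustration of Eq. (5), the following minimal sketch computes the absolute frame difference over a stack of grayscale frames with NumPy. The function name `motion_cuboid` and the `(T, H, W)` `uint8` layout are assumptions made for the example; the simulation in this work was carried out in Matlab.

```python
import numpy as np

def motion_cuboid(frames):
    """Absolute difference between consecutive frames.

    frames: uint8 array of shape (T, H, W); returns shape (T-1, H, W).
    """
    f = frames.astype(np.int16)          # widen so subtraction cannot wrap around
    return np.abs(f[1:] - f[:-1]).astype(np.uint8)
```

Widening to a signed type before subtracting matters: subtracting `uint8` arrays directly would wrap around instead of producing the absolute difference.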

After the temporal evidence is retrieved, the 3D convolution kernel extracts features from the cuboid patch. As the number of layers increases, the number of feature maps also grows, and consequently the different map levels give rise to more feature types. The value at a specific position ( $x, y, z$ ) on feature map $k$ in the $j^{th}$ layer is given as:

$\displaystyle\textit{Vec}_{j,k}^{x,y,z}=\tanh\left(e_{j,k}+\sum_{l}\sum_{d=0}^{D_{i}-1}\sum_{e=0}^{E_{i}-1}\sum_{f=0}^{F_{i}-1}wt_{jkl}^{\textit{def}}\,\textit{vec}_{(j-1)l}^{(x+d)(y+e)(z+f)}\right)$ (6)

where $e_{j,k}$ is the bias of feature map $k$ in layer $j$, $l$ indexes the feature maps of the preceding layer that are connected to the current feature map, $D_{i}$ and $E_{i}$ are the spatial height and width of the 3D kernel, $F_{i}$ is the size of the kernel along the temporal dimension, and ${wt}^{\textit{def}}_{jkl}$ is the weight at position ( $d, e, f$ ) of the $k^{th}$ kernel in the $j^{th}$ layer, connected to the $l^{th}$ feature map of the preceding layer.
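To make the indexing in Eq. (6) concrete, the sketch below evaluates a single output value of a 3D convolution with NumPy. It assumes that $e_{j,k}$ acts as an additive bias and that the outer sum runs over the feature maps of the preceding layer; the function name and array layouts are illustrative only.

```python
import numpy as np

def conv3d_value(prev_maps, weights, bias, x, y, z):
    """Value at position (x, y, z) of one feature map, following Eq. (6).

    prev_maps: array (L, X, Y, Z), the L feature maps of layer j-1.
    weights:   array (L, D, E, F), one 3D kernel slice per input map.
    bias:      scalar bias of the current feature map (assumed role of e_{j,k}).
    """
    _, d, e, f = weights.shape
    # Cuboid patch starting at (x, y, z) in every input feature map
    patch = prev_maps[:, x:x + d, y:y + e, z:z + f]
    return np.tanh(bias + np.sum(weights * patch))
```

Sliding `(x, y, z)` over all valid positions yields the full feature map; a practical implementation would use a vectorized convolution rather than this per-position form.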

The receptive field (RF), stride length (SL), zero padding (ZP) and input volume (wt, ht, dm) parameters are used to compute the spatial size of the output of the Intuitionistic Fuzzy 3D Convolution Network (Fig. 2). The number of neurons along one spatial dimension of the convolution layer is computed by:

$\displaystyle((wt-\textit{Inp\_F}+2*\textit{ZP})/\textit{SL})+1.$ (7)

For the first convolution layer of the 3D CNN, this gives $\frac{120-11+2\times 0}{1}+1=110$, i.e., an output volume of 110 $\times$ 110 $\times$ 32. Here wt and ht are the width and height of the input frame, both equal to 120; the filter volume is 11 $\times$ 11 $\times$ 32, where $\textit{Inp\_F}=11$ is the filter (receptive field) size and 32 is the filter depth; the zero padding ZP is 0 and the stride length SL is 1.
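The arithmetic of Eq. (7) can be checked with a small helper. This is only a sketch; the function name `conv_output_size` and its argument names are chosen for the example and do not come from the original implementation.

```python
def conv_output_size(size, kernel, zero_padding=0, stride=1):
    """Spatial output size of a convolution layer, following Eq. (7)."""
    span = size - kernel + 2 * zero_padding
    if span < 0 or span % stride != 0:
        raise ValueError("kernel, padding and stride do not tile the input")
    return span // stride + 1

# Values quoted in the text: 120-pixel input, 11-pixel filter, ZP = 0, SL = 1
first_layer = conv_output_size(120, 11, zero_padding=0, stride=1)
```

With the values quoted above, `first_layer` evaluates to 110, matching the 110 $\times$ 110 spatial output stated in the text.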

Each time the weights and biases of the 3D CNN are updated, chaotic quantum swarm intelligence is applied to assign the weights of the layers. The chaotic quantum approach handles the uncertainty in selecting feature points for action detection and overcomes premature convergence to local optima. It thereby provides enriched knowledge to the CNN during the training phase, yielding more accurate output vectors than the existing approaches. Finally, the fully connected layer produces the label of the category to which the human action video belongs.

A detailed explanation of each stage is given in the following subsections:

3.1 Representation of video frames in intuitionistic fuzzy domain

Let U be the universe of discourse and W a set in U that comprises the bright pixel values. An Intuitionistic Fuzzy image is described by three components, namely the grade of membership ( $\mu$ ), indeterminacy ( $\pi$ ) and non-membership ( $\nu$ ). Each pixel of a video frame or dynamic image is converted to the form Pix( $\mu$ , $\pi$ , $\nu$ ): the degree to which the pixel is bright is $\mu$, the uncertainty about its brightness is $\pi$, and the degree to which it is not bright is $\nu$. Each of $\mu$ , $\nu$ and $\pi$ lies in the interval [0,1]. The pixel $pix(i,j)$ in a dynamic image or video frame is changed to:

$\displaystyle\text{IDP}_{\text{IFS}}=<\mu(i,j),\nu(i,j),\pi(i,j)>,$ (8)

where $\mu$ corresponds to the white set, $\pi$ denotes the indeterminate value and $\nu$ corresponds to the non-white set.
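One possible way to build the triple in Eq. (8) is sketched below. The text does not specify the membership generator, so this example makes two explicit assumptions: the membership $\mu$ is the min-max normalized gray level, and the non-membership $\nu$ is obtained from a Sugeno-type complement with a hypothetical parameter `lam`. The hesitation then follows from the standard intuitionistic fuzzy set relation $\pi = 1 - \mu - \nu$.

```python
import numpy as np

def to_intuitionistic(frame, lam=1.0):
    """Map a grayscale frame to (mu, nu, pi) components.

    Assumptions (not from the paper): min-max normalized membership and a
    Sugeno-type complement with parameter lam >= 0 for non-membership.
    """
    g = frame.astype(np.float64)
    span = g.max() - g.min()
    mu = (g - g.min()) / span if span > 0 else np.zeros_like(g)
    nu = (1.0 - mu) / (1.0 + lam * mu)       # never exceeds 1 - mu when lam >= 0
    pi = np.clip(1.0 - mu - nu, 0.0, 1.0)    # hesitation degree
    return mu, nu, pi
```

With `lam = 0` the hesitation vanishes and the representation reduces to an ordinary fuzzy set; larger values of `lam` assign more hesitation to mid-gray pixels.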

3.2 Chaotic quantum particle swarm optimization

Many variants of the PSO algorithm, i.e., swarm intelligence, are available. In the present work, chaotic quantum particle swarm optimization (CQPSO) is used to optimize the vectors of video frames processed by the convolutional neural network, in order to categorize the human activity in videos. The chaotic approach moves a large population of particles in a chaotic motion, so that optimal particles are selected by chaotic search instead of randomness. It also prevents QPSO from converging early to local optima: CQPSO replaces the random sequence with a chaotic sequence to increase the diversity of the PSO population, and thus improves the performance of the algorithm by restraining convergence to local minima.

The logistic equation of chaotic mapping is represented as follows:

$\displaystyle L_{m+1}=\nu L_{m}(1-L_{m})$ (9)

where $m$ is the iteration index and $\nu$ is a control parameter (not to be confused with the non-membership degree $\nu$ of Section 3.1). When $\nu=$ 4 [17] and $L_{1}\notin\{0.25, 0.5, 0.75\}$, the value of $L$ eventually visits every neighbourhood of the range [0,1] without settling into a short cycle. A slight difference in the initial value of $L$ leads to a large difference in its long-term behaviour, which is the elementary feature of chaos. In this work, the logistic map is:

$\displaystyle L_{m+1}=4L_{m}(1-L_{m})$ (10)
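The behaviour of Eq. (10) is easy to reproduce. The following sketch generates a chaotic sequence from the logistic map; the function name `logistic_sequence` and the rejected seed values (the fixed or eventually fixed points 0, 0.25, 0.5, 0.75 and 1) are choices made for this illustration.

```python
def logistic_sequence(l1, length, nu=4.0):
    """Return [L_1, ..., L_length] for the logistic map L_{m+1} = nu*L_m*(1-L_m)."""
    if l1 in (0.0, 0.25, 0.5, 0.75, 1.0):
        raise ValueError("seed collapses to a fixed point for nu = 4")
    seq = [l1]
    for _ in range(length - 1):
        seq.append(nu * seq[-1] * (1.0 - seq[-1]))
    return seq

# Two seeds that differ by 1e-10 drift apart after a few dozen iterations
a = logistic_sequence(0.3, 60)
b = logistic_sequence(0.3 + 1e-10, 60)
gap = max(abs(p - q) for p, q in zip(a, b))
```

The quantity `gap` illustrates the sensitivity to initial conditions described above: the two trajectories start almost identical, yet their separation grows by orders of magnitude within the 60 iterations.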

The procedure of CQPSO is defined as follows:

Set the particle population size and the maximum number of iterations.

Set particles’ velocities and positions by chaos theory.

Compute each particle’s fitness value according to:

$\displaystyle\text{Fitness}=\frac{1}{m\times n}\sum^{m-1}_{i=0}\sum^{n-1}_{j=0}\textit{pixel\_value}(i,j)$ (11)

where $m\times n$ is the size of the video frame.

Calculate the mean best position of the particles using the formula:

$\displaystyle\mu\textit{best}_{i}=\frac{1}{Z}\sum^{Z}_{j=1}{\textit{Pl}_{id}}$ (12)

Eq. (12) is used to find the local optimum, and Eq. (13) is used for the global optimal position:

$\displaystyle\mu\textit{best}=\frac{1}{Z}\sum^{Z}_{j=1}{\textit{Pl}}^{j}_{gd}$ (13)

where $\textit{Pl}_{gd}$ is the global optimal position and ${\textit{Pl}}_{id}$ represents the local best position of an individual particle.

Select $\textit{Pl}_{id}$ and $\textit{Pl}_{gd}$ among the particles and update them.

Produce novel Positions and velocities of particles, by quantum mechanism, as below:

$\displaystyle\textit{Pl}_{j}=\tau\textit{Pl}_{\textit{pbest}}+(1-\tau)\textit{Pl}_{\textit{gbest}}$ (14)

$\displaystyle\textit{Xpos}_{id}(t+1)=\textit{Pl}_{id}+\sigma|\mu\textit{best}-\textit{Xpos}(t)|\ln\frac{1}{y}$ (15)

Subtract the local optimal value of each particle from the global optimal value:

$Y=Pl_{gd}-{Pl}_{id}$ (16)

10.

If $Y<\sigma$, regenerate the next generation: using the equations for the mean best position of the particles, the position of the next generation of particles is calculated.

Figure 3.

Sample frames of the 4 categories.

11.

Verify whether the termination condition is reached. If it is, stop the process and output the result; otherwise, go to step 3.

If the result $Y$ is smaller than a definite order of magnitude ( $\sigma$ ) or is negative, the particle has dropped into a local minimum. The logistic map can then be used to locally perturb such particles, preventing them from settling into a local minimum too rapidly and from failing to discover their optimal locations.
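The procedure above can be summarized in a compact sketch. It is an illustration under stated assumptions, not the implementation evaluated in this work: a generic `objective` to be minimized stands in for the frame-based fitness of Eq. (11), the stall threshold `stall_tol` is kept separate from the contraction-expansion coefficient `sigma`, and all function names, bounds and default values are hypothetical.

```python
import itertools
import numpy as np

def logistic_stream(l=0.3):
    """Endless chaotic sequence from the logistic map L <- 4 L (1 - L)."""
    while True:
        l = 4.0 * l * (1.0 - l)
        yield l

def chaotic_points(stream, count, dim, low, high):
    """Map the next count*dim chaotic values into the box [low, high]^dim."""
    n = count * dim
    values = np.fromiter(itertools.islice(stream, n), dtype=float, count=n)
    return low + (high - low) * values.reshape(count, dim)

def cqpso(objective, dim, swarm=20, iters=200, low=-5.0, high=5.0,
          sigma=0.75, stall_tol=1e-6, seed=0):
    """Minimize objective with a chaotic QPSO loop (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    chaos = logistic_stream()
    x = chaotic_points(chaos, swarm, dim, low, high)    # chaotic initialization
    pbest = x.copy()
    pbest_fit = np.array([objective(p) for p in pbest])
    g = int(np.argmin(pbest_fit))                       # index of global best
    for _ in range(iters):
        gbest = pbest[g]
        mbest = pbest.mean(axis=0)                      # mean best position
        tau = rng.random((swarm, dim))
        attractor = tau * pbest + (1.0 - tau) * gbest   # Eq. (14)
        y = 1.0 - rng.random((swarm, dim))              # in (0, 1]
        sign = np.where(rng.random((swarm, dim)) > 0.5, -1.0, 1.0)
        x = attractor + sign * sigma * np.abs(mbest - x) * np.log(1.0 / y)
        # Eq. (16): particles whose personal best has collapsed onto the
        # global best are re-seeded from the chaotic sequence.
        stalled = np.linalg.norm(gbest - pbest, axis=1) < stall_tol
        stalled[g] = False                              # keep refining the leader
        if stalled.any():
            x[stalled] = chaotic_points(chaos, int(stalled.sum()), dim, low, high)
        x = np.clip(x, low, high)
        fit = np.array([objective(p) for p in x])
        better = fit < pbest_fit                        # update personal bests
        pbest[better] = x[better]
        pbest_fit[better] = fit[better]
        g = int(np.argmin(pbest_fit))
    return pbest[g].copy(), float(pbest_fit[g])

def sphere(p):
    """Simple test objective with its minimum at the origin."""
    return float(np.sum(p ** 2))

best_position, best_value = cqpso(sphere, dim=2)
```

Re-seeding only the current positions, while leaving personal bests untouched, means that the chaotic perturbation adds diversity without discarding the best solutions found so far.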

3.3 CQSI-IFCNN: Chaotic Quantum Swarm Intelligence based intuitionistic fuzzy convolution neural network for human action categorization in videos

In this work, the CNN takes as input video frames represented in the intuitionistic fuzzy domain, which enhances the quality of the frames. The first convolution layer employs six kernels of size 5 $\times$ 5 to process the input frames, followed by a subsampling layer with 6 kernels of size 2 $\times$ 2. The second convolution layer uses 12 kernels of size 5 $\times$ 5, followed by a subsampling layer with 12 kernels of size 2 $\times$ 2. The fully connected layer is the last layer of the CNN.

The CQSI-IFCNN is used to improve the output vector. Using $\delta x$ , the output vector is amplified to acquire better value, using chaos and quantum theory.

3.3.1 Procedure for CQSI-IFCNN training phase

•
Initialize the CNN learning rate as 1.
•
Initialize the batch size and the number of CNN epochs.
•
Initialize CQPSO iteration.
•
Check the convergence of CQPSO:

–
If the error rate has not changed for three consecutive iterations, assume convergence.
–
Regenerate the particles and set their positions using quantum PSO theory.

•
The output vector produced by the CNN is optimized using CQPSO, reducing the error rate of the CNN.
•
If the solution obtained by the swarm has a lower error rate than the old output vector, the vector value is updated.
•
CQPSO continues while the convergence result is unsatisfactory, until the maximum number of iterations is reached.
•
The accuracy of the CNN model reflects how well it predicts the actual value during the testing phase.
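Two of the rules in this procedure, the convergence check and the conditional update of the output vector, can be sketched as small helpers. The names `has_converged` and `accept_if_better`, and the tolerance `tol` used to decide that the error rate "has not changed", are assumptions introduced for illustration.

```python
def has_converged(errors, window=3, tol=1e-6):
    """True when the last `window` error rates are unchanged within `tol`."""
    if len(errors) < window:
        return False
    recent = errors[-window:]
    return max(recent) - min(recent) <= tol

def accept_if_better(current_vec, current_err, candidate_vec, candidate_err):
    """Keep the swarm's candidate only if it lowers the error rate."""
    if candidate_err < current_err:
        return candidate_vec, candidate_err
    return current_vec, current_err
```

In a training loop, `has_converged` would be evaluated after each CQPSO iteration on the recorded error history, and `accept_if_better` would guard every update of the CNN output vector.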

3.3.2 Algorithm: Chaos Quantum Swarm Intelligence based intuitionistic fuzzy convolution neural network for human action categorization in videos

Step 1:
Obtain the original video and convert it into frames.
Step 2:
Apply intuitionistic fuzzy Contrast intensification to enhance the quality of the video frames for further process.
Step 3:
Pass the input frames into 3D convolution network, to identify the significant features based on spatio-temporal evidence.

a)
In this work, the 3D CNN takes a cube of 11 frames as input, to extract the motion information.
b)
The original frame size of 160 $\times$ 120 is reduced to 120 $\times$ 120, in order to reduce memory usage and computational complexity.
c)
The Intuitionistic fuzzy 3D CNN, as shown in Fig. 1, contains 120 $\times$ 120 $\times$ 11 inputs along with the kernel size and the quantity of feature maps involved in each layer.
d)
The first convolutional layer uses 11 $\times$ 11 $\times$ 32 and second convolution layer uses 5 $\times$ 5 $\times$ 32 size kernels.

Figure 4.
Overall Performance comparison of three different methods, with four categories of action video.

e)
For the max pooling layers, the kernel size is 2 $\times$ 2, which progressively reduces the spatial information and helps to overcome the problem of overfitting.

*
Applying Chaotic Quantum Particle Swarm Optimization to detect the optimal feature vectors for categorization of the human action

⋅
Call function CQSI-IFCNN

f)
Lastly, all activation results of the preceding layer are converted into 6400 feature vectors in the fully connected layer of the CNN.
g)
The output units, which hold the result of categories of human action, are available in softmax layer.

Step 4:
After passing the feature vectors to the fully connected layer of the 3D CNN, the class to which the selected video belongs is given as output.

4. Results and discussion

This work, CQSI-IFCNN, is simulated using Matlab. One dataset used for human action recognition is the KTH dataset [20], which comprises six action categories: hand waving, walking, running, jogging, hand clapping and boxing. Each action is performed by 25 different subjects in four scenarios (outdoors, outdoors with scale variation, outdoors with different clothes, and indoors). The data are divided into two sets, train and test.

For simulation analysis, samples of four categories of action video are selected: walking, hand waving, cycling and surfing. The KTH dataset is used for two categories, namely walking and hand waving, and the remaining two categories (surfing and cycling) are collected from the UCF Sports Action dataset [21].

The sample frames of the four different categories are given in the Fig. 3.

The confusion matrix of the CQSI-IFCNN classifier on the action videos is shown in Table 2, where correct predictions appear in the diagonal cells; all four action categories are predicted with more than 94% accuracy, and three of them above 96%.

From Fig. 4, it is observed that the accuracy of human action categorization is greatly improved by CQSI-IFCNN. This is because the optimal knowledge gained by CQSI-IFCNN through chaotic quantum swarm intelligence overcomes premature decisions in action categorization, and uncertainty is handled by representing the video frames in the intuitionistic fuzzy domain. The other two existing methods fail to handle the ambiguity, vagueness and uncertainty that arise during feature point extraction and temporal feature extraction.

Table 1
Performance metrics obtained from four different categories of action categorization using CQSI-IFCNN

Class	Precision	Recall	F-measure
Walk	0.971	0.953	0.945
Wave	0.963	0.971	0.965
Cycling	0.962	0.968	0.961
Surfing	0.975	0.978	0.972
Average	0.96775	0.9675	0.96075

Figure 5.

Performance comparison of the 3 Methods, based on PRECISION.

Figure 6.

Performance Comparison of the 3 Methods, based on RECALL.

From Table 1 and Figs 5 and 6, it is observed that the average categorization rate of CQSI-IFCNN is about 96.7%. This approach produces better results than standard CNN and PSO-CNN, because CQSI-IFCNN can handle vagueness and indeterminacy when recognizing human action points in videos.

Table 2

Confusion matrix of CQSI-IFCNN for four different categories of human action

Class	Walk	Wave	Cycling	Surfing
Walk	97.2	1.6	0.4	0.8
Wave	1.6	96.3	0.7	1.4
Cycling	1.8	1.9	94.2	2.1
Surfing	1.2	0.8	0.5	97.5

From Table 2 it is observed that the diagonal cells give the accuracy of categorization for walking, hand waving, cycling and surfing. Human action recognition is strengthened by reducing the dimensionality of the vector search space with quantum swarm intelligence and by using the chaotic mechanism to select the best feature point trajectories during the learning phase of the 3D CNN.
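For readers who wish to derive summary figures from Table 2, the sketch below computes per-class rates with NumPy. It assumes that each row of Table 2 holds percentages for one true class (each row sums to 100) and that the four classes contain equally many test videos; under these assumptions the derived values need not coincide exactly with those reported in Table 1.

```python
import numpy as np

labels = ["Walk", "Wave", "Cycling", "Surfing"]
# Values copied from Table 2 (rows: true class, columns: predicted class)
confusion = np.array([
    [97.2, 1.6, 0.4, 0.8],
    [1.6, 96.3, 0.7, 1.4],
    [1.8, 1.9, 94.2, 2.1],
    [1.2, 0.8, 0.5, 97.5],
])

def per_class_rates(cm):
    """Return (precision, recall) per class from a confusion matrix."""
    diag = np.diag(cm)
    recall = diag / cm.sum(axis=1)       # share of each true class recovered
    precision = diag / cm.sum(axis=0)    # share of each prediction that is right
    return precision, recall

precision, recall = per_class_rates(confusion)
mean_recall = recall.mean()              # overall accuracy for balanced classes
```

Because the rows are already normalized, the recall of each class is simply its diagonal entry divided by 100.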

Figure 7.

Performance comparison of the 3 methods, based on time taken.

Figure 7 depicts the time taken by three different approaches, namely CNN, PSO-CNN and CQSI-IFCNN. The time taken by CQSI-IFCNN is considerably lower, even after the integration of several techniques, because the frames are represented by intuitionistic fuzzy components and the premature convergence of swarm intelligence is handled by the chaotic mechanism. IFCNN takes only the significant features into account, and the search space is greatly reduced by adopting CQSI. Thus, CQSI-IFCNN is faster than the other two approaches, which have a large search space and high computational complexity.

Figure 8.

Performance comparison of the 3 methods, based on error rate (MSE).

Figure 8 shows that CQSI-IFCNN produces a lower error rate of 0.0309 in action recognition, compared with the other two models, CNN and PSO-CNN, which gave error rates of 0.0742 and 0.0543 respectively. In the CNN and PSO-CNN models, significant feature vectors are identified through sampling, max pooling and filtering, which operate on randomly selected locations. In the CQSI-IFCNN model, quantum swarm intelligence is introduced to derive the potential spatio-temporal feature vectors more precisely and in parallel, which reduces false alarms during feature extraction.

Figure 9.

Computation Sequence of CQSI-IFCNN e.g. CYCLING.

Figure 10.

Computation Sequence of CQSI-IFCNN e.g. SURFING.

Figure 11.

Computation Sequence of CQSI-IFCNN e.g. WALKING.

Figure 12.

Computation Sequence of CQSI-IFCNN e.g. HAND WAVING.

Figures 9 (cycling), 10 (surfing), 11 (walking) and 12 (hand waving) show the step-by-step process of CQSI-IFCNN based human action recognition. After the input is received, the videos are converted to frames and preprocessed in the intuitionistic fuzzy domain, which enhances the quality of the frames by representing the pixels in terms of membership, hesitation and non-membership degrees. This representation suppresses insignificant feature points. After receiving the frames, the 3D CNN extracts the spatio-temporal features, drawing on chaotic quantum particle swarm optimization during the learning phase to fine-tune the weights, biases and constant parameters. Finally, with this hybrid intelligent approach, the technique categorizes the human actions with a high degree of precision.

5. Conclusion

This work focuses on three main aspects of human action categorization, using the hybrid intelligent CQSI-IFCNN model.

•
First, the vagueness and ambiguity present in action videos are handled by adopting intuitionistic fuzzy representations of the video frames.
•
Second, chaotic quantum swarm intelligence is adopted to improve the knowledge of the 3D CNN in extracting feature points.
•
Finally, the parameters used in the learning phase are fine-tuned.

From the simulation results, it is observed that CQSI-IFCNN exhibits higher accuracy in human action categorization. Compared with existing approaches such as CNN and PSO-CNN, CQSI-IFCNN categorizes action videos better (as shown in the tables and figures), owing to its ability to avoid premature selection of poor solutions and to handle the uncertainty that commonly arises in real-time videos.

6. Future work

The present work can be extended to a multi-action recognition system and to automatic action captioning in videos. Other bio-inspired intelligence techniques can be adopted for weight optimization and uncertainty handling.

References

1. Geng, Song. Human action recognition based on convolutional neural networks with a convolutional auto-encoder. 5th International Conference on Computer Sciences and Automation Engineering (ICCSAE 2015). 2015; pp. 933-939.

2. Ran, Duan, Gao, et al. Improved particle swarm optimization approach to path planning of amphibious mouse robot. In: Proceedings of the 6th IEEE Conference on Industrial Electronics and Applications, Beijing. 2011, pp. 1146-1149.

3. Bin, Wang, Sun. Niche quantum-behaved particle swarm optimization with chaotic mutation operator. Computer Applications Software. 2009; 26(1): 50-52.

4. Jin, Kim. Real-time human action recognition using CNN over temporal images for static video surveillance cameras. LNCS 9315. 2015; 330-339.

5. Duan, Liu. Chaotic artificial bee colony approach to Uninhabited Combat Air Vehicle (UCAV) path planning. Aerospace Science and Technology. 2010; 14: 535-541.

6. Syulistyo, Purnomo DMJ, Rachmadi, Wibowo. Particle Swarm Optimization (PSO) for training optimization on convolutional neural network (CNN). Journal of Computer Science and Information. 2016; 9(1): 52-58.

7. Song, Kim, Jeon. Deep learning for real-time robust facial expression recognition on a smartphone. In: Consumer Electronics (ICCE), IEEE International Conference. 2014, 564-567.

8. Bregonzio, Xiang, Gong. Fusing appearance and distribution information of interest points for action recognition. Pattern Recognition. 2012; 45(3): 1220-1234.

9. Chen, Hauptmann. MoSIFT: recognizing human actions in surveillance videos. Technological report, CMU-CS-09-161, Carnegie Mellon University, 2009, pp. 1-16.

10. Baccouche, Mamalet, Wolf, Garcia, Baskurt. Sequential deep learning for human action recognition. LNCS 7065. Springer-Verlag Berlin Heidelberg. 2011, pp. 29-39.

11. Juan, Wang. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision. doi: 10.1007/s11263-0, Springer Science+Business Media, LLC, 2008.

12. Gilbert, Illingworth, Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. IEEE International Conference on Computer Vision. November 2009; 925-931.

13. Haiam, Abdul-Azim, Hemayed. Human action recognition using trajectory-based representation. Egyptian Informatics Journal. 2015; 16: 187-198.

14. Sun, Yan, Cheong. Activity recognition using dense long-duration trajectories. In: Multimed. Expo (ICME). IEEE Int. Conf, 2010, pp. 322-327.

15. Raptis, Soatto. Tracklet descriptors for action modeling and video analysis. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol. 6311, LNCS (PART 1), 2010. pp. 577-590.

16. Ignatov. Real-time human activity recognition from accelerometer data using Convolutional Neural Networks. Applied Soft Computing. 2018; 26: 915-922.

17. Shi, Ding, Liu. Unsupervised feature learning for human activity recognition using smartphone sensors. In: Mining Intelligence and Knowledge Exploration: Second International Conference, 2014. pp. 99-107.

18. Kumaravel, Veni. Uncertainty handling of multi-object categorization in scenes using fuzzy swarm intelligence based deep belief network. International Journal of Engineering and Advanced Technology, 2019; 8(3S): 63-72.

19. Kwapisz, Weiss, Moore. Activity recognition using cell phone accelerometers. ACM SIGKDD Explorations Newsletter. 2011; 12(2): 74-82.

20. http://www.nada.kth.se/cvap/actions.

21. http://crcv.ucf.edu/data/UCF_Sports_Action.php.