A survey on action and event recognition with detectors for video data

Abstract

This paper summarizes the development of action and event detectors over the past three decades. The detectors are divided into 2D detectors, 3D detectors and deep learning detectors, according to whether they use temporal information and whether they use deep learning. The paper briefly introduces typical detectors of each type, explains their advantages, disadvantages and characteristics, and compares them. Comparing traditional feature detection methods with those based on deep learning, we found that first detecting low-level details, such as points, lines, surfaces and angles, and then performing action and event recognition is no longer the mainstream of current research. Owing to their strong generalization ability, end-to-end action and event recognition methods based on deep learning perform better than traditional methods. Finally, this paper proposes three research directions for action and event recognition based on feature detectors.

Keywords

Action recognition; detector; detection; deep learning

1. Introduction

The rich video data on the Internet is a very significant data source for video processing research. Feature detectors focus on detecting and identifying the features of interest in static images or dynamic videos. For example, in the application of behavioral action and event recognition in surveillance videos, it is necessary to detect and identify the local or global information required from surveillance videos containing various chaotic elements. Due to its wide application scenarios, action and event recognition has attracted more and more researchers’ attention in the field of computer vision.

Actions and events differ as follows [46]: an action mainly refers to a movement of the human body, whereas an event is a combination of multiple individual actions. Events are therefore more complex than basic actions. For example, running and jumping are two actions, while hurdling can be seen as an event that combines running and jumping [28].

To achieve high recognition accuracy for actions and events, feature detectors must produce robust results under variable illumination, viewpoint changes, partial occlusion, camera motion and zoom, and changes in the geometric scale of the detected objects. Few traditional feature detectors take the temporal or global information of an action or event into account. In contrast, most deep learning detectors consider the temporal information, the global information, or both, e.g., the Spatio-Temporal Progressive action detector [67], the structured segment network detector [69] and the Coarse-to-Fine Action detector [35].

In the field of action and event recognition, no existing article summarizes and compares traditional detectors and deep learning detectors. In this paper, the deep learning detectors and the classical traditional detectors of the past few decades are analyzed and compared in detail at the algorithm level, which will help researchers further improve the performance of existing detectors and the efficiency of action and event recognition. The main contributions of this paper are as follows:

(1) The main traditional detectors and deep-learning-based detectors of the last few decades are summarized in detail;
(2) The specific applications of various detectors at the algorithm level are described in detail;
(3) Three future research directions in action and event recognition are proposed.

2. Feature detector taxonomy

Feature detectors represent static images or dynamic video clips as combinations of interest points or interest regions, and are suitable for recognizing targets with sharp extrema under stable lighting conditions. Depending on whether temporal information is involved and whether deep learning is used, feature detectors can be classified into three groups: 2D detectors, such as the Harris corner detector [21], the Harris–Laplace detector [43–45], the difference of Gaussian detector [40,41] and the deformable parts model detector [14,16]; 3D detectors, such as the Harris3D detector [32,33,37], the cuboid detector [9], the Hessian3D detector [64] and the spatiotemporal DPM detector [55]; and deep learning detectors, such as the faster R-CNN detector [47], the T-CNN detector [22], the STEP detector [67], the SSN detector [69], the TACNet detector [52], the privacy-preserving detector [49], the coarse-to-fine action detector [35], the TxVAD detector [65] and the TubeR detector [68].

In this paper, we concentrate on reviewing the most popular proposed feature detectors, which are from either top conferences or top journals in computer vision and pattern recognition, for event and action recognition during recent years. These proposed feature detectors can be firstly classified into three categories, which are 2D detectors, 3D detectors and deep learning detectors, as shown in Figure 1.

Fig. 1. The hierarchical taxonomy of this review.

2D and 3D detectors are traditional methods that use hand-crafted features. Since 2014, owing to hardware development, sufficient data, and the excellent performance of deep learning, traditional methods in video understanding have been replaced by deep learning methods [13,51]. The deep learning methods discussed in this paper are mainly convolutional neural networks (CNNs) and Transformers. Compared with CNNs, Transformers have achieved competitive performance in computer vision in the past two years [42,66]. However, because they lack inductive bias, Transformers trained from scratch perform worse than CNNs on small-scale datasets (such as UCF101-24). After large-scale pre-training, Transformers can obtain better results than CNNs on representative datasets (such as Kinetics-700) [10,56].

The rest of the review paper is organized as follows. Section 2 presents the taxonomy of feature detectors. Section 3 briefly introduces several mainstream video datasets, and Section 4 covers approaches for 2D detectors. Section 5 presents approaches for 3D detectors and Section 6 deep-learning-based approaches. Finally, Section 7 concludes the review paper. The reviewed approaches are summarized in Table 1.

Table 1. Summary of the approaches for feature detectors.

Reported paper	Year	Method	Type	Dataset
[21] Harris et al.	1988	Harris corner detector	2D detectors	-
[40] Lowe et al.	1999	Difference of Gaussian detector	2D detectors	-
[45] Mikolajczyk et al.	2004	Harris–Laplace detector	2D detectors	-
[50] Rosten et al.	2006	Learned FAST corner detector	2D detectors	-
[16] Felzenszwalb et al.	2008	Deformable parts model detector	2D detectors	-
[33] Laptev et al.	2003	Harris3D detector	3D detectors	-
[9] Dollár et al.	2005	Cuboid detector	3D detectors	-
[64] Willems et al.	2008	Hessian3D detector	3D detectors	-
[61] Wang et al.	2009	Dense sampling detector	3D detectors	-
[55] Tian et al.	2013	Spatiotemporal DPM detector	3D detectors	-
[47] Ren et al.	2015	Faster R-CNN detector	Deep learning detectors	-
[49] Ren et al.	2016	Privacy-preserving detector	Deep learning detectors	-
[69] Zhao et al.	2017	SSN detector	Deep learning detectors	-
[22] Hou et al.	2017	T-CNN detector	Deep learning detectors	UCF101-24, JHMDB-21
[52] Song et al.	2019	TACNet detector	Deep learning detectors	UCF101-24, JHMDB-21
[67] Yang et al.	2019	STEP detector	Deep learning detectors	UCF101-24, JHMDB-21
[35] Li et al.	2020	Coarse-to-fine action detector	Deep learning detectors	UCF101-24, JHMDB-21
[65] Wu et al.	2022	TxVAD detector	Deep learning detectors	UCF101-24, JHMDB-21, AVA, AVA-Kinetics
[68] Zhao et al.	2022	TubeR detector	Deep learning detectors	UCF101-24, JHMDB-21, AVA

3. Dataset

As data-driven approaches, deep learning methods have become mainstream in computer vision, so datasets have a great impact on their performance. Several mainstream video datasets for action and event recognition are briefly introduced as follows:

UCF101-24 [53] is a subset of the UCF101 dataset. It has 24 classes in total and is relatively small in scale. The videos are divided into clips, each clip contains only one action, and the action category is highly correlated with the background.

JHMDB-21 [24] is a subset of HMDB51 [31] with 21 classes. UCF101-24 and JHMDB-21 are densely labeled, that is, every frame is labeled.

The AVA dataset [20] is large in scale, with 430 15-minute movie clips, a total of 80 categories, and more diverse behavior categories. AVA is sparsely labeled, that is, one frame is labeled every second.

The Kinetics dataset [3–5] is widely used in the field of behavior recognition. It comes in three versions, Kinetics-400, Kinetics-600 and Kinetics-700, named after their numbers of categories. The clips are taken from YouTube videos, and each clip lasts about 10 seconds. The actions fall into three types: single-person actions, person-person actions, and person-object actions. A Kinetics version with more categories is a superset of the versions with fewer categories. The Kinetics dataset not only serves as a standard benchmark for evaluating behavior recognition models but also plays an important role in other fields, such as self-supervised learning [56].

AVA-Kinetics [34] contains videos from AVA and Kinetics-700 [4], where the Kinetics-700 videos are annotated in the AVA style. Like the AVA dataset, the AVA-Kinetics dataset has a total of 80 classes.

The EPIC-Kitchens dataset [7] contains 55 hours of first-person video shot in kitchens. There are 39.6K action segments with corresponding object bounding boxes, and 149 action classes have more than 50 samples. The dataset was recorded by 32 volunteers using head-mounted GoPro cameras. EPIC-Kitchens-100 [8] extends the EPIC-Kitchens dataset with denser annotations and can be used for action recognition, action detection, action anticipation, and cross-modal retrieval.

4. 2D detectors

4.1. Harris corner detector

Corner detection overlaps with interest point detection and is frequently used in event and action recognition. The Harris corner detector [21] is one of the most classic corner detection algorithms.

Let $I$ be an image, $I_{x}$ and $I_{y}$ be the partial derivatives of $I$, $(x, y) \in Ω$ be the pixels of an image patch $Ω$, and $(u, v)$ be a shift of the image patch. Then the change of the image under the shift [21] is

$E (u, v) = \sum_{x, y} w (x, y) {[I (x + u, y + v) - I (x, y)]}^{2},$
(1)

where $w (x, y)$ is a window function, which is typically a Gaussian function. After a first-order Taylor expansion of the shifted image, $I (x + u, y + v) \approx I (x, y) + u I_{x} + v I_{y}$, the change of the image [21] can be approximated by

$E (u, v) \approx [u \; v] \, M \, {[u \; v]}^{T},$
(2)

where the structure tensor is obtained by

$M = \sum_{x, y} w (x, y) \begin{bmatrix} I_{x}^{2} & I_{x} I_{y} \\ I_{x} I_{y} & I_{y}^{2} \end{bmatrix}.$

To avoid the expensive eigenvalue decomposition of the matrix $M$, Harris and Stephens [21] proposed a corner response function $R$, which is defined as

$R = \det M - k {(\mathrm{trace} \, M)}^{2}.$
(3)

The Harris corner detection algorithm [21] involves the following steps [54]:

(1) Compute the first image derivatives $I_{x}$ and $I_{y}$ with 3-by-3 derivative masks, as

$I_{x} = I \otimes \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$
(4)

and

$I_{y} = I \otimes {\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}}^{T};$
(5)

(2) Compute the smoothed products of derivatives [21] at every pixel by

$I_{x}^{2} = g * (I_{x} \odot I_{x}), \quad I_{x} I_{y} = g * (I_{x} \odot I_{y}), \quad I_{y}^{2} = g * (I_{y} \odot I_{y}),$
(6)

where $\odot$ denotes the element-wise product and $g$ is a weighting kernel, which is usually a Gaussian kernel;

(3) Compute the corner response [21] of Eq. (3) at every pixel, where $\det M = I_{x}^{2} \odot I_{y}^{2} - (I_{x} I_{y}) \odot (I_{x} I_{y})$ and $\mathrm{trace} \, M = I_{x}^{2} + I_{y}^{2}$ are formed from the smoothed products of Eq. (6), and $k$ is a small number, such as 0.07;

(4) Find the local maxima of $R$ above a threshold. For example, if the size of the window is 5-by-5, then the non-zero elements [21] of the matrix

C = (R == ordfilt2 (R, 25, ones (5))) & (R > threshold)
(7)

are the detected corners, where ordfilt2 (R, 25, ones (5)) replaces each element of $R$ by the 25th element, i.e., the maximum, of its sorted 5-by-5 neighbourhood.
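The four steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation of [21] or [54]: the function names `harris_response` and `local_maxima`, the 3-by-3 binomial kernel standing in for the Gaussian $g$, the replicated image border, and the toy image are all choices made here for illustration.

```python
def harris_response(img, k=0.07):
    """Corner response R = det(M) - k * trace(M)^2 for an image given as a list of rows."""
    h, w = len(img), len(img[0])

    def at(a, y, x):
        # Replicate border values so that every 3x3 window is defined (an assumption).
        return a[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    # Step (1): derivatives with the mask [-1 0 1; -1 0 1; -1 0 1] and its transpose.
    ix = [[sum(at(img, y + d, x + 1) - at(img, y + d, x - 1) for d in (-1, 0, 1))
           for x in range(w)] for y in range(h)]
    iy = [[sum(at(img, y + 1, x + d) - at(img, y - 1, x + d) for d in (-1, 0, 1))
           for x in range(w)] for y in range(h)]

    # Step (2): element-wise products smoothed by g (3x3 binomial kernel, an assumption).
    g = ((1, 2, 1), (2, 4, 2), (1, 2, 1))

    def smoothed_product(a, b):
        prod = [[a[y][x] * b[y][x] for x in range(w)] for y in range(h)]
        return [[sum(g[dy + 1][dx + 1] * at(prod, y + dy, x + dx)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 16.0
                 for x in range(w)] for y in range(h)]

    ixx = smoothed_product(ix, ix)
    ixy = smoothed_product(ix, iy)
    iyy = smoothed_product(iy, iy)

    # Step (3): R = det(M) - k * trace(M)^2 at every pixel.
    return [[ixx[y][x] * iyy[y][x] - ixy[y][x] ** 2 - k * (ixx[y][x] + iyy[y][x]) ** 2
             for x in range(w)] for y in range(h)]


def local_maxima(r, threshold, radius=2):
    """Step (4): pixels that equal the maximum of their window and exceed the threshold."""
    h, w = len(r), len(r[0])
    corners = []
    for y in range(h):
        for x in range(w):
            window = [r[yy][xx]
                      for yy in range(max(0, y - radius), min(h, y + radius + 1))
                      for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            if r[y][x] > threshold and r[y][x] == max(window):
                corners.append((y, x))
    return corners


# Toy 12x12 image whose lower-right quadrant is bright, so it has a single corner.
image = [[1.0 if y >= 6 and x >= 6 else 0.0 for x in range(12)] for y in range(12)]
response = harris_response(image)
corners = local_maxima(response, threshold=1.0)
```

On this toy image the response is positive near the corner of the bright quadrant, negative along its straight edges (where only one gradient direction is present), and zero in flat regions, which is the behaviour Eq. (3) is designed to produce.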

4.2. Harris–Laplace detector

It is well known that the Harris corner detector [21] is not invariant to scale. Mikolajczyk and Schmid [43–45] addressed this issue by proposing the Harris–Laplace detector.

Let $I$ be an image. Then, the measure of the Harris–Laplace detector [43–45] is

$μ (x, y) = σ_{D}^{2} \, g (σ_{I}) * \begin{bmatrix} L_{x}^{2} (x, y, σ_{D}) & L_{x} L_{y} (x, y, σ_{D}) \\ L_{x} L_{y} (x, y, σ_{D}) & L_{y}^{2} (x, y, σ_{D}) \end{bmatrix},$
(8)

where $σ_{I}$ is the integration scale, $σ_{D}$ is the derivation scale, $L_{x} (x, y, σ_{D}) = \partial (g (σ_{D}) * I) / \partial x$ and $L_{y} (x, y, σ_{D}) = \partial (g (σ_{D}) * I) / \partial y$ are the partial derivatives of the Gaussian-smoothed image $I$, and $g$ is the Gaussian kernel with

$g (Σ) = \frac{1}{2 π \sqrt{| Σ |}} \exp \left( - (x, y) \, Σ^{- 1} \, (x, y)^{T} / 2 \right),$

where $g (σ)$ denotes the isotropic case $Σ = \mathrm{diag} (σ^{2}, σ^{2})$.

They select $σ_{I}$ by maximizing a normalized Laplacian [43–45] with

$σ_{I} = \arg \max_{σ_{I}} | σ_{I}^{2} (L_{x x} (x, y, σ_{I}) + L_{y y} (x, y, σ_{I})) |$
(9)

and set the derivation scale to $σ_{D} = s σ_{I}$ [43–45], where

$s = \underset{s \in \{0.5, \dots, 0.75\}}{\arg \max} \; λ_{\min} (μ) / λ_{\max} (μ),$
(10)

$λ_{\max} (μ)$ is the largest eigenvalue of $μ$, and $λ_{\min} (μ)$ is the smallest eigenvalue of $μ$.

Finally, they take the points $(x, y)$ that satisfy $\det (μ (x, y)) - k \times \mathrm{trace}^{2} (μ (x, y)) > threshold$ as affine invariant corners or interest points.

4.3. Difference of Gaussian detector

Lowe [40,41] proposed an efficient scale-space extrema detection algorithm, called Difference-of-Gaussian (DoG), as a close approximation of the scale-normalized Laplacian of Gaussian, to identify potential interest points that are invariant to scale and orientation.

Let $I (x, y)$ be an image, $G (x, y, σ)$ be a variable-scale Gaussian [40,41] given by

$G (x, y, σ) = \frac{1}{2 π σ^{2}} \exp \left( - (x^{2} + y^{2}) / 2 σ^{2} \right),$
(11)

and $k$ be a constant multiplicative factor.

Then, the DoG function $D (x, y, σ)$ [40,41] is defined as

$D (x, y, σ) = G (x, y, k σ) * I (x, y) - G (x, y, σ) * I (x, y),$
(12)

which is the difference between a more blurred image and a less blurred one. Each octave of scale space is divided into $s$ intervals, where $k = 2^{1 / s}$.

A sample point is selected as a local maximum or minimum only if its DoG value is the largest or the smallest among all 26 neighbors in its 3-by-3-by-3 neighborhood, which spans the current scale and the two adjacent scales. Owing to its efficiency and scale invariance, the DoG has been widely used in interest point detection.
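As a rough illustration of Eqs. (11) and (12) and of the 3-by-3-by-3 extremum test, the following sketch uses only the Python standard library. The function names, the nested-list layout `dog[i][y][x]` (scale index first), and the base scale 1.6 used in the example are assumptions made here, not part of the description above.

```python
import math


def gaussian(x, y, sigma):
    """Variable-scale Gaussian G(x, y, sigma) of Eq. (11)."""
    return math.exp(-(x * x + y * y) / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)


def octave_scales(sigma0, s):
    """Scales sigma0 * k^i for i = 0..s with k = 2^(1/s); the scale doubles over one octave."""
    k = 2.0 ** (1.0 / s)
    return [sigma0 * k ** i for i in range(s + 1)]


def dog_kernel(x, y, sigma, k):
    """DoG kernel G(x, y, k*sigma) - G(x, y, sigma); convolving it with I gives Eq. (12)."""
    return gaussian(x, y, k * sigma) - gaussian(x, y, sigma)


def is_extremum(dog, i, y, x):
    """True if dog[i][y][x] is strictly larger or smaller than all 26 neighbours."""
    centre = dog[i][y][x]
    neighbours = [dog[i + di][y + dy][x + dx]
                  for di in (-1, 0, 1) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                  if (di, dy, dx) != (0, 0, 0)]
    return centre > max(neighbours) or centre < min(neighbours)


# Example: three intervals per octave, starting from an assumed base scale of 1.6.
scales = octave_scales(1.6, 3)

# A 3x3x3 stack of DoG values with a single peak in the middle.
stack = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
stack[1][1][1] = 1.0
peak_found = is_extremum(stack, 1, 1, 1)
```

The caller is expected to pass only interior indices to `is_extremum`, since the test needs one neighbour on each side along the scale and both spatial axes.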

4.4. Learned FAST corner detector

Detector-based approaches can also be used in real-time frame-rate applications. However, most traditional detectors, such as the DoG detector and the Harris detector, are not suitable for real-time applications because of their computational complexity. Against this background, Rosten et al. [50] proposed a high-speed machine-learning-based corner detection algorithm, called the learned FAST (Features from Accelerated Segment Test) corner detector. The approach involves two main steps:

Firstly, the segment test criterion is used to detect all corners in a set of target images $T$. Specifically, let $x \in \{1, 2, \dots, 16\}$ index the sixteen pixels on a circle around the candidate corner $p$, and let $I_{x}$ and $I_{p}$ be the intensities of $x$ and $p$. Then, $p$ is a corner only if there is a set of $n$ contiguous pixels on the circle such that $I_{x} ⩾ I_{p} + t$ holds for all of them, or $I_{x} ⩽ I_{p} - t$ holds for all of them, where $t$ is a threshold. At the end of this step, the pixels of $T$ are divided into a corner set $c$ with size $n_{c}$ and a non-corner set $\bar{c}$ with size $n_{\bar{c}}$.

Secondly, the $x$ that yields the most information [50] is selected by

$x = \arg \max_{x \in \{1, 2, \dots, 16\}} \{ H (P) - H (P_{d | x}) - H (P_{s | x}) - H (P_{b | x}) \},$
(13)

where $P$ is the set of all pixels in all training images,

$H (P) = (n_{c} + n_{\bar{c}}) \log_{2} (n_{c} + n_{\bar{c}}) - n_{c} \log_{2} n_{c} - n_{\bar{c}} \log_{2} n_{\bar{c}}$
(14)

[50] is the entropy of $P$, and $P$ is partitioned into a darker-pixel subset $P_{d | x}$, a similar-pixel subset $P_{s | x}$, and a brighter-pixel subset $P_{b | x}$ when the intensity of $x$ is used as the partition condition. Then, for each subset, the most informative $x$ for that subset is used to divide it again into three subsets, until the entropy of a subset is zero, i.e., the subset contains only corners or only non-corners.

In addition, Rosten et al. [50] verified by experiments that their detector not only improves speed but also maintains the detection quality required by real-time corner detection applications.
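The segment test and the entropy-based selection of Eqs. (13) and (14) can be sketched as follows. This is a minimal sketch, not the code of [50]: the function names, the representation of the circle as a list of sixteen intensities, the default run length `n=12`, and the convention that an empty set contributes zero entropy are assumptions made for illustration.

```python
import math


def segment_test(ring, i_p, t, n=12):
    """True if n contiguous ring pixels are all >= i_p + t or all <= i_p - t."""
    for sign in (1, -1):
        # sign = 1 tests "brighter", sign = -1 tests "darker".
        flags = [sign * (v - i_p) >= t for v in ring]
        run = 0
        for flag in flags + flags:  # the doubled list handles runs that wrap around
            run = run + 1 if flag else 0
            if run >= n:
                return True
    return False


def entropy(n_c, n_nc):
    """Entropy H(P) of Eq. (14) for n_c corners and n_nc non-corners (0 * log 0 := 0)."""
    def xlog2x(v):
        return v * math.log2(v) if v > 0 else 0.0
    return xlog2x(n_c + n_nc) - xlog2x(n_c) - xlog2x(n_nc)


def information_gain(parent, subsets):
    """Bracketed term of Eq. (13): H(P) minus the entropies of the d, s and b subsets."""
    return entropy(*parent) - sum(entropy(*sub) for sub in subsets)


# A ring with 12 bright pixels in a row around a dark centre passes the test.
is_corner = segment_test([200] * 12 + [100] * 4, i_p=100, t=20)

# A split that separates 4 corners from 4 non-corners perfectly recovers all of H(P).
gain = information_gain((4, 4), [(4, 0), (0, 0), (0, 4)])
```

Each pair passed to `entropy` and `information_gain` is `(number of corners, number of non-corners)`, so choosing the pixel position of Eq. (13) amounts to evaluating `information_gain` once per position and keeping the largest value.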

4.5. Deformable parts model detector

An object detector, like an interest point detector, can also be used as a feature detector. Felzenszwalb et al. [14,16] proposed a latent SVM object detector, which is a mixture of deformable parts models (DPM).

Let $D = \{ (x^{(i)}, y^{(i)}) \mid 1 ⩽ i ⩽ n \}$ be the labeled training examples, where $x^{(i)}$ is the i-th example and $y^{(i)} \in \{ - 1, 1 \}$ is its label, with $y^{(i)} = - 1$ if $x^{(i)}$ is a negative example and $y^{(i)} = 1$ if $x^{(i)}$ is a positive example, as required by the hinge loss below.

Then, they construct the classification function of the latent SVM object detector [16], as

$f_{β} (x) = \max_{z \in Z (x)} β \cdot Φ (x, z),$
(15)

where $f_{β} (x)$ is the score or classification result of the mixture model on an example $x$, $z$ is a latent variable vector, $β = (β_{1}, β_{2}, \dots, β_{n_{component}})$ is the parameter vector, $β_{i}$ is the parameter vector of the i-th component, $Φ (x, z) = (Φ (x, z_{1}), \dots, Φ (x, z_{n_{component}}))$ is the feature vector of the mixture model, and $Φ (x, z_{i})$ is the feature vector of the i-th component.

To use the classification function of the latent SVM model for object detection, they train the model parameters $β$ by minimizing the $\ell_{2}$-regularized hinge loss function [16], as

$L_{D} (β) = \frac{1}{2} ‖ β ‖^{2} + C \sum_{i = 1}^{n} \max (0, 1 - y^{(i)} f_{β} (x^{(i)})) .$
(16)

Object detection with the latent SVM object detector achieved state-of-the-art results on many datasets, such as the PASCAL and INRIA person datasets.
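To make Eqs. (15) and (16) concrete, the sketch below evaluates the latent score and the regularized hinge loss on toy data. It assumes labels in {-1, 1} and represents each example simply as a list of candidate feature vectors $Φ (x, z)$, one per latent value $z$; the function names and the toy numbers are illustrative and are not taken from [16].

```python
def dot(a, b):
    """Inner product of two equally long vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def latent_score(beta, candidate_features):
    """Eq. (15): maximise beta . Phi(x, z) over the latent values z of one example."""
    return max(dot(beta, phi) for phi in candidate_features)


def latent_svm_loss(beta, examples, c):
    """Eq. (16): 0.5 * ||beta||^2 + C * sum of hinge losses, with labels y in {-1, +1}."""
    regulariser = 0.5 * dot(beta, beta)
    hinge = sum(max(0.0, 1.0 - y * latent_score(beta, feats)) for feats, y in examples)
    return regulariser + c * hinge


# Toy data: each example offers two latent placements, i.e. two candidate feature vectors.
beta = [1.0, -1.0]
examples = [
    ([[2.0, 0.0], [0.0, 1.0]], +1),   # best score 2.0, so the hinge term is 0
    ([[0.0, 0.5], [0.5, 0.0]], -1),   # best score 0.5, so the hinge term is 1.5
]
loss = latent_svm_loss(beta, examples, c=2.0)
```

The maximisation inside `latent_score` is what makes the model latent: for a positive example the loss only asks that some placement scores highly, while for a negative example every placement must score low.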

5. 3D detectors

5.1. Harris3D detector

Laptev and Lindeberg [32,33,37] proposed the Harris3D corner detector, a spatio-temporal generalization of the well-known Harris corner detector [21], for extracting 3D interest points from given video clips.

Let $I_{t}$ be a frame at time $t$. Then a video clip can be denoted by $V = (I_{1}, I_{2}, \dots, I_{t}, \dots, I_{n})$ . The task of the Harris3D corner detector is to find spatio-temporal points which have significant changes in $x$, $y$ and $t$. These points can be found with a second moment matrix [33], as

$H (x, y, t) = g (\cdot; σ, τ) * \begin{bmatrix} V_{x}^{2} & V_{x} V_{y} & V_{x} V_{t} \\ V_{x} V_{y} & V_{y}^{2} & V_{y} V_{t} \\ V_{x} V_{t} & V_{y} V_{t} & V_{t}^{2} \end{bmatrix} .$
(17)

The matrix $H (x, y, t)$ is composed of the first-order spatial derivatives $V_{x}$ and $V_{y}$ and the first-order temporal derivative $V_{t}$, averaged using a separable spatio-temporal Gaussian weighting function [33]

$g (\cdot; σ, τ) = \frac{1}{\sqrt{(2 π)^{3} σ^{4} τ^{2}}} \exp \left( - (x^{2} + y^{2}) / 2 σ^{2} - t^{2} / 2 τ^{2} \right),$
(18)

where $σ^{2}$ is a spatial scale and $τ^{2}$ is a temporal scale.

The two parameters can be obtained by maximizing a normalized spatio-temporal Laplace operator [33], as

$(σ, τ) = \arg \max_{σ, τ} \{ \nabla_{norm}^{2} V \},$
(19)

that is,

$(σ, τ) = \arg \max_{σ, τ} \{ σ^{2} τ^{1 / 2} (V_{x x} + V_{y y}) + σ τ^{3 / 2} V_{t t} \} .$
(20)

Let

$R (x, y, t) = \det (H (\cdot)) - k \times \mathrm{trace}^{3} (H (\cdot))$
(21)

be the Harris3D response function, which is defined in Eq. (8) of paper [32]. Then, they regard $(x, y, t)$ as an interest point if $R (x, y, t)$ is larger than a threshold.
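As a small worked example of Eq. (21), the response can be evaluated for a given 3-by-3 second moment matrix as follows. The function names and the value `k=0.005` are assumptions made for illustration; computing $H$ itself from a video (Eq. (17)) is not shown.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def harris3d_response(h_matrix, k=0.005):
    """Eq. (21): R = det(H) - k * trace(H)^3 for one second moment matrix H."""
    trace = h_matrix[0][0] + h_matrix[1][1] + h_matrix[2][2]
    return det3(h_matrix) - k * trace ** 3


# Equal variation along x, y and t gives a positive response.
isotropic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Variation along x only (an edge that does not move) gives a negative response.
edge_like = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
r_isotropic = harris3d_response(isotropic)
r_edge = harris3d_response(edge_like)
```

As in the 2D case, the determinant vanishes whenever the gradients in the neighbourhood span fewer than three directions, so only points that vary in $x$, $y$ and $t$ simultaneously can exceed the threshold.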

5.2. Cuboid detector

Although the generalized Harris3D detector is quite effective at detecting spatio-temporal corners, it has two main issues [9]:

(1) True spatio-temporal corners are quite rare.
(2) Spatio-temporal corners are not always the features one needs for general action recognition.

Dollár et al. [9] addressed these issues by proposing a spatio-temporal Cuboid detector for action recognition. The Cuboid detector is based on both spatial Gaussian filters and temporal Gabor filters.

The response function of the Cuboid detector has the form
$R (x, y, t) = {(I * g * h_{ev})}^{2} + {(I * g * h_{od})}^{2},$
(22)
where $g (x, y, σ)$ is a spatial Gaussian filter, and $h_{ev}$ and $h_{od}$ are a quadrature pair of temporal Gabor filters defined by $h_{ev} (t, τ, ω) = - \cos (2 π t ω) e^{- t^{2} / τ^{2}}$ and $h_{od} (t, τ, ω) = - \sin (2 π t ω) e^{- t^{2} / τ^{2}}$ .

In addition, they use $ω = 4 / τ$ , so that the detector has only two parameters, a spatial parameter σ and a temporal parameter τ. Finally, the local maxima of the response function are selected as interest points.
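The temporal part of Eq. (22) can be illustrated on the intensity of a single pixel over time. The sketch below is an illustration under assumptions: the spatial Gaussian smoothing $g$ is omitted, the filters are truncated at three times τ, borders are zero-padded, and all function names are chosen here rather than taken from [9].

```python
import math


def gabor_pair(tau, radius=None):
    """Sampled temporal Gabor filters h_ev and h_od with omega = 4 / tau."""
    omega = 4.0 / tau
    if radius is None:
        radius = int(math.ceil(3 * tau))  # truncation radius, an assumption
    ts = range(-radius, radius + 1)
    h_ev = [-math.cos(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2) for t in ts]
    h_od = [-math.sin(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2) for t in ts]
    return h_ev, h_od


def convolve_same(signal, kernel):
    """1D convolution with zero padding; the output has the same length as the signal."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for t0 in range(n):
        acc = 0.0
        for j, weight in enumerate(kernel):
            u = j - radius  # temporal offset of this kernel tap
            if 0 <= t0 - u < n:
                acc += weight * signal[t0 - u]
        out.append(acc)
    return out


def cuboid_response(signal, tau):
    """Temporal part of Eq. (22): (s * h_ev)^2 + (s * h_od)^2 at every time step."""
    h_ev, h_od = gabor_pair(tau)
    even = convolve_same(signal, h_ev)
    odd = convolve_same(signal, h_od)
    return [e * e + o * o for e, o in zip(even, odd)]


tau = 16.0
# A pixel that flickers at the filter frequency omega = 4 / tau = 0.25 cycles per frame.
flicker = [math.cos(2 * math.pi * 0.25 * t) for t in range(200)]
constant = [1.0] * 200
r_flicker = cuboid_response(flicker, tau)
r_constant = cuboid_response(constant, tau)
```

Away from the borders, the periodic signal produces a large response while the constant signal produces almost none, which matches the intent of the detector: it responds to intensity changes with a distinct temporal frequency rather than to static structure.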

5.3. Hessian3D detector

According to Willems et al. [64], the generalized Harris3D detector [32,33,37] is time-consuming and yields only sparse features, while the Cuboid detector [9] generates features that are not scale-invariant.

Thus, Willems et al. [64] proposed an efficient spatio-temporal feature detector that is dense yet scale-invariant, namely the Hessian3D detector. They search for local extrema of the determinant of the Hessian and, for efficiency, approximate all 3D convolutions by box filters computed on the integral video.

Let $V (\cdot)$ be a video clip, $L (\cdot; σ, τ)$ [64] be the Gaussian-smoothed video with

$L (\cdot; σ, τ) = g (\cdot; σ, τ) * V (\cdot),$
(23)

$g (\cdot; σ, τ)$ be a variable-scale Gaussian, $L_{x}$ (similarly $L_{y}$ and $L_{t}$) be the first-order partial derivatives of the smoothed video, and $L_{x x}$ (similarly $L_{x y}$, etc.) be its second-order partial derivatives.

Then, the spatio-temporal Hessian [64] is defined as

$H (x, y, t) = \begin{bmatrix} L_{x x} & L_{x y} & L_{x t} \\ L_{y x} & L_{y y} & L_{y t} \\ L_{t x} & L_{t y} & L_{t t} \end{bmatrix} .$
(24)

The estimated scales $\tilde{σ}$ and $\tilde{τ}$ [64] can be selected by maximizing the scale-normalized determinant of the Hessian, as

$(\tilde{σ}, \tilde{τ}) = \arg \max_{σ, τ} \{ {(\det H)}^{norm} \},$
(25)

that is,

$(\tilde{σ}, \tilde{τ}) = \arg \max_{σ, τ} \{ σ^{4} τ^{2} L_{x x} L_{y y} L_{t t} \} .$
(26)

Furthermore, they obtain the correct scale parameters σ and τ by $(σ, τ) = \sqrt{3 / 2} \, (\tilde{σ}, \tilde{τ})$ . Finally, the spatio-temporal interest points are the extrema of the strength measure $S = | \det H (x, y, t) |$ above a specified threshold.

5.4. Spatiotemporal DPM detector

Tian et al. [55] proposed a spatiotemporal deformable part model (SDPM) for action detection by generalizing the 2D deformable part models [14,16]. The SDPM detector automatically selects the most discriminative 3D sub-volumes as parts and employs the volumetric HOG3D [29] descriptor.

Let $(x, y, t, l)$ be a point at level $l$ of the HOG3D feature pyramid, $F_{0}$ be the $3 \times 3 \times 3 \times 20$ dimensional root filter, $F_{i}$ be the $3 \times 3 \times 1 \times 20$ dimensional part filters, $α (x, y, t, l)$ be the $3 \times 3 \times 3 \times 20$ dimensional features at $(x, y, t, l)$ , $β (x_{i}^{'}, y_{i}^{'}, t_{i}^{'}, l)$ be the $3 \times 3 \times 1 \times 20$ dimensional features at $(x_{i}^{'}, y_{i}^{'}, t_{i}^{'}, l)$ , $(x_{i}, y_{i}, t_{i})$ be the anchor position of the i-th part, and $Z$ be the latent variables indicating the set of all possible part locations.

Then, the score of a detection volume at $(x, y, t, l)$ [55] is the sum of the root filter score on the volume and the part filter scores of the best possible sub-volumes, as

$score (x, y, t, l) = F_{0} \cdot α (x, y, t, l) + \sum_{1 ⩽ i ⩽ n} \max_{(x_{i}^{'}, y_{i}^{'}, t_{i}^{'}) \in Z} [F_{i} \cdot β (x_{i}^{'}, y_{i}^{'}, t_{i}^{'}, l) - ε (i, X_{i})],$
(27)

where $ε (i, X_{i}) = d_{i} \cdot X_{i}^{T}$ are the deformation costs, $d_{i}$ are coefficients, and

$X_{i} = [| x_{i}^{'} - x_{i} |, | y_{i}^{'} - y_{i} |, | t_{i}^{'} - t_{i} |, {| x_{i}^{'} - x_{i} |}^{2}, {| y_{i}^{'} - y_{i} |}^{2}, {| t_{i}^{'} - t_{i} |}^{2}]$
(28)

are the deformation features [55].

Using a sliding sub-volume approach, an action is detected at a given spatiotemporal location if the detection volume there scores above a threshold.
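The scoring rule of Eqs. (27) and (28) can be sketched with precomputed filter responses. In this illustrative sketch the filter responses $F_{0} \cdot α$ and $F_{i} \cdot β$ are assumed to be given as plain numbers, each part is described by a hypothetical tuple `(anchor, d, candidates)`, and the function names are chosen here; none of this is the implementation of [55].

```python
def deformation_features(placement, anchor):
    """Eq. (28): absolute and squared displacements of a part from its anchor."""
    dx, dy, dt = (abs(p - a) for p, a in zip(placement, anchor))
    return [dx, dy, dt, dx ** 2, dy ** 2, dt ** 2]


def sdpm_score(root_response, parts):
    """Eq. (27): root response plus, per part, the best (response - deformation cost).

    parts is a list of (anchor, d, candidates), where candidates maps a placement
    (x', y', t') to the part filter response at that placement.
    """
    total = root_response
    for anchor, d, candidates in parts:
        total += max(
            response - sum(di * xi for di, xi in
                           zip(d, deformation_features(placement, anchor)))
            for placement, response in candidates.items()
        )
    return total


# One part anchored at (2, 2, 1) with two candidate placements.
parts = [(
    (2, 2, 1),
    [0.1, 0.1, 0.1, 0.05, 0.05, 0.05],
    {(2, 2, 1): 0.5, (3, 2, 1): 0.9},
)]
score = sdpm_score(1.0, parts)
```

In this example the displaced placement wins: its response of 0.9 minus a deformation cost of 0.15 exceeds the response of 0.5 at the anchor, which is the trade-off the max in Eq. (27) resolves for every part.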

5.5. Dense sampling detector

Both Fei-Fei et al. [11] and Jurie et al. [26] have shown that, for natural scene or natural object recognition tasks, a dense sampling detector combined with unsupervised learning or explicit discriminative feature selection usually gives better results than key-point-detector-based approaches, since the latter are unable to select the most informative regions. To select informative regions in three-dimensional video clips, Wang et al. [61] recommended using a dense sampling detector.

Let $V$ be a video clip, $w_{V}$ be its frame width, $h_{V}$ be its frame height, $l_{V}$ be its number of frames, and the quintuple $(x_{i}, y_{i}, t_{i}, σ_{j}, τ_{k})$ be a video block to be extracted from the clip $V$, where $x_{i}$, $y_{i}$ and $t_{i}$ are the horizontal, vertical and temporal positions of the i-th block, $σ_{j}$ is the j-th spatial scale, $τ_{k}$ is the k-th temporal scale, $σ_{1} = 18$ pixels, and $τ_{1} = 10$ frames.

Firstly, set $(x_{1}, y_{1}, t_{1}) = (9, 9, 5)$ , $j \in \{1, 2, \dots, 8\}$ , and $k \in \{1, 2\}$ . Secondly, set

$i \in \{ 1, 2, \dots, (\lfloor 2 w_{V} / σ_{j} \rfloor - 1) (\lfloor 2 h_{V} / σ_{j} \rfloor - 1) (\lfloor 2 l_{V} / τ_{k} \rfloor - 1) \} .$
(29)

Thirdly, for each $(i, j, k)$ , compute $(x_{i}, y_{i}, t_{i}, σ_{j}, τ_{k})$ [61], where consecutive blocks along each axis overlap by 50%, i.e., they are spaced by $σ_{j} / 2$ along $x$ and $y$ and by $τ_{k} / 2$ along $t$, which is the spacing implied by the block count in Eq. (29), and consecutive scales are related by

$σ_{j} = \sqrt{2} \, σ_{j - 1}, \quad τ_{k} = \sqrt{2} \, τ_{k - 1} .$
(30)

Finally, all 3D patches $(x_{i}, y_{i}, t_{i}, σ_{j}, τ_{k})$ with size $σ_{j} \times σ_{j} \times τ_{k}$ are densely sampled.
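The sampling procedure can be sketched as a generator over all scales and positions. This sketch rests on several assumptions: positions are interpreted as block centres (so the first centre at the smallest scale is (9, 9, 5)), the spacing is half the block size along each axis, and the function and parameter names are hypothetical rather than taken from [61].

```python
import math


def dense_blocks(width, height, length, sigma1=18.0, tau1=10.0,
                 n_spatial=8, n_temporal=2):
    """Yield (x, y, t, sigma, tau): block centres and sizes with 50% overlap."""
    for j in range(n_spatial):
        sigma = sigma1 * math.sqrt(2.0) ** j      # sigma_j = sqrt(2) * sigma_{j-1}
        for k in range(n_temporal):
            tau = tau1 * math.sqrt(2.0) ** k      # tau_k = sqrt(2) * tau_{k-1}
            # Number of positions per axis, as in Eq. (29); range() of a
            # non-positive count is empty, so oversized blocks are skipped.
            nx = math.floor(2 * width / sigma) - 1
            ny = math.floor(2 * height / sigma) - 1
            nt = math.floor(2 * length / tau) - 1
            for it in range(nt):
                for iy in range(ny):
                    for ix in range(nx):
                        yield (sigma / 2 * (ix + 1), sigma / 2 * (iy + 1),
                               tau / 2 * (it + 1), sigma, tau)


# A tiny 36x36 clip with 20 frames, sampled at the smallest scale only.
blocks = list(dense_blocks(36, 36, 20, n_spatial=1, n_temporal=1))
```

For the tiny clip in the example, Eq. (29) gives 3 × 3 × 3 = 27 blocks at the smallest scale, the first of which is centred at (9, 9, 5).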

6. Deep learning detectors

6.1. Faster R-CNN detector

During the past decade, convolutional neural networks (CNNs) have driven rapid progress in many research areas, such as machine learning and computer vision. In 2012, Krizhevsky et al. [30] won the image classification task of ILSVRC 2012 with their deep CNN. To bridge the gap between image classification and object detection, Girshick et al. [19] proposed the R-CNN (Regions with CNN features) detector in 2014. After supervised pre-training and domain-specific fine-tuning, R-CNN detects objects of interest in three steps: (1) Take an image as input, generate category-independent region proposals with Selective Search [58], and warp those regions to a fixed size. (2) Extract a fixed-length feature vector from each warped region proposal using the Caffe CNN [1], and use non-maximum suppression together with bounding box regression to return refined region proposals. (3) Classify the extracted feature vectors using pre-trained linear SVMs. R-CNN dramatically improved object detection performance over the previous best result on Pascal VOC 2012.

However, Girshick [18] pointed out that R-CNN is slow because it performs expensive multi-stage training without sharing computation. As a result, fast R-CNN was proposed, which performs less expensive single-stage training based on a multi-task loss. The approach proceeds as follows: (1) Take an image as input and use Selective Search [58] to generate region proposals. (2) Use a CNN with several convolutional and max pooling layers to extract a fixed-length feature vector from each region proposal. (3) Use the CNN to output, for each candidate object class, both a probability estimate and a four-dimensional bounding box position, based on an integrated softmax estimator and an integrated bounding box regressor. Compared with previous work, fast R-CNN improves training and testing speed while also increasing detection accuracy.

Ren et al. [47] found that object proposal extraction is the computational bottleneck of the fast R-CNN detector. They therefore designed a new detector, called faster R-CNN [47], which replaces the time-consuming Selective Search of fast R-CNN with an integrated region proposal network (RPN). Their contributions are twofold: (1) Instead of generating region proposals with a complex graph-based image segmentation approach [15] together with a hierarchical greedy region grouping strategy, the RPN simply collects features from the output of the convolutional layers with a small sliding network that uses 3 scales and 3 aspect ratios, i.e., 9 different sliding windows in total. (2) The integrated RPN shares those convolutional layers with the detection CNN. The authors verified by experiments that the faster R-CNN detector is both more accurate and faster than the strong baseline.
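The 9 sliding windows obtained from 3 scales and 3 aspect ratios can be enumerated as below. This is a hedged sketch: the concrete scales (128, 256, 512 pixels), the ratios (1:2, 1:1, 2:1), the convention that a window of scale s and ratio r has area s² and height-to-width ratio r, and the function name `make_anchors` are assumptions for illustration and do not appear in the text above.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return one (x1, y1, x2, y2) window per (scale, ratio) pair, centred at (cx, cy).

    A window of scale s and ratio r = height / width has area s * s.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# The nine reference windows at one sliding position.
anchors = make_anchors(0.0, 0.0)
```

At every sliding position the same nine windows are reused, shifted to the current centre, so the number of candidate windows grows only with the size of the convolutional feature map.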

The faster R-CNN detector has also been applied to related action and event recognition tasks and has shown excellent performance. Weinzaepfel et al. [63] proposed a weakly-supervised method for action localization, which first uses the faster R-CNN detector to extract human tubes from videos, then describes those tubes with improved dense trajectories [60], and finally uses multi-fold MIL [6] to select the most likely tubes as the action localization.

6.2. T-CNN detector

Hou et al. [22] proposed the tube convolutional neural network (T-CNN) for action detection, a unified deep network based on 3D convolutional network (ConvNet) features [57]. To capture the spatio-temporal information in videos and detect the actions they contain, T-CNN generalizes faster R-CNN [47] from 2D image regions to 3D video tubes. The approach involves three main steps:

Firstly, the input video is divided into equal-length clips of 8 frames, and the clips are fed into the tube proposal network (TPN) [22]. In particular, a set of bounding box proposals is generated for each clip from the conv5 feature cube of the 3D ConvNet. K-means clustering is applied to select the anchor bounding boxes, which allows the anchors to adapt to different datasets. An “actionness” score associated with each bounding box measures how likely its content is to be a valid action. Bounding boxes with actionness scores greater than or equal to a threshold are selected as positive bounding box proposals. Since temporal max pooling reduces the temporal dimension from 8 frames to 1 frame, the temporal order of the original 8 frames is lost, so temporal skip pooling is used to inject temporal ordering for frame-level detection. Tube of interest (ToI) pooling merges variable-sized tube proposals and bounding box proposals into a fixed feature shape. Finally, the TPN outputs the resulting tube proposals, which also mark the spatio-temporal action localization in the input video.

Secondly, considering the actionness scores of the clips and the overlap between adjacent proposals, a score is computed for each sequence of linked tube proposals as follows [22]:

$S = \frac{1}{m} \sum_{i = 1}^{m} Actionness_{i} + \frac{1}{m - 1} \sum_{j = 1}^{m - 1} Overlap_{j, j + 1},$
(31)

where $Actionness_{i}$ denotes the actionness score of the tube proposal from the i-th clip, $Overlap_{j, j + 1}$ measures the overlap between the two linked proposals from the j-th and ($j + 1$)-th clips, and $m$ is the total number of video clips. The tube proposals with the highest scores are linked into a set of sequences, which represent potential action instances.
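Eq. (31) is simple enough to evaluate directly. The sketch below assumes that the per-clip actionness scores and the overlaps between consecutive proposals have already been computed; the function name and the example numbers are illustrative only.

```python
def tube_sequence_score(actionness, overlaps):
    """Eq. (31): mean actionness over m clips plus mean overlap over m - 1 links."""
    m = len(actionness)
    if m < 2 or len(overlaps) != m - 1:
        raise ValueError("need m >= 2 actionness scores and m - 1 overlaps")
    return sum(actionness) / m + sum(overlaps) / (m - 1)


# Three clips: one actionness score per clip, one overlap per pair of adjacent clips.
score = tube_sequence_score([0.9, 0.8, 0.7], [0.5, 0.6])
```

Because both terms are averages, the score does not grow with the number of clips, so sequences of different lengths remain comparable.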

Lastly, the linked tube proposal sequences are sorted by size. Then, to perform spatio-temporal action detection, tube of interest (ToI) pooling [22] is used to extract feature vectors of a fixed shape and a fixed duration from the linked sequences, from which the action label is predicted.

In addition, Hou et al. [22] verified through extensive evaluation on both trimmed and untrimmed videos that T-CNN can classify and localize actions in videos and that it outperforms state-of-the-art methods.

6.3. STEP detector

Spatio-temporal action detection requires recognizing the target action categories in a video and localizing them both spatially and temporally. For deep learning frameworks this brings new challenges. Action tube proposals must account for spatial displacement over time, which makes proposal generation and accurate localization more difficult. In addition, accurate action classification requires a network with robust temporal modeling capability, since many actions can only be recognized when temporal context information is available. To address these challenges, Yang et al. [67] proposed a progressive learning framework for spatio-temporal action detection, called the spatio-temporal progressive (STEP) action detector. It mainly includes the following two steps:

Firstly, starting from the 11 initial proposals that Yang et al. [67] use for each clip, an additional regression branch is trained to achieve adaptive temporal extension through location anticipation, based on the assumption that there is only a small residual between the locations in two adjacent clips. Let x be the feature fed to the output layer f of the location regressor at step s [67], and let

L^{s} = f (x)

(32)

be the regressed location for the current clip. Then

L_{- 1}^{s} = L^{s} + f_{- 1} (x)

(33)

and

L_{+ 1}^{s} = L^{s} + f_{+ 1} (x)

(34)

are the anticipated locations, before decoding, of the proposals $B_{- 1}^{s}$ and $B_{+ 1}^{s}$ in the previous and next clips, where $f_{- 1}$ and $f_{+ 1}$ are the anticipation regressors, which are lightweight and add negligible overhead. The proposal $B^{s}$ is thus extended to adjacent clips to include longer temporal context information. This step addresses the problem that a proposal shifts with the spatial displacement of the action over time.
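The relation between Eqs. (32)–(34) can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the function name `extend_proposal`, the representation of a location as a list of four numbers, and the stand-in regressors that return fixed values instead of learned outputs. It is not the implementation of [67].

```python
def extend_proposal(x, f, f_prev, f_next):
    """Temporal extension by location anticipation, Eqs. (32)-(34).

    x:      feature of the current proposal (anything the regressors accept).
    f:      regressor for the current clip, returns a list of box parameters.
    f_prev: anticipation regressor, returns the residual for the previous clip.
    f_next: anticipation regressor, returns the residual for the next clip.
    """
    loc = f(x)                                          # L^s, Eq. (32)
    loc_prev = [l + r for l, r in zip(loc, f_prev(x))]  # L^s_{-1}, Eq. (33)
    loc_next = [l + r for l, r in zip(loc, f_next(x))]  # L^s_{+1}, Eq. (34)
    return loc_prev, loc, loc_next


# Hypothetical stand-in regressors with fixed outputs.
f = lambda x: [10.0, 20.0, 50.0, 80.0]
f_prev = lambda x: [-2.0, 0.0, -2.0, 0.0]
f_next = lambda x: [2.0, 0.0, 2.0, 0.0]

prev_box, cur_box, next_box = extend_proposal(None, f, f_prev, f_next)
```

The sketch only shows that the neighbouring locations are predicted as small residuals on top of the current location, which is the assumption stated in the text.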

Secondly, two branches are used for spatial refinement: a global branch for action classification on the input clip sequence, and a local branch for per-frame bounding box regression. RoI pooling extracts the regional features, which are fed into the global branch to generate a global feature. The global feature encodes the contextual information of the whole tube and is used to output the classification prediction $p_{i}^{s}$ , where $p_{i}^{s}$ is the probability distribution of the i-th proposal over the action classes plus the background. The local feature is obtained by concatenating the regional feature of each frame with the global feature, and is used to output the class-specific localization regression $l_{i}^{s}$ of the i-th proposal. The two branches are trained jointly, and all proposals are iteratively updated by the two steps above with a greedy algorithm.

In summary, Yang et al. [67] achieved excellent performance on spatio-temporal action detection by using only a small number of proposals in the proposed STEP detector.

6.4. SSN detector

In temporal action detection on untrimmed videos, the detector must not only determine the action category but also decide whether the current action instance is complete, i.e., the start and end times of the action must be detected accurately. In previous detectors, features are often built by average pooling without any stage structure, so every discriminative snippet related to the target action is detected, and a complete action instance cannot be located accurately. Zhao et al. [69] proposed the structured segment network (SSN), a framework that models each action instance with a structured temporal pyramid (STTP). In SSN, each proposal is divided into three stages: starting, course and ending. Two classifiers are then applied, one for category classification and one for deciding whether the proposal is a complete instance. The main steps are as follows:

Each input video V is divided into T snippets, from which a set of N proposals is generated [69], i.e.,

\{ p_{i} = [s_{i}, e_{i}] \}_{i = 1}^{N},

(35)

where $s_{i}$ and $e_{i}$ denote the beginning and end of the i-th proposal, respectively. Zhao et al. [69] augment each initial proposal $p_{i}$ by extending its temporal span, i.e.,

p_{i}^{'} = [s_{i}^{'}, e_{i}^{'}], \quad s_{i}^{'} = s_{i} - (e_{i} - s_{i}) / 2, \quad e_{i}^{'} = e_{i} + (e_{i} - s_{i}) / 2.

(36)

Then each augmented proposal is divided into three stages: starting, course and ending [69], i.e.,

p_{i}^{s} = [s_{i}^{'}, s_{i}], \quad p_{i}^{c} = [s_{i}, e_{i}], \quad p_{i}^{e} = [e_{i}, e_{i}^{'}].

(37)

Finally, the temporal pyramid is used to compute the feature vector [69]

f_{i} = [f_{i}^{s}, f_{i}^{c}, f_{i}^{e}]

(38)

of the three stages, which are built on the two-stream feature representation [51] and concatenated into a global feature representation. The course stage usually contains the richest action content, so a two-level pyramid is used specifically for the course stage.
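The proposal augmentation of Eq. (36) and the stage split of Eq. (37) amount to a few lines of arithmetic. The sketch below is only an illustration; the function name `split_into_stages` and the use of plain numbers for the start and end times are assumptions.

```python
def split_into_stages(s, e):
    """Augment a proposal [s, e] and split it into three stages.

    Follows Eq. (36): the span is extended by half its length on each side,
    and Eq. (37): starting = [s', s], course = [s, e], ending = [e, e'].
    """
    half = (e - s) / 2
    s_aug = s - half
    e_aug = e + half
    return {
        "starting": (s_aug, s),
        "course": (s, e),
        "ending": (e, e_aug),
    }


# A proposal covering seconds 10 to 20 (toy values).
stages = split_into_stages(10.0, 20.0)
```

With these toy values the augmented proposal spans 5 to 25, so the starting and ending stages are each half as long as the course stage.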

The action class classifier restricts its scope to the course stage, so the $f_{i}^{c}$ obtained in the previous step is used to predict one of $K + 1$ classes (K action classes plus the background). The scores are normalized by a softmax layer to output the probability $P (c_{i} | p_{i})$ , where $c_{i}$ is the class label of proposal $p_{i}$ . In the completeness classifier, a set of binary classifiers, one per action category, operates on the global feature representation and outputs $P (b_{i} | c_{i}, p_{i})$ , where $b_{i}$ indicates whether $p_{i}$ contains a complete action instance. When $c_{i} ⩾ 1$ , the two outputs are combined into a joint probability.
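One way to read the combination of the two classifier outputs is as a product, $P (c_{i} | p_{i}) \cdot P (b_{i} | c_{i}, p_{i})$ , for every non-background class. The Python sketch below follows that reading; the product form, the function names and the toy scores are assumptions made for illustration rather than details confirmed by the text above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [v / total for v in exps]


def joint_scores(activity_scores, completeness_probs):
    """Combine the two classifier outputs for classes c = 1..K.

    activity_scores:    K + 1 raw scores, index 0 is the background.
    completeness_probs: K values P(b | c, p), one per action class c = 1..K.
    Returns K joint values P(c | p) * P(b | c, p) (assumed product form).
    """
    p_class = softmax(activity_scores)
    return [p_class[c] * completeness_probs[c - 1]
            for c in range(1, len(activity_scores))]


# Toy example with K = 2 action classes and equal activity scores.
joint = joint_scores([0.0, 0.0, 0.0], [0.9, 0.3])
```

In this toy case both classes are equally likely as categories, so the ranking is decided entirely by the completeness output.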

The SSN detector proposed by Zhao et al. [69] outperformed the previous state of the art by using the STTP and two classifiers that predict the activity category and the action completeness, respectively. The SSN detector not only locates the temporal boundaries of complete action instances more accurately, but can also be applied to action categories with diverse temporal structures.

6.5. TACNet detector

Two factors are important for improving the accuracy of spatio-temporal action detection: extracting effective long-term action context information, and enabling the model to learn the actual action states rather than ambiguous action states (states that occur just before or after the target action and resemble it). The earlier method proposed by Kalogeiton et al. [27] can only extract short-term context of up to 10 frames and cannot handle ambiguous action states: if an ambiguous action state is treated as either the target action state or the background during training, the detection performance of the model drops considerably. Song et al. [52] defined this ambiguous action state as the “transition state” and proposed the Transition-Aware Context Network (TACNet), whose two components, a temporal context detector and a transition-aware classifier, address these two problems.

Song et al. [52] designed the temporal context detector on top of the standard SSD [38] framework and used the same two-stream SSD structure as the action tubelet detector (ACT) [27] to build the action detection pipeline. The difference is that Song et al. [52] embed a bi-directional Conv-LSTM (Bi-ConvLSTM) [36] between every two adjacent feature scales to construct a recurrent detector capable of extracting temporal contextual features for action detection. To capture context in both the forward and backward directions of the input video, the Bi-ConvLSTM used by Song et al. [52] is a temporally symmetric pair of ConvLSTMs. They changed the activation function of the Bi-ConvLSTM from tanh to ReLU to obtain better performance, and used a $1 \times 1$ convolution to concatenate and transform the pair of features extracted by the Bi-ConvLSTM and eliminate channel redundancy.

The main purpose of the transition-aware classifier is to classify both the action category and the action state, so that it can tell whether the current motion state is a “transition state”. To resolve the optimization conflict and coupling between the training objectives of action classification and action state classification, Song et al. [52] used $c_{i}^{+} + c_{i}^{-}$ and $c_{i}^{+} - c_{i}^{-}$ to predict the action category and the action state, respectively, i.e.,

p^{i} = \frac{e^{c_{i}^{+} + c_{i}^{-}}}{\sum_{j \in [0, K]} e^{c_{j}^{+} + c_{j}^{-}}}

(39)

and

t^{i} = \frac{e^{c_{i}^{+} - c_{i}^{-}}}{e^{c_{i}^{+} - c_{i}^{-}} + 1},

(40)

where $p^{i}$ is the probability of action category i with $i \in [0, K]$ , $t^{i}$ is the probability of the target action state with $i \in [1, K]$ (so the probability of the transition state is $1 - t^{i}$ ), K is the number of action categories, and $c_{0}^{+}$ represents the background category. To label the transition state, Song et al. [52] proposed a “simple-mining” strategy for the positive samples of the transition state, i.e., a sample is labeled as a transition state when its predicted score satisfies $c_{i}^{+} > c_{0}^{+}$ .
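Eqs. (39) and (40) are a softmax over the sums $c_{i}^{+} + c_{i}^{-}$ and a logistic function of the differences $c_{i}^{+} - c_{i}^{-}$ . The following Python sketch computes both from two score lists; the function name `tacnet_probs`, the list layout (index 0 for the background) and the toy scores are assumptions for illustration only.

```python
import math


def tacnet_probs(c_pos, c_neg):
    """Category and action-state probabilities following Eqs. (39)-(40).

    c_pos, c_neg: lists of K + 1 scores, index 0 is the background.
    Returns (p, t): p has K + 1 entries (categories 0..K),
                    t has K entries (action state for categories 1..K).
    """
    sums = [a + b for a, b in zip(c_pos, c_neg)]
    top = max(sums)                                   # for numerical stability
    exps = [math.exp(v - top) for v in sums]
    total = sum(exps)
    p = [v / total for v in exps]                     # Eq. (39)

    # e^d / (e^d + 1) equals 1 / (1 + e^(-d)), the logistic function.
    t = [1.0 / (1.0 + math.exp(-(a - b)))             # Eq. (40)
         for a, b in zip(c_pos[1:], c_neg[1:])]
    return p, t


# Toy scores for K = 2 categories plus background.
p, t = tacnet_probs([0.0, 1.0, 2.0], [0.0, 1.0, -2.0])
```

Because the category uses the sum and the state uses the difference, a box can score high for a category while still being flagged as a likely transition state, which is the decoupling described above.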

TACNet showed excellent performance on the two public datasets UCF101-24 and J-HMDB-21 and surpassed the state of the art. Song et al. [52] also embedded the temporal context detector and the transition-aware classifier into other detectors, i.e., SSD and Deconvolutional SSD (DSSD) [17], where they again achieved better performance than the original detectors.

6.6. Privacy-preserving detector

A privacy-preserving detector proposed by Ren et al. [49] learns to obscure faces in videos while preserving the accuracy of action detection, which gives it an advantage over traditional hand-crafted anonymization methods. In this way, the facial privacy information is protected, and the action can still be identified by the detector.

The network framework of the privacy-preserving detector mainly includes a face modifier from Johnson et al. [25], a spatial action detector, and a face identity classifier. The face modifier changes the face image in the video. The spatial action detector detects the action in the video. Besides, the face identity classifier uses an adversarial classification loss to ensure that the modified face cannot be identified. The specific process of the privacy-preserving detector is as follows:

(1) Face detection is performed on the input frames of a given video set to obtain face regions and face images. The detected face regions are fed into the face modifier [25], which turns them into new face images.

(2) The spatial action detector uses a Fast R-CNN [48] network to detect the actions in the video accurately, which yields the action detection loss.

(3) The face identity classifier performs adversarial classification to ensure that the modified face cannot be identified as the real identity. The classifier is trained with the angular softmax loss [39], which provides high face verification accuracy and continuously drives the optimization of the anonymized faces. In addition, the modified video is constrained by a photo-realistic loss, the L1 loss [23,70], which preserves human-recognizable information such as the scene and the actions.

(4) The three loss functions are added to form the final objective. Optimizing this objective iteratively updates the face modifier and the action detector and then continuously adjusts the face classifier, so as to achieve privacy protection.

6.7. Coarse-to-fine action detector

In previous spatio-temporal action detection methods, most approaches detect actions frame by frame and then link the per-frame results, so only local information is used and the frame-by-frame computation is very inefficient. Li et al. [35] proposed an end-to-end trainable framework, called the coarse-to-fine action detector (CFAD), which detects spatio-temporal actions efficiently. Its key idea is to first estimate a coarse spatio-temporal tube for the video stream and then refine the tube at key timestamps. The two main modules that realize this idea are the coarse module and the refine module. The specific steps are as follows:

In the coarse module, Li et al. [35] define each action instance in the input video stream as a set

A = \{ (t_{j}, b_{j}) \mid j = 0, \dots, T_{a} - 1 \},

(41)

where $t_{j}$ is the timestamp of a frame, $b_{j}$ consists of the four parameters $x_{j}$ , $y_{j}$ , $w_{j}$ and $h_{j}$ of the corresponding action bounding box in that frame, and $T_{a}$ is the total number of bounding boxes in the ground-truth tube. They used a 3D-CNN to extract the spatio-temporal features of the input, and then obtained proposals carrying only temporal information through the temporal proposal network (TPN). In the coarse module, two convolution branches are designed to handle the temporal residual information and the spatial information, respectively. Finally, the head module, built from a 3D-CNN and a Non-Local Block [62], fuses the outputs of the two branches through average pooling. Li et al. [35] estimate the coarse action tube by parametric modeling: the mapping $A^{'} (t; θ) \to R^{4}$ gives the estimated coarse spatial position at time t, i.e.,

A^{'} (t; θ) = [x (t), y (t), w (t), h (t)],

(42)

where t is the normalized timestamp and θ denotes the trajectory parameters. The change of an action instance along its tube is usually smooth and gradual, so a high-order polynomial [35] is used to model it, i.e.,

A^{'} (t; θ) = [θ_{x}^{T} \mathbf{t}, θ_{y}^{T} \mathbf{t}, θ_{w}^{T} \mathbf{t}, θ_{h}^{T} \mathbf{t}],

(43)

where the vector $\mathbf{t} = [1, t, t^{2}, \dots, t^{k}]^{T}$ contains the powers of the current timestamp up to the polynomial order k. It is worth noting that the absolute coordinates are not estimated directly; instead, the coordinates relative to the matched bounding box are estimated.
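To make Eq. (43) concrete, the sketch below evaluates a polynomial tube at one normalized timestamp. It assumes that the vector t stacks the powers $1, t, t^{2}, \dots$ of the timestamp and that each of the four parameter vectors stores its coefficients from lowest to highest order; the function name `coarse_box` and the coefficient values are hypothetical.

```python
def coarse_box(t, theta):
    """Evaluate the polynomial tube of Eq. (43) at normalized timestamp t.

    theta: dict mapping "x", "y", "w", "h" to coefficient lists,
           lowest order first (assumed layout).
    Returns [x(t), y(t), w(t), h(t)].
    """
    box = []
    for name in ("x", "y", "w", "h"):
        coeffs = theta[name]
        powers = [t ** k for k in range(len(coeffs))]   # [1, t, t^2, ...]
        box.append(sum(c * p for c, p in zip(coeffs, powers)))
    return box


# Hypothetical second-order trajectory: x drifts linearly, w grows slowly.
theta = {
    "x": [0.2, 0.4, 0.0],
    "y": [0.5, 0.0, 0.0],
    "w": [0.3, 0.0, 0.1],
    "h": [0.4, 0.0, 0.0],
}
box_mid = coarse_box(0.5, theta)
```

A handful of coefficients per coordinate describes the box at every timestamp, which is why the tube can be estimated without running a detector on each frame.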

After $A^{'} (t; θ)$ has been computed, its position is further refined in the refine module. Li et al. [35] use key timestamps and the coarse tube for refinement. They designed an additional selection network for dynamic sampling of key timestamps: a 1D hourglass network compresses the input features and outputs an importance score $p_{i}$ for each timestamp; when $p_{i} ⩾ α$ , the timestamp is marked as a key timestamp and refined in the next step. Finally, the action tube, obtained by interpolating between the refined boxes and the unrefined boxes, is output together with the action classification score.

The CFAD proposed by Li et al. [35] not only introduces a new paradigm but also avoids dense frame-by-frame detection, which makes it a very efficient detector. It achieved state-of-the-art results on benchmarks such as UCF101-24 and JHMDB-21 and also tripled the detection speed.

6.8. TxVAD detector

Due to the complexity of video datasets, most detector-based action recognition methods of recent years are very complex and contain many task-specific components, such as a person detector and a region proposal network (RPN). The transformer was designed to process sequence data and was first used in the field of natural language processing [59]. Since the transformer has been shown to outperform convolutional neural network-based methods in the image field, it has received more and more attention in computer vision [10]. Video is a kind of time series data. To exploit the temporal modeling capability of the transformer, Wu et al. [65] proposed a simple framework without such specific components, named the transformer-based video action detector (TxVAD).

TxVAD consists of three parts: a 3D-CNN backbone and two pure transformers for action localization and action recognition. Given a consecutive sequence of (2L + 1) video clips, a 3D-CNN backbone (I3D [5] or SlowFast [12]) computes the feature map $F \in R^{(2 L + 1) \times T \times D \times H \times W}$ , where T, D, H and W stand for the temporal, transformer model, height and width dimensions, respectively. The feature map $F (i, ⌊ T / 2 ⌋, \dots)$ , which represents the middle frame of the i-th video clip, is input to the person transformer (PTx) to obtain bounding boxes for localization. For the i-th clip, the 4D feature map $F (i, \dots)$ is averaged over the temporal dimension by temporal pooling, so that the 5D feature map F is transformed into the 4D feature map $F_{t} \in R^{(2 L + 1) \times D \times H \times W}$ . Like the RoI pooling of Faster R-CNN, spatio-temporal RoI pooling (ST-RoI-Pool) is used to extract the features Q of the persons of interest, which are enclosed by the bounding boxes from PTx. Finally, the feature map $F_{t}$ , Q and the result computed by PTx and ST-RoI-Pool for the center frame of the center video clip are fed into the action transformer (ATx) to classify the action. To stabilize training and improve validation performance, TxVAD adopts a hardness-aware curriculum training strategy: it first trains on easy samples and then on progressively harder samples until all samples have been used.

TxVAD removes many complex components and offers a simple framework for detector-based action recognition, but it merely replaces the person detector or RPN with a transformer and still requires RoI pooling. In addition, TxVAD is difficult to optimize because it contains two transformers, and its training strategy is not end-to-end.

6.9. TubeR detector

An end-to-end approach is able to reduce the specific components of the model, thereby reducing the complexity of the model. Detection transformer (DETR) [2] is an end-to-end object detector that has inspired many computer vision researchers. Zhao et al. [68] proposed tubelet transformer (TubeR), which extended DETR from image data processing to video data processing.

TubeR, like DETR, converts the visual task into a sequence-to-sequence task, which is what the transformer is good at. Unlike TxVAD, TubeR consists only of a 3D-CNN backbone and a single transformer that performs both action localization and classification. Given a video clip of $T_{in}$ frames, which is first divided into a series of patches of size $l \times s \times s$ , the 3D-CNN backbone extracts the video feature $F_{b} \in R^{T_{b} W H \times C}$ . After position encoding is added, $F_{b}$ is passed through TubeR’s encoder–decoder, and the video features are converted into action tube features $F_{t u b} \in R^{T_{o} \times N \times C}$ . From $F_{t u b}$ , the regression head predicts the positions of the action tubes and discards tubes that do not exist, and the action head predicts the action category of each tube. $T_{b}$ , C, W, H, $T_{o}$ and N denote the backbone temporal, feature, width, height, output temporal and number-of-action-tubes dimensions, respectively.

Table 2
Comparison of deep learning detectors.

Reported paper Year Method GFlops UCF 101-24 J-HMDB-21 AVA

CNN [47] Ren et al. 2015 Faster R-CNN detector - - - -

[49] Ren et al. 2016 Privacy-preserving detector - - - -

[69] Zhao et al. 2017 SSN detector - - - -

[22] Hou et al. 2017 T-CNN detector - - 76.9 -

[52] Song et al. 2019 TACNet detector - 72.1 73.4 -

[67]} Yang et al. 2019 STEP detector - 75.0 - 18.6

[35] Li et al. 2020 Coarse-to-fine action detector - 69.7 83.7 -

Transformer [65] Wu et al. 2022 TxVAD detector 47.5 79.5 79.8 34.0

[68] Zhao et al. 2022 TubeR detector 240 83.2 82.3 33.6

Type	Reported paper	Year	Method	GFlops	UCF 101-24	J-HMDB-21	AVA
CNN	[47] Ren et al.	2015	Faster R-CNN detector	-	-	-	-
CNN	[49] Ren et al.	2016	Privacy-preserving detector	-	-	-	-
CNN	[69] Zhao et al.	2017	SSN detector	-	-	-	-
CNN	[22] Hou et al.	2017	T-CNN detector	-	-	76.9	-
CNN	[52] Song et al.	2019	TACNet detector	-	72.1	73.4	-
CNN	[67] Yang et al.	2019	STEP detector	-	75.0	-	18.6
CNN	[35] Li et al.	2020	Coarse-to-fine action detector	-	69.7	83.7	-
Transformer	[65] Wu et al.	2022	TxVAD detector	47.5	79.5	79.8	34.0
Transformer	[68] Zhao et al.	2022	TubeR detector	240	83.2	82.3	33.6

TubeR uses a single transformer to solve both person localization and action classification, instead of requiring two transformers like TxVAD, which makes it a more thorough end-to-end approach than TxVAD. Despite its excellent performance, TubeR still has some shortcomings. It uses a 3D-CNN backbone to process low-level information and a transformer to process high-level semantic information. The computing overhead of TubeR mainly comes from the 3D-CNN backbone, and because it classifies actions on the basis of action tubes, its memory overhead grows rapidly as the video gets longer. These two shortcomings limit the use of TubeR on long videos.

Table 2 compares the deep learning detectors discussed in Section 6. It reports the frame-level mean average precision at an IoU threshold of 0.5 for the UCF101-24 and AVA datasets, and the video-level mean average precision at the same IoU threshold for J-HMDB-21.
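For readers unfamiliar with the IoU criterion, the following Python sketch shows how a single frame-level detection would be matched against a ground-truth box at a threshold of 0.5. The corner-based box format `(x1, y1, x2, y2)`, the function names and the toy boxes are assumptions for illustration; the full mean average precision computation (ranking detections and averaging over classes) is not shown.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_label, gt_box, gt_label, iou_threshold=0.5):
    """A detection counts as correct if the label matches and IoU >= threshold."""
    return pred_label == gt_label and box_iou(pred_box, gt_box) >= iou_threshold


gt = (0.0, 0.0, 10.0, 10.0)
hit = is_true_positive((0.0, 0.0, 10.0, 5.0), "run", gt, "run")     # IoU = 0.5
miss = is_true_positive((5.0, 5.0, 15.0, 15.0), "run", gt, "run")   # IoU = 1/7
```

For the video-level metric the same idea is applied to whole tubes rather than single boxes, so the overlap has to hold over time as well as in space.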

7. Conclusion

Detectors, mainly feature detectors, are used in traditional action and event recognition. Traditional feature detectors, such as 2D detectors and 3D detectors, aim to detect points, lines or corners. With the development of convolutional neural networks in recent years, deep learning-based detectors have been used extensively in action and event recognition. They are quite different from traditional methods. Action and event recognition methods based on deep learning tend to integrate feature extractors (including feature detectors) and classifiers into a single end-to-end method. As a result, feature detectors are explicitly mentioned less and less often in related papers. Thanks to the strong generalization ability achieved through a series of convolution kernels, deep learning-based detectors can readily detect whole regions of actions or events. Traditional feature detectors require features to be designed by hand for each task. Although they may be computationally efficient in some simple scenarios, they are less accurate in complex scenarios. Detectors based on deep learning are versatile and highly accurate in complex scenes, but they have high data requirements. Because action detection locates the region where the action occurs and reduces background interference, action and event recognition with detection is more accurate than recognition alone. Eliminating distractions is important, but it is equally important to ensure that enough information is retained. While improving detector performance, attention must also be paid to privacy protection in the training of deep learning algorithms, because the powerful detection ability of deep learning may lead to privacy leakage.
In recent years, the emergence of the transformer has made it possible to design simpler end-to-end frameworks for video datasets with multi-person scenes. To sum up, we believe that detector-based action and event recognition algorithms can be developed along the following three directions: (1) detectors and privacy protection, (2) action and event recognition based on transformers, and (3) end-to-end methods. We hope that this review helps the development of action and event recognition algorithms, so that they become more intelligent and better serve intelligent monitoring, elderly care and other services.

Acknowledgements

We would like to thank anonymous reviewers for helpful comments.

References

Caffe

Y.J.

, An open source convolutional architecture for fast feature embedding, 2013, http://caffe.berkeleyvision.org/.

Carion

et al., End-to-End Object Detection with Transformers, European Conference on Computer Vision, Springer, Cham, 2020.

Carreira

et al., A short note about kinetics-600, 2018, arXiv preprint arXiv:1808.01340.

Carreira

et al., A short note on the kinetics-700 human action dataset, 2019, arXiv preprint arXiv:1907.06987.

Carreira

Zisserman

, Quo vadis, action recognition? A new model and the kinetics dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Cinbis

R.G.

Verbeek

Schmid

, Weakly supervised object localization with multi-fold multiple instance learning, TPAMI (2016).

Damen

et al., Scaling egocentric vision: The epic-kitchens dataset, in: Proceedings of the European, Conference on Computer Vision (ECCV), 2018.

Damen

et al., Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100, International Journal of Computer Vision 1(23) (2022).

Dollár

Rabaud

Cottrell

Belongie

, Behavior recognition via sparse spatio-temporal features, in: 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance(VS-PETS), 2005.

10.

Dosovitskiy

et al., An image is worth 16x16 words: Transformers for image recognition at scale, 2020, arXiv preprint arXiv:2010.11929.

11.

Fei-Fei

Perona

, A Bayesian hierarchical model for learning natural scene categories, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

12.

Feichtenhofer

et al., Slowfast networks for video recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.

13.

Feichtenhofer

Pinz

Zisserman

, Convolutional two-stream network fusion for video action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

14.

Felzenszwalb

P.F.

Girshick

R.B.

McAllester

Ramanan

, Object detection with discriminatively trained part-based models, in: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 32, 2010, pp. 1627–1645.

15.

Felzenszwalb

P.F.

Huttenlocher

D.P.

, Efficient graph-based image segmentation, International journal of computer vision (IJCV), 59 (2004), 167–181.

16.

Felzenszwalb

P.F.

McAllester

Ramanan

, A discriminatively trained, multiscale, deformable part model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.

17.

C.Y.

Liu

Ranga

Tyagi

Berg

A.C.

, Dssd: Deconvolutional single shot detector, 2017, arXiv preprint arXiv:1701.06659.

18.

Girshick

, Fast r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.

19.

Girshick

Donahue

Darrell

Malik

, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

20.

Sun

Ross

D.A.

et al., Ava: A video dataset of spatio-temporally localized atomic visual actions, in: Proceedings of the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

21.

Harris

Stephens

M.J.

, A combined corner and edge detector, Alvey Vision Conf. 15 (1988), 147–151.

22.

Hou

Chen

Shah

, Tube Convolutional Neural Network (T-CNN) for action detection in videos, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5822–5831.

23.

Isola

Zhu

J.Y.

Zhou

Efros

A.A.

, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

24.

Jhuang

Gall

Zuffi

et al., Towards understanding action recognition, in: Proceedings of the Proceedings of the IEEE International Conference on Computer Vision, 2013.

25.

Johnson

Alahi

Fei-Fei

, Perceptual losses for real-time style transfer and super-resolution, in: European Conference on Computer Vision, Springer, Cham, 2016.

26.

Jurie

Triggs

, Creating efficient codebooks for visual recognition, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2005.

27.

Kalogeiton

Weinzaepfel

Ferrari

Schmid

, Action tubelet detector for spatio-temporal action localization, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4405–4413.

28.

Kang

S.M.

Wildes

R.P.

, Review of action recognition and detection methods, 2016, arXiv preprint arXiv:1610.06906.

29.

Klaser

Marszałek

Schmid

, A spatio-temporal descriptor based on 3d-gradients, in: BMVC 2008-19th British Machine Vision Conference, 2008.

30.

Krizhevsky

Sutskever

Hinton

, ImageNet classification with deep convolutional neural networks, neural information processing systems (NIPS) (2012).

31.

Kuehne

Jhuang

Garrote

et al., HMDB: A large video database for human motion recognition, in: Proceedings of the 2011 International Conference on Computer Vision, 2011.

32.

Laptev

, On space-time interest points, International journal of computer vision (IJCV), 64(2–3) (2005), 107–123.

33.

Laptev

Lindeberg

, Space-time interest points, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2003.

34.

et al., The ava-kinetics localized human actions video dataset, 2020, arXiv preprint arXiv:2005.00214.

35.

Lin

See

Yan

Yang

, CFAD: Coarse-to-fine action detector for spatiotemporal action localization, in: European Conference on Computer Vision (ECCV), 2020, pp. 510–527.

36.

Gavrilyuk

Gavves

Jain

Snoek

C.G.

, Videolstm convolves, attends and flows for action recognition, Computer Vision and Image Understanding 166(41–50) (2018), 41–50. doi:https://doi.org/10.1016/j.cviu.2017.10.011.

37.

Lindeberg

, Feature detection with automatic scale selection, International journal of computer vision (IJCV), 30 (1998), 79–116.

38.

Liu

Anguelov

Erhan

Szegedy

Reed

C.Y.

Berg

A.C.

, Ssd: Single shot multibox detector, in: Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 21–37.

39.

Liu

Wen

Raj

Song

, Sphereface: Deep hypersphere embedding for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

40.

Lowe

D.G.

, Object recognition from local scale-invariant features, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), Vol. 2, 1999, pp. 1150–1157.

41.

Lowe

D.G.

, Distinctive image features from scale-invariant keypoints, International journal of computer vision (IJCV), 60 (2004), 91–110.

42.

Mazzia

et al., Action transformer: A self-attention model for short-time pose-based human action recognition, Pattern Recognition 124 (2022), 108487. doi:https://doi.org/10.1016/j.patcog.2021.108487.

43.

Mikolajczyk

et al., A comparison of affine region detectors, International journal of computer vision (IJCV), 65(1–2) (2005), 43–72.

44.

Mikolajczyk

Schmid

, An affine invariant interest point detector, in: European Conference on Computer Vision (ECCV), 2002, pp. 128–142.

45.

Mikolajczyk

Schmid

, Scale & affine invariant interest point detectors, International journal of computer vision (IJCV), 60 (2004), 63–86.

46.

Oneata

Verbeek

Schmid

, Action and event recognition with Fisher vectors on a compact feature set, in: Proceedings of the IEEE International Conference on Computer Vision, 2013.

47.

Ren

Girshick

Sun

, Faster R-CNN: Towards real-time object detection with region proposal networks, neural information processing systems (NIPS) (2015), 91–99.

48.

Ren

Lee

Y.J.

, Cross-domain self-supervised multi-task feature learning using synthetic imagery, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

49.

Ren

Lee

Y.J.

Ryoo

M.S.

, Learning to anonymize faces for privacy preserving action detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.

50.

Rosten

Drummond

, Machine learning for high-speed corner detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2006, pp. 430–443.

51. Simonyan, Zisserman, Two-stream convolutional networks for action recognition in videos, in: NIPS, 2014, pp. 568–576.

52. Song, Zhang, Sun, Transition-aware context network for spatio-temporal action detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11987–11995.

53. Soomro, Zamir A.R., Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, 2012, arXiv preprint arXiv:1212.0402.

54. Szeliski, Computer Vision: Algorithms and Applications, Springer, 2010.

55. Tian, Sukthankar, Shah, Spatiotemporal deformable part models for action detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

56. Tong et al., VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training, Advances in Neural Information Processing Systems 35 (2022), 10078–10093.

57. Tran, Bourdev, Fergus, Torresani, Paluri, Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.

58. Uijlings, van de Sande, Gevers, Smeulders, Selective search for object recognition, International Journal of Computer Vision (IJCV) (2013).

59. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł., Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017.

60. Wang, Oneata, Verbeek, Schmid, A robust and efficient video representation for action recognition, International Journal of Computer Vision (IJCV) (2015).

61. Wang, Ullah M.M., Klaser, Laptev, Schmid, Evaluation of local spatio-temporal features for action recognition, in: BMVC 2008-19th British Machine Vision Conference, 2009, pp. 124.1–124.11.

62. Wang, Girshick, Gupta, Non-local neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7794–7803.

63. Weinzaepfel, Martin, Schmid, Towards weakly-supervised action localization, 2016, arXiv preprint arXiv:1605.05197.

64. Willems, An efficient dense and scale-invariant spatio-temporal interest point detector, in: European Conference on Computer Vision (ECCV), 2008.

65. et al., TxVAD: Improved video action detection by transformers, in: Proceedings of the 30th ACM International Conference on Multimedia, 2022.

66. Yang et al., Recurring the transformer for video action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

67. Yang, Liu M.Y., Xiao, Davis L.S., Kautz, STEP: Spatio-temporal progressive learning for video action detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 264–272.

68. Zhao et al., TubeR: Tubelet transformer for video action detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

69. Zhao, Xiong, Wang, Tang, Lin, Temporal action detection with structured segment networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2914–2923.

70. Zhu J.Y., Park, Isola, Efros A.A., Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017.