Machine learning for video event recognition

Abstract

In recent years, the spread of video sensor networks both in public and private areas has grown considerably. Smart algorithms for video semantic content understanding are increasingly developed to support human operators in monitoring different activities, by recognizing events that occur in the observed scene. With the term event, we refer to one or more actions performed by one or more subjects (e.g., people or vehicles) acting within the same observed area. When these actions are performed by subjects that do not interact with each other, the events are usually classified as simple. Instead, when any kind of interaction occurs among subjects, the involved events are typically classified as complex. This survey starts by providing the formal definitions of both scene and event, and the logical architecture for a generic event recognition system. Subsequently, it presents two taxonomies based on features and machine learning algorithms, respectively, which are used to describe the different approaches for the recognition of events within a video sequence. This paper also discusses key works of the current state-of-the-art of event recognition, providing the list of datasets used to evaluate the performance of reported methods for video content understanding.

Keywords

Machine learning, event recognition, video analysis, image processing, behaviour understanding

1. Introduction

Nowadays, video sequences are commonly available from various sensors and frequently used in different domains. In the field of computer vision, visual-based event recognition systems are attracting great interest in a wide range of application areas, including image processing [1, 2, 3, 4], change detection [5, 6], health care [7, 8, 9, 10, 11], video surveillance [12, 13], and many others. Some of the most important applications are crowd and abnormal behaviour analysis, traffic monitoring, and crime prevention [14, 15]. Video sensor networks typically serve as tools that let human operators observe areas of interest remotely. The technological advancement of video cameras in terms of spatial resolution (e.g., 2K or 4K), temporal resolution (e.g., 30 fps or 60 fps), type of sensor (e.g., RGB or RGB-D) and movement (i.e., pan, tilt, and zoom) has led the computer vision community to design smart algorithms capable of understanding the semantic content of videos [16, 17] and actively assisting human operators in scene monitoring [18, 19, 20]. Understanding video content means discerning the behaviour of moving subjects that interact with each other or with motionless elements in the observed scene. The analysis of unmoving objects is relevant because the same gesture can correspond to different actions depending on such elements. In any case, automatic event recognition remains an ambitious and complex task, due to both the challenge of correctly interpreting the actions performed by the observed subjects and the presence of disturbing factors (e.g., occlusions or illumination changes). The latter are not addressed in detail, as they are not a focus of the present survey; however, the limits of the state-of-the-art methods discussed are reported.

The analysis of action sequences performed by subjects within an observed scene is the first step towards automatically understanding the different events that may occur. Formally, in the literature, events are defined as real-world occurrences unfolding over space and time, yielded by subjects in the scene [21, 22]. With regard to event recognition in videos, events can be generally defined as one or more actions performed by subjects within the scene during a certain time interval. Some examples of events are: a person opening a door, two persons shaking hands, a car driving along a road, and so on. Depending on the number of subjects involved and the type of interactions, events can be classified as simple or complex [23, 24]. Simple events are defined as a set of primitive actions (e.g., walking or sitting) performed by one subject. Complex events are defined as a series of simple events, spatially and temporally correlated, in which two or more subjects interact with each other. The recognition of events is generally performed by using two well-known branches of computer science: computer vision and machine learning. The former makes it possible to extract relevant information (i.e., features) from the analysed video frames, while the latter makes it possible to learn and, subsequently, infer the semantic meaning of the video by relying on the previously extracted features. Notice that, once trained to understand a specific set of events, a recognition system should work in real time. This aspect is a key factor in many application fields, including traffic monitoring and public security.

In the literature, several surveys have explored human behaviour in different contexts, such as interaction with social media (e.g., Twitter or Facebook) [25]. In recent years, methods based on the visual recognition of events triggered by moving subjects have been proposed as well [26, 27]. Nonetheless, the current state-of-the-art lacks several key elements: formalizations of both scene and event, a practical reference for those wanting to quickly understand the main aspects of an event recognition system (hereinafter, ER system), and complete guidelines on the development of smart algorithms for this type of task. Therefore, unlike previous works, the proposed survey introduces two taxonomies based on features and machine learning algorithms, respectively, that classify the current approaches used to handle events in videos. The paper also provides a careful analysis of selected works, summarizing their main characteristics (i.e., features, algorithms, type of events and datasets) in concise tables. Following a bottom-up approach, the proposed survey starts by formally defining scene and event, continues by reasoning on event effects, and then describes the logical architecture of a generic ER system. In particular, the paper is structured as follows: Section 2 provides formalizations of both scene and event, reasons on event effects, and presents the aforementioned logical architecture of a generic ER system; Section 3 introduces the two taxonomies and highlights the role of the considered features and machine learning algorithms; Section 4 provides a set of selected key works in the current state-of-the-art, also summarized, with some observations, in concise tables; Section 5 discusses the features and machine learning algorithms used for the automatic event recognition task. Finally, Section 6 concludes the paper with observations on the contributions of the proposed survey.

2. Key elements of an event recognition system

This section provides the formal definitions of a scene and of simple and complex events. Moreover, it introduces the formalism to represent events and reason about them and their effects by using the dialect of event calculus for run-time reasoning (RTEC) [28], based on event calculus [29]. With such tools, every ER method can be translated into a formal framework. Finally, it describes a logical reference architecture highlighting the main components of a generic ER system.

2.1 Scene definition

Events are triggered by different subjects that act in an environment defined as a scene. A scene can be classified as indoor scene or outdoor scene [30]:

• an indoor scene is defined as a covered area of any shape and type. Typical examples of an indoor scene are rooms, hangars, stations, and airports, as well as the interior of a bus, a train, an airplane, and so on;
• an outdoor scene is defined as an open area of any shape and type. Typical examples of an outdoor scene are port areas, airstrips, outdoor parking lots, runways, border areas, city centres (e.g., squares, streets or parks), and others.

Both classes of scene are described by a set of main properties, including the environment (e.g., natural or artificial) and noise (e.g., intrinsic or extrinsic). Concerning the environment, any type of scene is usually characterized by different static objects that are independent of the moving ones acting in it. In outdoor scenes, some examples are parking booths, street lamps, and trees; in indoor scenes, examples include walls and furniture. In any case, these objects have several attributes (e.g., dimension or shape) that should be taken into account during the design process of an ER system. Concerning noise, in any type of scene, intrinsic noises are defined as changes that ordinarily occur within the scene itself. Some examples in outdoor scenes are the movement of a flying flag, the opening and closing of an electric gate, illumination changes, and shadows. Similar examples can be reported in indoor scenes. However, this type of noise is usually well-handled and does not influence the capability of a system to recognize events. In fact, intrinsic noises are characterized by repetitive and well-known patterns that can be removed by using popular algorithms (e.g., Gaussian models [31]). Extrinsic noises are related to elements that do not strictly belong to the scenario but can occur occasionally. Typical examples are uncommon weather conditions such as snow, rain, and fog, which modern computer vision algorithms handle properly [32, 33]. Small moving objects (e.g., insects) that occlude the sensor are another example of extrinsic noise; handling them is still considered an open issue. At each time instant, each scenario can present one or more noise elements of the types previously introduced.

From a formal point of view, a scene can be defined by three layers that represent three different states of the observed environment. The first layer is the reference scenario ( $R_{S}$ ) that denotes the starting configuration of the observed area, including the static structures. Let $I=[t_{0},t_{0}+\Delta t]$ be a time interval, where $t_{0}$ is a fixed time instant. During the interval $I$ , a reference scenario $R_{S}$ can be defined as follows:

$\displaystyle R_{S}(I)=\{r_{1},r_{2},\dots,r_{N}\},$ (1)

where $r_{i}$ , $\forall i\in[1,N]\subset\mathbb{N}$ , is the $i^{\text{th}}$ static element within the scenario and $N$ is the total number of elements. Notice that this layer includes only the static elements that are immutable for the entire time interval $I$ (i.e., permanent elements), such as trees, traffic lights or walls. As expected, for the reference scenario, $I$ represents a very long interval of time (e.g., weeks, months or even years). The static elements of this layer can be characterized by a set of unique features that the system can use to understand the basic parameters of the scenario (e.g., dimension or localization of a tree). Subsequently, these features can be used by the same system to correctly discern the moving subjects from the scenario. Let $F^{i}_{J}$ , $\forall i\in[1,|R_{S}|]\subset\mathbb{N}$ , be the set of features associated with the $i^{\text{th}}$ static element of the scenario; then the set of features can be defined as follows:

$\displaystyle F^{i}_{J}=\{f^{i}_{1},f^{i}_{2},\dots,f^{i}_{J}\},$ (2)

where $f^{i}_{j}$ , $\forall j\in[1,J]\subset\mathbb{N}$ , is the $j^{\text{th}}$ feature extracted from the static element $i$ , while $J$ is the total number of features extracted from the element $i$ .

The second layer is the background scenario ( $B_{S}$ ). Let $S=\{s_{1},s_{2},\dots,s_{m}\}$ be the set of $m\in\mathbb{N}$ static elements that are part of the scenario occasionally (e.g., parked cars) and let $I_{S}=[t_{s},t_{s}+\delta t_{s}]$ , where $t_{s}$ is a fixed time instant, be the time interval in which the elements of $S$ appear in the observed area, then the background scenario $B_{S}$ is composed of the union between $R_{S}$ and the set of elements in $S$ . Formally, it can be defined as:

$\displaystyle B_{S}(I_{S})=\{R_{S}(I)\cup S(I_{S})\},I_{S}\subset I,$ (3)

where $R_{S}(I)$ and $S(I_{S})$ are the reference scenario at the time interval $I$ and the set of static elements at the time interval $I_{S}$ , respectively. Unlike $I$ , $I_{S}$ is a shorter interval measured in minutes, hours or days.

Besides the static elements, at each time instant, the scenario can include subjects that may interact with the latter or with each other. Formally, the set of subjects can be defined as:

$\displaystyle\Gamma(t_{\gamma})=\{\gamma_{1},\gamma_{2},\dots,\gamma_{K}\},t_{\gamma}\in I_{\Gamma},$ (4)

where $\gamma_{k}$ , $\forall k\in[1,K]\subset\mathbb{N}$ , is the $k^{\text{th}}$ subject that appears in the scenario, $K$ is the total number of subjects, and $I_{\Gamma}=[t_{\gamma},t_{\gamma}+\delta t_{\gamma}]$ , with $t_{\gamma}$ a fixed time instant, is the time interval in which the subjects appear in the observed environment. Such an interval is shorter than $I_{S}$ , so it is usually measured in seconds or minutes. Like the static elements, the subjects are also characterized by a set of features. Let $P^{i}_{J}$ , $\forall i\in[1,|\Gamma(t_{\gamma})|]\subset\mathbb{N}$ , be the set of features associated with the $i^{\text{th}}$ subject of the scenario; then these features can be defined as follows:

$\displaystyle P^{i}_{J}=\{p^{i}_{1},p^{i}_{2},\dots,p^{i}_{J}\},$ (5)

where $p^{i}_{j}$ , $\forall j\in[1,J]\subset\mathbb{N}$ , is the $j^{\text{th}}$ feature extracted from the subject $i$ , while $J$ is the total number of features extracted from the subject $i$ .

These concepts make it possible to define the last layer, namely the dynamic scenario ( $D_{S}$ ). Given a background scenario $B_{S}(t_{\gamma})$ , at a fixed time instant $t_{\gamma}$ , and given a set of subjects $\Gamma(t_{\gamma})$ acting in the same scenario, at the same time instant $t_{\gamma}$ , the dynamic scenario $D_{S}(t_{\gamma})$ can be defined as follows:

$\displaystyle D_{S}(t_{\gamma})=\{B_{S}(t_{\gamma})\cup\Gamma(t_{\gamma})\}.$ (6)
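
The three layers can be read as nested sets. The following Python sketch is only a minimal illustration of Eqs. (3) and (6): the `Element` record, its fields, and the example values are hypothetical and not part of the formalization, and the time intervals $I$ , $I_{S}$ and $I_{\Gamma}$ are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A static element or subject with its features (hypothetical record)."""
    name: str
    features: tuple = ()


def background_scenario(reference, occasional):
    """B_S as the union of the reference scenario R_S and the set S (Eq. 3)."""
    return frozenset(reference) | frozenset(occasional)


def dynamic_scenario(background, subjects):
    """D_S as the union of the background scenario B_S and the subjects (Eq. 6)."""
    return frozenset(background) | frozenset(subjects)


# Illustrative values only.
R_S = {Element("tree", (("height_m", 6.0),)), Element("traffic_light")}
S = {Element("parked_car")}
GAMMA = {Element("person_1"), Element("person_2")}

B_S = background_scenario(R_S, S)
D_S = dynamic_scenario(B_S, GAMMA)
print(len(R_S), len(B_S), len(D_S))
```

By construction, the reference scenario is contained in the background scenario, which in turn is contained in the dynamic scenario.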

Table 1
RTEC dialect

Predicate	Meaning
happensAt( $E$ , $T$ )	Event $E$ occurs at time $T$ .
holdsAt( $F=V$ , $T$ )	The value of fluent $F$ is $V$ at time $T$ .
holdsFor( $F=V$ , $I$ )	$I$ is the list of the maximal intervals for which $F=V$ holds continuously.
initiatedAt( $F=V$ , $T$ )	At time $T$ a period of time for which $F=V$ is initiated.
terminatedAt( $F=V$ , $T$ )	At time $T$ a period of time for which $F=V$ is terminated.
relative_complement_all( $I^{\prime}$ , $L$ , $I$ )	$I$ is the list of the maximal intervals produced by the relative complement of the list of maximal intervals $I^{\prime}$ with respect to every list of maximal intervals of list $L$ .
union_all( $L$ , $I$ )	$I$ is the list of the maximal intervals produced by the union of the lists of maximal intervals of list $L$ .
intersect_all( $L$ , $I$ )	$I$ is the list of the maximal intervals produced by the intersection of the lists of maximal intervals of list $L$ .

Following the formal definition of the three scenarios, the next paragraph provides the formal definition of events. In the state-of-the-art, events are usually divided into simple events and complex events. The reported examples of simple and complex events were extracted from public datasets often used to support the evaluation step of ER systems [34, 35, 36].

2.2 Simple events

Simple events describe a set of primitive actions performed by subjects. In this type of event, subjects perform actions by interacting with the environment, but not with each other. The actions performed by a subject, if temporally and spatially correlated, trigger a simple event. Formally, let $\gamma_{i}\in\Gamma(I_{\Gamma})$ be a subject $i$ that appears in the scene for the whole time interval $I_{\Gamma}$ or part of it, and let $j$ be the index of the $j^{\text{th}}$ action performed by the same subject within the same time interval; then a simple event that occurs within the time interval $I_{\Gamma}$ can be defined as follows:

$\displaystyle E^{i,j}_{\textit{simple}}=\bigcup_{j=1}^{N_{a}}a_{j,\gamma_{i}},$ (7)

where $a_{j,\gamma_{i}}$ is the $j^{\text{th}}$ action associated to the subject $\gamma_{i}$ within the time interval $I_{\Gamma}$ , and $N_{a}$ is the total number of actions performed by the subject. Figure 1a and b provide two examples of a simple event.

Figure 1.

Examples of (a)(b) simple events and (c)(d) complex events.

The RTEC dialect, summarized in Table 1, can be used to represent simple events and to model their effects. Considering the riding horse event in Fig. 1b, such event can be formalized as follows:

$\displaystyle\begin{aligned}
\textit{initiatedAt}(\textit{ridingHorse}(H_{1})=\textit{true},T)\leftarrow{}&\textit{happensAt}(\textit{riding}(H_{1})=\textit{true},T),\\
&\textit{happensAt}(\textit{riding}(H_{1})=\textit{true},T+1),\\
&\dots
\end{aligned}$ (8)

In the rule of Eq. (8), the fluent $\textit{riding}(H_{1})=\textit{true}$ is an instantaneous event indicating the riding of a horse at time-point $T$ . Moreover, the horse moves every second. Notice that, as shown below for complex events, the maximal intervals for which $\textit{ridingHorse}(H_{1})=\textit{true}$ holds continuously can be defined by using the $\textit{holdsFor}()$ predicate.
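
As a rough illustration of how maximal intervals can be derived from initiation and termination time-points, consider the Python sketch below. It is not the RTEC implementation: the function name, the `(start, end)` pair representation, the `horizon` parameter, and the handling of coinciding time-points are assumptions made for this example only.

```python
def maximal_intervals(initiations, terminations, horizon):
    """Pair each initiation with the first later termination.

    initiations / terminations: time-points at which F=V is initiated or
    terminated. An interval still open at the end is closed at `horizon`.
    """
    init_points = set(initiations)
    term_points = set(terminations)
    intervals = []
    start = None  # initiation point of the currently open interval, if any
    for t in sorted(init_points | term_points):
        # A termination closes the open interval, if there is one.
        if t in term_points and start is not None and t > start:
            intervals.append((start, t))
            start = None
        # An initiation opens a new interval only when none is open,
        # so repeated initiations inside an interval are ignored.
        if t in init_points and start is None:
            start = t
    if start is not None and start < horizon:
        intervals.append((start, horizon))
    return intervals


# Hypothetical time-points for the fluent ridingHorse(H1)=true.
riding = maximal_intervals(initiations=[2, 3, 10],
                           terminations=[6, 7, 15],
                           horizon=20)
print(riding)
```

Repeated initiations (time-point 3) and terminations with no open interval (time-point 7) do not split or extend the resulting intervals, which is what makes them maximal.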

2.3 Complex events

Complex events can be defined as a set of simple events that are temporally and spatially correlated among each other. Taking into account the time interval $I_{\Gamma}$ , the complex events which may occur within it can be formally defined as:

$\displaystyle E^{T_{\Gamma}}_{\textit{complex}}=\bigcup_{i=1}^{|\Gamma(I_{\Gamma})|}\bigcup_{j=1}^{|E_{\textit{set},j}|}E^{i,j}_{\textit{simple}},$ (9)

where $E_{\textit{simple}}^{i,j}$ is defined as in Section 2.2, $i$ is the index of the $i^{\text{th}}$ subject that acts in the scenario, $\Gamma(I_{\Gamma})$ is the whole set of subjects that appear in the scene within time interval $I_{\Gamma}$ , $j$ is the index of the $j^{\text{th}}$ action associated to the $i^{\text{th}}$ subject and, finally, $E_{\textit{set},j}$ is the set of simple events triggered by the $i^{\text{th}}$ subject. Notice that various subjects usually act within a scene, and any of them can trigger a simple event. A complex event occurs if and only if two (or more) simple events coming from two (or more) different subjects are, in turn, spatially and temporally correlated. Therefore, the definition reported in Eq. (9) has to be considered under such constraints. Figure 1c and d provide two examples of complex events.
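
The correlation constraint can be made concrete with a small Python sketch. The text does not prescribe how spatial and temporal correlation are measured, so the choices below (overlapping time intervals, Euclidean distance below a threshold `max_distance`, and the `SimpleEvent` record itself) are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass(frozen=True)
class SimpleEvent:
    """A simple event of one subject (hypothetical record)."""
    subject: str
    label: str
    start: float      # seconds
    end: float        # seconds
    position: tuple   # (x, y) in metres


def correlated(a, b, max_distance=2.0):
    """Assumed test: different subjects, overlapping in time, close in space."""
    overlap = a.start < b.end and b.start < a.end
    near = dist(a.position, b.position) <= max_distance
    return a.subject != b.subject and overlap and near


def complex_events(events, max_distance=2.0):
    """Pairs of simple events that together form a candidate complex event."""
    return [(a, b) for a, b in combinations(events, 2)
            if correlated(a, b, max_distance)]


events = [
    SimpleEvent("P1", "hug", 0, 5, (0.0, 0.0)),
    SimpleEvent("P2", "hug", 3, 8, (1.0, 0.0)),
    SimpleEvent("P3", "walk", 0, 10, (50.0, 50.0)),
]
pairs = complex_events(events)
print([(a.subject, b.subject) for a, b in pairs])
```

Only the two nearby, temporally overlapping events of P1 and P2 are paired; the simple event of P3 remains a simple event.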

Just as for simple events, the RTEC dialect can be used to represent complex events and to model their effects. Considering the hugging event in Fig. 1d, such event can be formalized as follows:

$\displaystyle\begin{aligned}
&\textit{initiatedAt}(\textit{hugging}(P1,P2)=\textit{true},T)\leftarrow\\
&\quad\textit{happensAt}(\textit{start}(\textit{hug}(P1)=\textit{true}),T),\\
&\quad\textit{holdsAt}(\textit{hug}(P2)=\textit{true},T),\\
&\quad\textit{holdsAt}(\textit{close}(P1,P2)=\textit{true},T)\\
&\textit{initiatedAt}(\textit{hugging}(P1,P2)=\textit{true},T)\leftarrow\\
&\quad\textit{happensAt}(\textit{start}(\textit{hug}(P2)=\textit{true}),T),\\
&\quad\textit{holdsAt}(\textit{hug}(P1)=\textit{true},T),\\
&\quad\textit{holdsAt}(\textit{close}(P1,P2)=\textit{true},T)\\
&\textit{initiatedAt}(\textit{hugging}(P1,P2)=\textit{true},T)\leftarrow\\
&\quad\textit{happensAt}(\textit{start}(\textit{close}(P1,P2)=\textit{true}),T),\\
&\quad\textit{holdsAt}(\textit{hug}(P1)=\textit{true},T),\\
&\quad\textit{holdsAt}(\textit{hug}(P2)=\textit{true},T)\\
&\textit{terminatedAt}(\textit{hugging}(P1,P2)=\textit{true},T)\leftarrow\\
&\quad\textit{happensAt}(\textit{end}(\textit{hug}(P1)=\textit{true}),T)\\
&\textit{terminatedAt}(\textit{hugging}(P1,P2)=\textit{true},T)\leftarrow\\
&\quad\textit{happensAt}(\textit{end}(\textit{hug}(P2)=\textit{true}),T)\\
&\textit{terminatedAt}(\textit{hugging}(P1,P2)=\textit{true},T)\leftarrow\\
&\quad\textit{happensAt}(\textit{end}(\textit{close}(P1,P2)=\textit{true}),T).
\end{aligned}$ (10)

In the rules of Eq. (10), the predicates $\textit{initiatedAt}()$ and $\textit{terminatedAt}()$ appear as rule heads, while the predicates $\textit{happensAt}()$ and $\textit{holdsAt}()$ in the antecedents define constraints on events and fluents. The above rules state that two subjects, $P1$ and $P2$ , are hugging when they hug each other and are close. The functions $\textit{start}(F=V)$ and $\textit{end}(F=V)$ are built-in RTEC events that mark the starting and ending points of each maximal interval for which the fluent $F=V$ holds continuously. Fluents can also be statically determined: in this case, the value of a fluent $F$ is computed from the values of other fluents by using the predicate $\textit{holdsFor}()$ . Considering the fencing event in Fig. 1c, the statically determined version of the event formalization can be defined as follows:

$\displaystyle\begin{aligned}
&\textit{holdsFor}(\textit{fencing}(P_{1},P_{2})=\textit{true},I)\leftarrow\\
&\quad\textit{holdsFor}(\textit{fight}(P_{1})=\textit{true},I_{1}),\\
&\quad\textit{holdsFor}(\textit{fight}(P_{2})=\textit{true},I_{2}),\\
&\quad\textit{holdsFor}(\textit{close}(P_{1},P_{2})=\textit{true},I_{3}),\\
&\quad\textit{union\_all}([I_{1},I_{2}],I_{4}),\\
&\quad\textit{intersect\_all}([I_{4},I_{3}],I_{5}),\\
&\quad\textit{holdsFor}(\textit{inactive}(P_{1})=\textit{true},I_{6}),\\
&\quad\textit{holdsFor}(\textit{inactive}(P_{2})=\textit{true},I_{7}),\\
&\quad\textit{relative\_complement\_all}(I_{5},[I_{6},I_{7}],I).
\end{aligned}$ (11)

According to the rule of Eq. (11), two subjects are fencing as long as at least one of them is fighting, the other is not inactive, and they are close. The interval manipulation constructs specified in RTEC are useful to express that, for all time-points $T$ , $F=V$ holds at $T$ if and only if some Boolean combination of fluent-value pairs holds at $T$ . Moreover, in the presence of a great number of fluents, interval manipulation constructs allow more concise definitions than the traditional Event Calculus representation. By using the predicate $\textit{holdsFor}()$ and identifying the conditions under which a fluent is initiated and terminated, it is possible to compute its maximal intervals.
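
The interval manipulation used in Eq. (11) can be mimicked with a few lines of Python. The sketch below is not RTEC code: it assumes that each list of maximal intervals is a list of half-open `(start, end)` pairs, and the interval values for the fluents are invented for the example.

```python
def normalise(intervals):
    """Merge overlapping or touching intervals into sorted maximal ones."""
    merged = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # skip empty intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union_all(lists):
    return normalise([iv for intervals in lists for iv in intervals])


def intersect_all(lists):
    result = normalise(lists[0])
    for intervals in lists[1:]:
        result = normalise([(max(s1, s2), min(e1, e2))
                            for s1, e1 in result
                            for s2, e2 in normalise(intervals)])
    return result


def relative_complement_all(base, lists):
    """Parts of `base` that are covered by none of the lists in `lists`."""
    removed = union_all(lists)
    kept = []
    for start, end in normalise(base):
        cursor = start
        for r_start, r_end in removed:
            if r_end <= cursor or r_start >= end:
                continue  # no overlap with the remaining part
            if r_start > cursor:
                kept.append((cursor, r_start))
            cursor = max(cursor, r_end)
        if cursor < end:
            kept.append((cursor, end))
    return kept


# Hypothetical maximal intervals for the fluents in Eq. (11).
I1, I2 = [(0, 5)], [(8, 12)]   # fight(P1), fight(P2)
I3 = [(3, 10)]                 # close(P1, P2)
I6, I7 = [(9, 10)], []         # inactive(P1), inactive(P2)

I4 = union_all([I1, I2])
I5 = intersect_all([I4, I3])
fencing = relative_complement_all(I5, [I6, I7])
print(fencing)
```

With these values, fencing holds while the subjects are close and at least one is fighting, except for the sub-interval in which one of them is inactive.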

Figure 2.

Logical architecture of a generic ER system.

2.4 Logical architecture of a generic ER system

Figure 2 shows an overview of the different modules that make up the logical architecture of a generic ER system. The scene acquisition module uses one or more video sensors to acquire the scene. Usually, these sensors are RGB cameras but, depending on the scene, they can also be RGB cameras with depth information (i.e., RGB-D), stereo cameras, infrared (IR) cameras, or a mixture of them all [37, 38]. To manage the recognition of events properly, an ER system should acquire all three scenarios $R_{S}$ , $B_{S}$ and $D_{S}$ . The layer $R_{S}$ can be considered as the ground truth of the observed scene, while the layers $B_{S}$ and $D_{S}$ can be considered, respectively, an update of the ground truth within a fixed interval of time and the scene in which the events are detected. During the acquisition, or once the entire video sequence has been acquired (for off-line, non-real-time ER systems), a pre-processing stage performs some operations to improve the quality of the acquired frames and to enhance the details within them. Commonly, the image improvement consists of removing noise, while the detail enhancement consists of highlighting the main components (e.g., edges or corners).

Subsequently, the feature extraction module is used to extract relevant information frame-by-frame, in order to uniquely describe the different subjects appearing in the scene. Such features can be of different types (e.g., trajectories or histograms) and used separately or combined. The extracted features are used in two different modules, namely event learning and event recognition. The former trains the event recognition module, which is updated based on the newest features extracted from the observed scene. This step allows the ER system to recognize an increasingly large set of events. The latter, once trained, classifies the detected events according to the extracted features. Any additional information about each classified event (e.g., kind of event, level of dangerousness, or number of subjects involved) can be provided to an operator by a human-computer interaction interface. Notice that the interaction among different modules and, specifically, among those designed to learn and classify events, depends on several factors, including the complexity of the ER system, the number of recognizable classes, the type of video sensors involved, and so on [39, 40]. The in-depth analysis of such factors is not a focus of this paper. However, the architecture highlights that it is necessary to consider two key factors in the implementation of an ER system: the features by which each scene is analysed, and the algorithms through which these features are processed in order to learn and classify the events. The next section provides two taxonomies based on features and machine learning algorithms, respectively.

3. Taxonomies for event recognition in videos

This section starts by proposing a first taxonomy based on the most used features for event recognition in video sequences. Subsequently, it continues by proposing a second taxonomy based on the most used machine learning algorithms for the same task. These criteria were chosen because features and algorithms are cornerstones of machine learning. The two taxonomies provide the keys to understanding how to describe and recognize actions or interactions performed by subjects in a video. We built such background maps by studying a broad set of selected works in the current literature to summarize the main aspects of the state-of-the-art methods, useful for the reader as a quick overview during the development of smart algorithms for event recognition in videos.

3.1 Taxonomy of features

The taxonomy shown in Fig. 4 depicts the different types of features used to describe actions or interactions performed by subjects in a video. Such features are often chosen by taking into account the algorithm used to learn and recognize events and the type of raw data acquired by the specific video sensor. However, the choice mostly depends on the type of event that needs to be classified. Features can be organized into two main sets: spatio-temporal features (STF) and image description features (IDF). Figure 3 shows examples of the different types of features discussed in this section, starting from the original image in Fig. 3a and proposed in [41].

Figure 3.

Examples of the different types of features for event recognition: (a) original image, (b) trajectory, (c) depth map, (d) skeleton joints, (e) Canny ED, (f) SURF, (g) SIFT, (h) HOG, (i) deep features.

Figure 4.

Taxonomy of the most used features for event recognition in videos.

3.1.1 Spatio-temporal features

The STF set includes all those features describing spatial and temporal relations either among subjects or between subjects and a scene. A feature frequently used is the trajectory of each subject shown in Fig. 3b. This type of feature can be extracted by using the pixel information [42, 34] acquired with a standard RGB sensor, or by using the skeleton information acquired with an RGB-D sensor [37].

A different approach is the use of description vectors. Such vectors can contain any type of data, including axis velocities [43, 42], spatio-temporal volumes [44], and a sequence of poses [36]. Besides, there are approaches based on 3D data, which make use of either local or global spatio-temporal features extracted from depth maps (Fig. 3c), or specific joints and measures of the skeletons (Fig. 3d). In several cases, the approaches mentioned above are also combined with standard 2D RGB images [45, 46, 47, 38, 40, 48].
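
As an illustration of a trajectory-based description vector, the Python sketch below derives axis velocities from a sequence of subject centroids. The function names, the choice of summary statistics, and the assumption that one centroid is available per frame at a known frame rate `fps` are made for this example and are not taken from the cited works.

```python
from math import hypot


def axis_velocities(trajectory, fps):
    """Per-frame velocities (vx, vy) from consecutive centroid positions."""
    dt = 1.0 / fps  # time between two consecutive frames, in seconds
    return [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]


def description_vector(trajectory, fps):
    """Assumed descriptor: (mean vx, mean vy, mean speed) of a trajectory."""
    velocities = axis_velocities(trajectory, fps)
    n = len(velocities)
    mean_vx = sum(vx for vx, _ in velocities) / n
    mean_vy = sum(vy for _, vy in velocities) / n
    mean_speed = sum(hypot(vx, vy) for vx, vy in velocities) / n
    return (mean_vx, mean_vy, mean_speed)


# A subject moving one pixel to the right per frame, observed at 2 fps.
trajectory = [(0, 0), (1, 0), (2, 0), (3, 0)]
vector = description_vector(trajectory, fps=2)
print(vector)
```

A vector of this kind can then be passed, alone or concatenated with other features, to a learning algorithm.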

3.1.2 Image description features

The IDF set includes all those features containing information about some interesting points of subjects and, in some cases, of a scene (e.g., edges, corners or others). In the current literature, several works are supported by image descriptors, such as Canny edge detector (Canny ED) [49] in Fig. 3e, speeded up robust feature (SURF) [50] in Fig. 3f, scale invariant feature transform (SIFT) [51] in Fig. 3g, or related modified versions. More specifically, SIFT and SURF feature extractors are used to detect a set of distinguishable local features (i.e., key points) from the images. Such extractors are designed to be reasonably invariant under scale, rotation, and translation of local features. Besides, with the aim to make these extractors robust enough under noise and illumination changes, different approaches adopt them only in regions of images having high contrast levels. Considering low-level features, in the state-of-the-art, several works use histogram of optical flow (HOF) [52], histogram of oriented gradients (HOG) [53] shown in Fig. 3h, and spatio-temporal gradients [36]. In general, such algorithms provide features similar to SIFT and SURF but, rather than extracting them from high contrast regions of the image, they are computed on a grid composed of cells spaced uniformly. To improve accuracy, an overlapping local contrast normalization is usually adopted.
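
The grid-based computation of gradient histograms can be sketched in plain Python as follows. This is a simplified, HOG-like illustration rather than the descriptor of [53]: it assumes a greyscale image given as a list of rows, uses central differences and unsigned orientations, and omits the overlapping block normalization; the function name and default parameters are hypothetical.

```python
from math import atan2, degrees, hypot


def cell_histograms(image, cell=4, bins=9):
    """Gradient-orientation histograms on a uniform grid of cells."""
    height, width = len(image), len(image[0])
    rows, cols = height // cell, width // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    bin_width = 180.0 / bins
    # Border pixels are skipped because central differences need neighbours.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = hypot(gx, gy)
            angle = degrees(atan2(gy, gx)) % 180.0  # unsigned orientation
            b = min(int(angle / bin_width), bins - 1)
            r, c = y // cell, x // cell
            if r < rows and c < cols:
                # Each pixel votes for its orientation bin with its magnitude.
                hists[r][c][b] += magnitude
    return hists


# 8x8 test image with a vertical edge between columns 3 and 4.
image = [[0] * 4 + [10] * 4 for _ in range(8)]
hists = cell_histograms(image)
print(hists[0][0])
```

For the vertical edge of the example, all the gradient energy falls into the first orientation bin of each cell, since the gradient is purely horizontal.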

Recently, deep learning techniques have often been used to learn discriminative image-based features, as shown in Fig. 3i. They are referred to as deep features [54, 55, 56], and each of them is the consistent response of a unit, within a hierarchical model, to a given input. The depth of these features depends on their position within the hierarchical structure. Moreover, they are non-linear, discriminant and invariant, performing well in image classification and target detection problems [57].

3.2 Taxonomy of machine learning algorithms

The taxonomy in Fig. 5 shows the different types of machine learning algorithms usually applied for event recognition in video sequences. Such algorithms can be divided into two main sets: supervised and unsupervised. An additional set concerns the approaches that can be classified either in the supervised or unsupervised set. For completeness, Section 3.2.3 describes some of these methods.

Figure 5.

Taxonomy of the most used machine learning algorithms for event recognition in videos.

3.2.1 Supervised approaches

The algorithms belonging to this set of methods require labelled training data, that is, the desired output for each dataset instance. The support vector machine (SVM) [58, 34, 50, 36] is widely used in event detection and recognition. This type of approach aims to classify the observed data by finding the maximum-margin hyperplane, in the feature space, separating elements belonging to different classes.
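
To make the idea of a margin-based separating hyperplane concrete, the sketch below trains a linear soft-margin SVM by batch subgradient descent on the regularized hinge loss. It is a didactic, standard-library-only illustration, not the solver used in the cited works; the function names, hyperparameters (`lam`, `lr`, `epochs`), and the toy data are assumptions.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=100):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by subgradient descent."""
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # inside the margin or misclassified
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable feature vectors with labels in {-1, +1}.
points = [(2, 2), (3, 3), (3, 2), (-2, -2), (-3, -3), (-3, -2)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, x) for x in points])
```

The condition `margin < 1` is what distinguishes this objective from a plain perceptron: points that are correctly classified but too close to the hyperplane still contribute to the update.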

Another supervised method is boosting [59, 45], which combines several weak learners, i.e., learners whose classification result is only slightly better than a random guess, to create a stronger classifier whose classification result is strongly correlated with the ground truth.
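To make the idea of combining weak learners concrete, the sketch below implements a small discrete AdaBoost with threshold stumps as weak learners. It is only an illustrative example: the function names, the use of decision stumps and the {-1, +1} label convention are assumptions made here, not details taken from the surveyed works.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Weak learner: +1 on one side of the threshold, -1 on the other."""
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                err = np.sum(w[stump_predict(X, j, t, polarity) != y])
                if err < best[3]:
                    best = (j, t, polarity, err)
    return best

def adaboost_fit(X, y, rounds=10):
    """Discrete AdaBoost; labels y must be in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        j, t, polarity, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)       # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)      # vote of this weak learner
        pred = stump_predict(X, j, t, polarity)
        w = w * np.exp(-alpha * y * pred)          # emphasize misclassified samples
        w = w / w.sum()
        ensemble.append((alpha, j, t, polarity))
    return ensemble

def adaboost_predict(ensemble, X):
    """Sign of the weighted vote of all weak learners."""
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)
```

On a one-dimensional toy set where the positive class occupies both ends of the axis, no single stump is correct everywhere, yet the weighted vote of a few stumps can separate the classes.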

Finally, the multiple instance learning (MIL) [60] can be considered a variation of the supervised learning paradigm. In this case, the algorithms receive a set of labelled instance bags during the training step, rather than a set of instances individually labelled.

3.2.2 Unsupervised approaches

The algorithms belonging to this set do not require any labelled training data, since data labels are inferred by the algorithms themselves. A common unsupervised approach, generally used in event recognition, is clustering [44, 51, 43]. Clustering algorithms aim to generate well-defined and homogeneous partitions of a feature set, usually composed of the most salient features extracted from the observed scene.

Other types of unsupervised algorithms are the probabilistic approaches [61, 62], which can predict the class to which the observed data belong, given an input and a probability distribution for each instance.

3.2.3 Supervised and unsupervised approaches

As mentioned, another set of methods involves the algorithms which can be used either in a supervised or unsupervised way. Some well-known approaches adopted for event recognition, falling within this set, are artificial neural networks (ANNs) [63] and deep learning strategies [64].

ANNs are algorithms inspired by the structure of biological neural networks and their hierarchical organization. The artificial networks consist of layers of connected artificial neurons and are used to learn the relationships between input/output data and patterns [65]. Such neurons are the computational units of the hierarchical model, and each layer represents a level in the hierarchy. There is no limit on the number of layers. Owing to the recent improvement of hardware computational capacity, the addition of more and more layers has led to increasingly deep ANNs.

Deep learning involves the construction of machine learning models to learn hierarchical representations of data within a set of images, including a set of edges, a set of regions with particular properties, and so on. Notice that, in order to solve a given task (e.g., object segmentation or face recognition), some representations are better than others. The capability of deep learning to obtain salient features has also been explored for event recognition, achieving remarkable results [66, 67].

In the following section, the large set of state-of-the-art works that allowed us to define the reported taxonomies is examined in detail.

4. Research contributions

This section analyses the features and machine learning algorithms taxonomies and discusses the most relevant and recent works for event recognition organized on the basis of the features used. For each work, limits are reported, along with the combination of features and algorithms exploited. Moreover, the reported works are summarized in concise tables showing features and algorithms, recognizable events, datasets and application fields (Tables 2 and 3).

Table 2
Summary table of recognizable events with reference works and datasets used

Works	Events	Datasets	Works	Events	Datasets
[42]	Action recognition	Parking lot [125], Shopping center and University campus [126]	[55]	Action recognition	MultiTHUMOS [55]
[43]	Anomaly detection, traffic monitoring	MIT Traffic [127], SAIVT-Campus [128]	[107]	Anomaly detection	UCSD [35], Avenue [129]
[44]	Anomaly detection	UMN [130], UCSD [35]	[56]	Action recognition	50 Salads [131, 132, 133], Manipulation actions [56]
[36]	Action recognition	Weizmann [134], KTH [135], UCF-Sports [136, 137], UCF50 [138]	[95]	Action, gesture, and interaction recognition	NTU60 RGB+D [139], SBU Kinect interaction [140], ChaLearn Gesture [141, 142]
[45]	Action recognition	MSR-Action 3D [143]	[97]	Action and interaction recognition	NTU60 RGB+D [139], UTK Action 3D [144], SBU Kinect interaction [140]
[46]	Action recognition	MSR-Action 3D [143], UTK Action 3D [144], UCF Kinect [145]	[39]	Action and interaction recognition	UCF101 [146], ActivityNet [147], CCV [148], USAA [149, 150]
[47]	Action and gesture recognition	MSR-Action 3D [143], MSR Gesture 3D [151]	[100]	Action, gesture, and interaction recognition	OAD [152], ChaLearn Gesture [141, 142], PKU-MMD [153], G3D [154]
[40]	Action recognition	BUPT Arm Activity [40], CMU Arm and Finger Activity [40]	[124]	Interaction recognition	BIT-Interaction [155], VIRAT [156], UT-Interaction [157]
[38]	Action recognition	UTK Action 3D [144], Berkeley MHAD [158], ACT4$^{2}$ [159], MSR Daily Activity 3D [160]	[101]	Traffic monitoring	VOT Challenges [161]
[51]	Action recognition	Hollywood [162], TRECVID SED [163]	[67, 66]	Interaction recognition	VIRAT [156], UT-Interaction [157]
[34]	Action recognition	HMDB51 [164], UCF50 [138]	[119]	Anomaly detection	UCSD [35]
[50]	Action recognition	Hollywood2 [165], Olympic Sports [166], HMDB51 [164], UCF50 [138]	[98]	Action and interaction recognition	UCF101 [146], NTU60 RGB+D [139], SBU Kinect interaction [140]
[52]	Anomaly detection	UMN [130], PETS2009 [167]	[168]	Anomaly detection	YouTube videos
[54]	Action, gesture, and interaction recognition	MIVIA action [169, 170], NATOPS gesture [171], SBU Kinect interaction [140], Weizmann [134]	[123]	Action recognition	AVA [172]
[103]	Action recognition	NTU60 RGB+D [139], NTU120 RGB+D [173], SYSU [174]	[175]	Anomaly detection	UCSD [35], UMN [130], Subway videos [176]
[177]	Anomaly detection, traffic monitoring	Track 4 NVIDIA AI CITY 2020 [178]	[179]	Interaction recognition	CAD, CAE, Volleyball

Table 3

Summary table of application fields, detectable subjects, used features and machine learning algorithms

Applications	Subjects	Features	Algorithms
General video surveillance [34, 36, 42, 51, 50, 66, 67, 54, 55, 56, 39, 124, 98, 123, 179]	People, objects	Description Vectors, Trajectory, SIFT/SURF, HOG/HOF, Deep Features	Deep Learning, Probabilistic, Clustering, SVM, Neural Networks
Indoor video surveillance [38, 45, 46, 47, 95, 97, 100, 103, 40]	People	3D Data, Deep Features	Clustering, SVM, Neural Networks, Boosting, Probabilistic
Traffic monitoring [43, 62, 101, 177]	People, vehicles	Trajectory, 3D Data, Deep Features	MIL, Clustering, Probabilistic, Neural Networks
Crowded scene/Outdoor video surveillance [44, 52, 107, 119, 168, 175]	People, vehicles	Description Vectors, Canny ED, HOG/HOF, Deep Features	Clustering, SVM, Neural Networks

4.1 Methods based on spatio-temporal features

This section describes some recent key works that use the spatio-temporal features to detect and recognize events.

4.1.1 Trajectory

Sun et al. [42] consider the trajectories generated by the sequences of movements of human subjects in the observed scenario for activity recognition. In detail, the authors propose a method for modelling the subjects’ trajectories by combining both beta process (BP) [68] and hidden Markov models (HMM) [69], thus generating a novel approach: the beta process hidden Markov models (BP-HMM). The motions are shared among trajectories using a set of movement labels, which correspond to the movements observed in the scenario, and a binary vector of features for each trajectory. The elements of such vectors are 1, if the trajectory contains the corresponding movement label, and 0 otherwise. The weights of state transition of the HMM are computed by using a Gamma distribution and, subsequently, the same weights are combined with the trajectory feature vector to obtain the transition distribution of the trajectory nodes. The sampling of each trajectory (i.e., the binary vector of features, the state transition of the HMM and the transition distribution) is performed with Markov chain Monte Carlo (MCMC) [70]. Finally, each trajectory is classified by taking the class that maximizes the log-likelihood of the trajectory probability. The limit of such method is given by the misclassification of events composed of flexible and different activities, or the misclassification of events that share motions.
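The final classification step, choosing the class whose model maximizes the log-likelihood of a trajectory, can be sketched with a plain discrete-output HMM and the scaled forward algorithm. This is a simplified stand-in and not the BP-HMM of [42]: the beta process prior and the MCMC sampling are omitted, and the function names, the per-class parameter tuples and the encoding of movement labels as integer symbols are assumptions made for this example.

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete-output HMM.

    start: (S,) initial state probabilities
    trans: (S, S) state transition matrix, rows sum to 1
    emit:  (S, V) emission probabilities over V symbols, rows sum to 1
    """
    alpha = start * emit[:, obs[0]]
    total = 0.0
    for symbol in obs[1:]:
        scale = alpha.sum()
        total += np.log(scale)                 # accumulate the scaling factors
        alpha = (alpha / scale) @ trans * emit[:, symbol]
    return total + np.log(alpha.sum())

def classify(obs, models):
    """Return the label of the model with the highest log-likelihood."""
    scores = {label: log_likelihood(obs, *params) for label, params in models.items()}
    return max(scores, key=scores.get), scores
```

Rescaling the forward variable at every step avoids numerical underflow on long trajectories while leaving the resulting log-likelihood unchanged.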

Xu et al. [43] propose a method to detect events in crowded scenes by exploiting MPEG codec [71] to build motion vectors. The MPEG discrete cosine transform (DCT) coefficients are then used to compute a foreground map in order to remove the background trajectories. The following steps consist in transforming the trajectories into the Fourier domain and in quantizing the Fourier representations into visual words by using the k-means algorithm [72]. Besides, the authors propose a MIL algorithm that models the observed scenario as a linear sparse combination of independent events [73, 74], where each event is a distribution of visual words. The limit of this method is that the event of interest must be independent of the background activities.

4.1.2 Description vectors

Li et al. [44] propose a three-step framework for anomalous event detection and localization. The first step consists in dividing the considered video into spatio-temporal volumes of fixed size. From each volume, the spatio-temporal gradients are computed by using a first-order differential approximation. Then, two histograms, for each gradient, are computed to be used as features. The extracted features are used as input for the fuzzy weighted c-means clustering algorithm [75], in which the elements are not assigned definitively to a class, but have different degrees of membership for each class. The second step consists in dividing the scenario into cubes, which are spatio-temporal windows of dimensions greater than volumes. In turn, the latter are divided into eight blocks. By using the bag of words (BoW) [76] approach, the membership grades belonging to the same cluster are grouped into the related blocks, thus obtaining a set of local behaviour patterns represented as histograms. Each histogram that differs from the normal local behaviour patterns previously computed can be considered as an anomalous event. In particular, in the last step, to detect the anomalous events, the authors use the sparse coding approach shown in [77]. To quantify the difference between an anomalous and a normal event, the authors designed a cost criterion of sparse reconstruction. A typical event is characterized by a low reconstruction cost, while an anomalous event has a high reconstruction cost. The limit of this method is the dimension of the volumes, which plays a key role in obtaining a satisfactory result in this type of event detection.
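The notion of graded membership can be illustrated with the standard fuzzy c-means updates. The sketch below is the plain, unweighted algorithm and therefore only approximates the fuzzy weighted variant of [75]; the function names, the fuzzifier value m = 2 and the random initialization are assumptions made for this example.

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0, eps=1e-12):
    """Degree of membership of each sample in each cluster; every row sums to 1."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def update_centers(X, U, m=2.0):
    """Cluster centers as membership-weighted means of the samples."""
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]

def fuzzy_c_means(X, c, m=2.0, iters=50, seed=0):
    """Alternate membership and center updates for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        U = fuzzy_memberships(X, centers, m)
        centers = update_centers(X, U, m)
    return centers, fuzzy_memberships(X, centers, m)
```

Unlike k-means, every sample keeps a non-zero membership in every cluster, so the rows of the membership matrix can be accumulated into soft histograms.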

Moayedi et al. [36] propose an action recognition method in which the actions are represented as a sequence of poses. In particular, each single image (i.e., a frame) belonging to an action sequence is considered as a pose, on which the motion history image (MHI) [78] algorithm is computed. The MHI is a template in which the pixel intensity is a function of how recent the movement is in the sequence. The MHI is computed taking into consideration the frames immediately before and after the one considered. Since the human body shape and the orientation of its parts provide discriminant information in many actions, the set of MHI images is represented as HOG descriptors and stored as a matrix. Next, a dictionary is created by using the BoW model. This dictionary is learned from a weighted linear combination of base vectors, which are obtained from a set of example videos. At this point, the matrix of HOG descriptors is encoded by using a sparse representation of the input data structure. Finally, the sparse matrix is transformed into a histogram set, which represents the weight of each dictionary base in a video sequence. These histograms are the input for an SVM classifier. The limit of this approach is that replacing HOG with a more complex descriptor would make it lose its real-time capability.
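The statement that pixel intensity encodes how recent the movement is can be illustrated with the classic frame-differencing update of the MHI. This is a generic sketch and not the windowed variant used in [36]: the function name, the duration parameter tau and the difference threshold are assumptions made for this example.

```python
import numpy as np

def motion_history_image(frames, tau=10, threshold=25):
    """Motion history image of a grayscale sequence of shape (T, H, W).

    Pixels that moved in the latest step get the value tau; older motion
    decays by one unit per step. The result is scaled to [0, 1].
    """
    frames = np.asarray(frames, dtype=float)
    mhi = np.zeros(frames.shape[1:])
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr - prev) > threshold   # simple motion mask
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
    return mhi / tau
```

For a bright pixel sliding across a row, the resulting template is brightest at the latest positions and fades along the path already travelled.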

4.1.3 3D data

Regarding the use of depth data, Chen et al. [45] propose a 3D action recognition method which combines unsupervised feature learning with the extraction of spatio-temporal features from unlabelled video data. In detail, two types of features are used: the local independent subspace analysis feature (LISA) and the global independent subspace analysis feature (GISA). The advantage of using these features is that they are both invariant to the translation of the human body, and robust to noise. The features are computed by extracting the depth subvolumes in a neighbourhood of the skeleton joints. Each subvolume is then normalized and whitened to reduce the input dimensionality. The next step consists in treating a subvolume as a sequence of depth images that are flattened and arranged in a vector. The latter represents the input of an independent subspace analysis (ISA) neural network [79, 80], which is trained similarly to the other algorithms described in [81, 82]. After the learning phase, two histograms (i.e., one for LISA features and one for GISA features) are associated to each joint by adopting a BoW approach. Since some joints are more important than others in recognizing specific actions, the authors used the adaboost method [59] to understand which joints are more relevant in the different types of actions. In detail, the proposed algorithm, named EnMkl (ensemble multi-kernel learning), boosts a set of multiple kernel learning support vector machines (MKLSVM) [58], and provides as output the combination between the movements of the acquired subject and the related actions, such as jogging, object throwing, hand clapping. The limit of this method is the recognition of actions that share similar movements.

Slama et al. [46] address the problem of action recognition from a purely geometrical point of view, where an action observability matrix is characterized as an element of a Grassmann manifold. Precisely, a two-step approach is proposed. As in [45], the joints of the subjects are the focal point to understand actions in a 3D environment. In the first step, the 3D spatial coordinates of the joints are extracted, and the temporal sequences reproducing the movements are built. Each movement sequence (i.e., an action) is represented as a matrix composed of $F$ rows and $T$ columns, where $T$ is the number of frames and $F$ is the number of features per frame. The latter depends on the number of joints considered, and varies depending on the sensor used. An autoregressive and moving average (ARMA) [83] model is applied to the mentioned matrix to capture both spatial and temporal dynamics. At this point, a generic action is represented by the column space of the observability matrix of the model, which can be identified as a Grassmannian point. In the second step, for each group of Grassmannian points belonging to the same action class, the Karcher mean [84] and the control tangent (CT) space are defined. Besides, the machine learning algorithm used in this work is the SVM, which takes as input the representation, named local tangent bundle, of the projection on all the CT spaces of a Grassmannian point that corresponds to an unclassified action. Therefore, an action can be recognized by combining CT spaces. The limits of this method are the dimension of the subspace and the classification of similar actions. Considering the first issue, it should be taken into account that a small dimension of the subspace causes a lack of information, while a bigger dimension tends to produce noise. As to the second issue, it is necessary to highlight that, due to the design of the method, very similar actions (e.g., horizontal arm wave and high arm wave) can be easily misclassified.
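To give an intuition of what treating an action as a Grassmannian point means, the sketch below represents a matrix by an orthonormal basis of its dominant column space and compares two such subspaces through their principal angles. It does not reproduce the ARMA observability matrix, the Karcher mean or the tangent-space projection of [46]: the input matrices are generic placeholders, and the function names are assumptions made for this example.

```python
import numpy as np

def subspace_basis(M, dim):
    """Orthonormal basis of the dominant dim-dimensional column space of M."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim]

def grassmann_distance(A, B):
    """Geodesic distance between the subspaces spanned by orthonormal bases A and B.

    The singular values of A^T B are the cosines of the principal angles.
    """
    sigma = np.linalg.svd(A.T @ B, compute_uv=False)
    theta = np.arccos(np.clip(sigma, -1.0, 1.0))
    return np.linalg.norm(theta)
```

The distance depends only on the spanned subspaces, not on the particular bases, which is why two matrices with the same column space are treated as the same point.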

Chen et al. [47] present a framework for human action recognition, in which the depth feature representation is obtained with the fusion of 2D and 3D auto-correlation of gradient features. More specifically, three depth motion maps (DMM) representing front, side, and top views of a subject are created by using the projection-based algorithm reported in [85]. After the generation of the DMM, the gradient local auto-correlation (GLAC) [86] features are extracted. The next step consists in extracting local relationships among space-time gradients using the STACOG [87] algorithm. The last step consists in performing a weighted fusion, at decision-level, to combine the 2D DMM-based GLAC with the 3D STACOG auto-correlation of gradient features. The machine learning technique adopted is the extreme learning machine (ELM) [88], which is an efficient algorithm for single hidden layer feed-forward neural networks (SLFNs), applied in various application fields [89, 90, 91]. The GLAC and STACOG features, individually, are given as input to two ELM classifiers, and the probability outputs, from each classifier, are combined to generate the outcome (non-negativity and sum-to-one constraints are imposed). Although the result obtained is good, classifier weights are determined and fixed on the basis of training and testing samples, respectively.
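The decision-level fusion step can be sketched as a convex combination of the class-probability vectors produced by two classifiers. This is a minimal illustration under the assumption that the non-negativity and sum-to-one constraints apply to the fusion weights; the function name, the single weight parameter and the example probabilities are hypothetical and are not taken from [47].

```python
import numpy as np

def fuse_posteriors(p_first, p_second, weight=0.5):
    """Weighted decision-level fusion of two class-probability vectors.

    The two fusion weights are weight and 1 - weight, so they are
    non-negative and sum to one whenever weight lies in [0, 1].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    p_first = np.asarray(p_first, dtype=float)
    p_second = np.asarray(p_second, dtype=float)
    fused = weight * p_first + (1.0 - weight) * p_second
    return fused, int(np.argmax(fused))

# Hypothetical outputs of the two classifiers for a three-class problem.
fused, label = fuse_posteriors([0.6, 0.3, 0.1], [0.2, 0.5, 0.3], weight=0.7)
```

Because the weights form a convex combination, the fused vector remains a valid probability distribution, and the predicted class shifts towards the classifier that receives the larger weight.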

Furthermore, the work proposed by Fan et al. [40], merges the RGB and depth data to recognize the human behaviour. A first step consists in a video pre-processing to correctly derive the max outline of the history behavior binary image (MOHBBI) characteristics. For each piece of RGB video and depth video, the visual background extractor (ViBe) algorithm [92] is applied to differentiate the foreground and the background, thus generating binary images. Next, the union operation on each binary image is performed to obtain the correspondent MOHBBI for the RGB and depth image sequences. In order to retain the maximum information of the RGB-MOHBBI and Depth-MOHBBI, and with the aim to remove noise, an intersection operation is performed on the two MOHBBI, thus creating the mixed MOHBBI. A uniform local binary pattern (ULBP) [48] is applied to the latter to extract the local texture features. The same background subtraction and binarization methods are used to process the RGB and depth image sequences and to get the spatial-temporal local texture features. Finally, the 3D image volume is projected on 2D planes in order to extract features that can represent the human activity features in the spatio-temporal domain. Concerning machine learning algorithms, both K-nearest neighbour [93] and HMM are used to detect the activities. Unfortunately, this approach is suitable only for upper limb activities.
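The union and intersection operations on the binary images can be written compactly with logical operators. The sketch assumes that the foreground masks have already been produced by a background subtraction step such as ViBe, which is not implemented here; the function names are chosen for this example only.

```python
import numpy as np

def max_outline(binary_frames):
    """Union (logical OR) over time of binary foreground masks of shape (T, H, W)."""
    return np.logical_or.reduce(np.asarray(binary_frames, dtype=bool), axis=0)

def mixed_outline(rgb_masks, depth_masks):
    """Intersection of the RGB and depth outlines, keeping pixels found in both."""
    return np.logical_and(max_outline(rgb_masks), max_outline(depth_masks))
```

The union accumulates every pixel touched by the subject during the sequence, while the intersection discards pixels detected in only one modality, which is how noise specific to a single sensor is suppressed.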

Zhang and Parker [38], for human activity recognition, propose a new local spatio-temporal feature, called 4D color-depth (CoDe4D), which incorporates both the intensity and depth information acquired by RGB-D cameras. The CoDe4D multichannel feature detector is based on a saliency map, which allows to extract interest points in the xyzt space (i.e., local pose, shape, and texture variations in the 3D spatial dimension xyz and 1D temporal dimension $t$ ). Moreover, the authors implemented a new feature descriptor, called multichannel orientation histogram (MCOH), which is capable of encoding both colour and depth information, and adapts the support region size to the visual linear perspective variations. Before extracting features, three major noise sources are removed: color-depth misalignment, fixed by adjusting the intrinsic parameters of the camera; improper auto white balance, fixed by applying the histogram equalization over the RGB images to reduce the white balance fluctuation; and depth sensing defect, adjusted by using the erosion and dilation operators to remove noisy pixels and small structures in depth images. Then, a hole filling is applied to black regions, by using a morphological reconstruction [94], in order to estimate the depth for those pixels with missing depth values. At the end of the process, the image gradients of the color-depth patch regions are computed and quantized by using a spherical coordinate-based method to form a final feature vector. Finally, in order to recognize the activities, the BoW and SVM methods are used. A limit of this approach is that, depending on the environment, the weight of the color-depth mixture must be properly tuned to achieve a good recognition performance.

Wang and Wang [95] propose a two-stream recurrent neural network (RNN) model, based on long short-term memory (LSTM) units [96], in order to classify different actions by processing both temporal dynamics and spatial configurations of 3D skeleton joints. The temporal stream, consisting of joints coordinates at different time steps, tracks the movements of such joints. The spatial stream, instead, by casting the spatial graph of articulated skeletons into a sequence of joints, displays a spot of the visual form of skeletal data. To model the temporal dynamics, the authors examined two different LSTM architectures: a stacked one and a hierarchical one. The hierarchical LSTM structure – in which the skeleton given as input is divided into five sub-parts – was found to be more efficient and suitable to process the motions of the different sub-parts as well as the whole body. The simple stacked LSTM was used to model the spatial configurations. Because of the two-stream structure of the whole model, the combination of softmax class posteriors from the two streams results in the final action class. The limit of this method is the lack of an informative joints selection mechanism. Not all joints are meaningful in the action analysis process, and useless ones can introduce noise leading to a worse performance for the action recognition system.

In order to overcome the limits of previous works, Liu et al. [97] describe a method to recognize actions based on 3D skeletal data and a global context-aware attention LSTM (GCA-LSTM) network. The GCA-LSTM is a new two-layer LSTM model designed to identify, frame by frame, the informative skeleton joints for a specific action class by using global contextual information. Given the skeleton sequence as input, the first layer produces the initial global context information memory. Such memory is used by the second layer, for each frame, to select informative skeleton joints and to refine the global contextual information. Therefore, the context information is fed to the network at all steps and refined progressively. In this way, if a new input is important for the action analysed, the network saves more information; otherwise, the network blocks it. Finally, when the skeleton sequence is completed, the global contextual information is given as input to a softmax classifier for action class prediction. The limit of this method is related to the number of attention mechanism iterations. In fact, the authors observed that too many iterations led to performance degradation.

Yang et al. [98] describe a novel convolutional neural network (CNN) [99] based method with an attention mechanism for human action recognition in videos. Given 3D skeletal data, the attention mechanism is used to select important joints according to the specific action. The skeleton is represented by using a depth-first tree traversal order rather than by chaining joints in a fixed order. This should allow the semantic meaning of skeleton images and structural information to be better preserved. The authors propose a global long sequence attention network (GLAN) to model spatio-temporal key stages and filter out unreliable joint predictions. Moreover, they introduce a sub-sequence attention network (SSAN) to adjust spatio-temporal aspect ratios and better learn long-term dependencies. This method is limited in handling incomplete or inaccurately estimated poses.

Focusing on the on-line action recognition problem, Liu et al. [100] introduce a framework based on 3D skeleton sequence streams combined with a deep learning strategy. A dilated version of a CNN, called scale selection network (SSNet), is used to model the motion dynamics in the temporal dimension by exploiting a sliding window. This type of network learns the temporal window scale for each time step dynamically, rather than using a fixed scale window. Besides recognizing the action class, in order to identify the performed part of the ongoing action, the proposed method regresses the temporal distance to the starting point of the current action instance. In this way, at the next temporal step (i.e., next frame), this value can be used as the temporal window scale for action label prediction, trying to suppress the possible incoming interference from previous actions. The convolution layers in the temporal dimension are useful to model the motion dynamics within each perception window, such that in SSNet different layers correspond to different temporal scales. Therefore, at each time step, the convolutional layer covering the window scale most similar to the one regressed at the previous step is selected. The activated layer is then used for action recognition. The limit of this method is that the average action prediction accuracy decreases in the ending stage of some action instances.

In their recent work, Bourouis et al. [101] propose a traffic monitoring system based on a 3D car model recognition. They developed a Bayesian inference-based framework by exploiting scaled Dirichlet mixture models. The authors chose the Bayesian learning to deal with uncertainty by introducing prior information on parameters. Generally, the methods implemented for traffic monitoring require useful approaches for data clustering and modelling. Moreover, they should have good generalization capabilities and maintain a large amount of information about the data. Relying on the idea that the detection and tracking of vehicles can lead to achieve better performance in traffic control, the authors – for the proposed method – firstly focus on car detection and tracking and, secondly, on traffic scene monitoring. For car detection, initially, SIFT vectors are extracted from images representing cars using the difference-of-Gaussians (DOG) interest point detector [102]. Subsequently, such vectors are quantized through the k-means clustering in order to construct a visual vocabulary, then each object representing a car is represented by a frequency histogram over the constructed visual words. Finally, a probabilistic latent semantic analysis (pLSA) is applied in order to represent all images, and Bayesian classifiers are applied to identify cars by following Bayes’s decision rule. For the tracking process, the authors apply the weighted scaled Dirichlet mixture models after normalizing the pixels values. Since this approach is based on car detection, tracking and recognition, it can be limited as it may miss useful visual features.
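The vocabulary construction and histogram steps can be sketched as follows. The example uses generic descriptor vectors in place of real SIFT features and a basic k-means loop; the function names, the iteration count and the normalization of the histogram to unit sum are assumptions made for this illustration, and the pLSA and Bayesian classification stages are not covered.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Basic k-means; returns the k cluster centers (the visual vocabulary)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None] - centers[None]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            if np.any(labels == j):              # keep empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bow_histogram(descriptors, vocabulary):
    """Frequency histogram of the nearest visual word for each descriptor."""
    d2 = ((descriptors[:, None] - vocabulary[None]) ** 2).sum(axis=2)
    words = np.argmin(d2, axis=1)
    counts = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return counts / counts.sum()
```

Each image is thus reduced to a fixed-length vector whose size equals the vocabulary size, regardless of how many key points were detected in it.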

More recently, Zhang et al. [103] propose a method for skeleton-based human action recognition in videos, enhancing the feature representation capability. This enhancement is obtained by introducing a semantic-guided neural network (SGN) that exploits high-level semantics of joints. Given a skeleton sequence, each joint is described by its type and frame index together with 3D coordinates and velocity. All the information is processed by the proposed network, which consists of a joint-level module and a frame-level module. The former, based on a graph convolutional network (GCN) [104], is used to model the correlations among joints in the same frame. The latter, based on CNNs, is used to handle correlations across frames by merging the information of all joints in a frame. Finally, the action classification is performed with the last softmax layer. Essentially, the cooperation of different body parts is required to perform an action, but only some of them play key roles. This method at the joint-level stage enables adaptive graph construction to model the skeleton but lacks indicators of centrality to identify key information for each action.

4.2 Methods based on image description features

This section presents some recent key works that use image description features for event recognition.

4.2.1 SIFT/SURF and Canny ED

Cheng et al. [51] propose an event recognition method to model temporal dependencies in the data, at a sub-event level, without using event annotations. In detail, an input video is divided into segments with a fixed length. From each video segment, several features are extracted by using the BoW method on the motion SIFT (MoSIFT) [105] features. The latter differs from the original SIFT in the descriptor, which combines the standard SIFT descriptor with the HOF approach. Subsequently, the video segments are grouped with the k-means algorithm into visual words, and a word is assigned to each one. In this way, a video is represented as a sequence of visual words. Since a visual word of an event can be statistically correlated to a visual word of another event distant in time, a sequence memoizer (SM) [106], that is, a hierarchical non-parametric Bayesian model without depth constraints, is used to model these interactions. In particular, the SM is applied to each visual word in order to obtain a more robust and efficient approach. As to the classification of events, several multi-class SVMs are trained for each type of event. Finally, to identify the fine-detailed temporal structures in data, the number of clusters used to generate the visual sequences increases with the complexity of the scene. Unfortunately, this value must be determined empirically in the experiments performed.

Wang and Ji [66, 67] propose a deep hierarchical context model for event recognition. In particular, the model learns the features, semantics and prior aspects of a context. With regard to features, the appearance and interaction features are extracted from the event neighbourhood (i.e., area around the event bounding box). In particular, the appearance features capture the appearance of non-target objects within the event neighbourhood. This capture is performed by using the SIFT key points and the related descriptors, which are then encoded with the BoW method. The interaction features capture the interactions between the event objects and the contextual objects. The SIFT key points are also used to detect the event objects. Then, the k-means algorithm is used with the key points in both the event bounding box and the event neighbourhood to generate a dictionary. The dictionary is represented by a 2D histogram, used to capture the co-occurrence frequency of its words, which are both inside and outside the event bounding box among frames. The semantic level captures the interactions among events, persons and objects. These interactions are modelled as a network composed of three layers. The top layer is a label vector that provides information on which event belongs to which class. The bottom layer consists of two vectors, one for objects and one for persons. The middle layer connects the object, person and event units to capture the interaction between them. Finally, the prior level uses two types of prior contexts: the scene priming and the dynamic cueing. The former refers to scene information, such as location (e.g., parking lot or shop) and time (e.g., noon or dark), used to dictate whether certain events occur. Instead, the dynamic cueing provides temporal information for the current event prediction, given a previous event. 
The events are identified by using a six-layer model, which contains units relating to target and contextual measurements, person and object observation, event and context features, interactions, event labels, scene states and scene observation. A limit of this model is that it is designed for two interacting entities, typically one person and one object, and it cannot be directly applied to other scenarios where more than two entities are involved.

Ribeiro et al. [107] describe the use of a deep convolutional autoencoder (CAE) [108] for the anomaly detection task, with the aim to capture the 2D structure of image sequences during the learning stage. The working idea is that a CAE is capable of learning regular events in videos, and the reconstruction error of each frame can be used as an anomaly score. The anomaly recognition problem is presented as a binary classification task: one class is human-defined, whereas the other one is an anomaly class. The proposed method uses high-level appearance or motion features combined with input frames and, subsequently, the authors evaluate which type of features, aggregated to the input data, performs better. Appearance features are extracted by using the Canny ED to detect discontinuities (i.e., sets of points), where image brightness changes sharply, organized as edges. Motion features, extracted through the optical flow, are used to describe the displacement or speed related to the distance that a pixel covers between two subsequent frames. Features and frames are mixed to generate different scenes as input data for the CAE. The events therein are classified by using the reconstruction error. A limit of this approach is that the more spatially complex the video is, the harder it is for the CAE to classify anomalies. According to the study, appearance features are more effective for the task than the motion ones, when combined with input data.
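The use of the reconstruction error as an anomaly score can be illustrated without training a deep network. In the sketch below, a PCA projection plays the role of a linear autoencoder fitted on regular frames only; this substitution, the function names and the number of components are assumptions made for the example, and the model is far simpler than the CAE of [107].

```python
import numpy as np

def fit_linear_autoencoder(normal_frames, components=2):
    """Fit a PCA subspace on regular frames; returns (mean, basis).

    PCA is used here as a linear stand-in for a trained autoencoder.
    """
    X = np.asarray(normal_frames, dtype=float).reshape(len(normal_frames), -1)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:components]

def anomaly_scores(frames, mean, basis):
    """Per-frame squared reconstruction error; higher means more anomalous."""
    X = np.asarray(frames, dtype=float).reshape(len(frames), -1) - mean
    recon = X @ basis.T @ basis        # encode, then decode
    return np.sum((X - recon) ** 2, axis=1)
```

Frames resembling the training data are reconstructed almost exactly, whereas frames containing unseen structure leave a large residual, so thresholding the score yields the binary decision.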

4.2.2 HOG and HOF

The work proposed by Wu and Hu [34] focuses on capturing crucial motion patterns by employing latent models for human action recognition. In detail, an event is defined for each class of action. These events are modelled from a temporal segment of the video by using the temporal pyramid model (TPM) [109], which allows low-level features to be extracted from the video as dense trajectories [110]. These are obtained by sampling points on multiple spatial scales and tracking them over time. The trajectories are represented with five descriptors: the shape of the trajectory, the HOG, the HOF and the motion boundary histogram (MBH) [111], the latter computed separately for the horizontal and vertical components of the optical flow. In order to detect the crucial motion patterns, the authors use a multi-class SVM with latent variables, solved iteratively by using the concave-convex procedure [112]. An issue of this method is that the classification performance strongly depends on the optimization parameters.

Chen and Zhang [50] use another type of low-level features: the improved trajectories features (ITF) [113], which are an improved version of the dense trajectories for action recognition. The feature descriptors remain the same as in [34]; the difference lies in isolating and removing the background trajectories and the camera motion. First of all, the homography between two consecutive frames is estimated by using SURF features and RANSAC [114]. Then, the optical flow is warped with the estimated homography, and the motion descriptors are computed on the warped optical flow. Once the feature descriptors are extracted, their dimensionality is reduced by principal component analysis (PCA) [115] to speed up the computation, and then they are encoded with the Fisher vector (FV) [116] to improve their representation. The next step consists in constructing the clusters of the trajectories that could be part of an action. This is done by adopting the hierarchical divisive clustering algorithm [117]. Also in this work, an SVM is used to recognize events. A limit of this method is that it does not allow the real-time processing of the video.

Wang and Snoussi [52] introduce the histogram of optical flow orientation (HOFO) [118] as a feature descriptor for abnormal event recognition in videos. First of all, the HOF is used to extract the low-level features. Next, the HOFO descriptor is computed over dense and overlapping grids of spatial blocks, with optical flow orientation features extracted at a fixed resolution and gathered into a high-dimensional feature vector that represents the movement information of the frame. The descriptor is calculated for each block and then accumulated into one global vector for each frame. The analysis of the HOFO blocks allows the authors to model the interaction among the motions of the local blocks. The classification of events is performed with a one-class SVM, which takes the HOFO descriptors as input to obtain the support vectors used for the on-line frame classification. An event is identified as abnormal if it is detected over a fixed number of frames. However, if the size of the block is too small, the SVM is not robust and can report an abnormal situation when a movement occurs in an empty environment.
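A block-wise orientation histogram in the spirit of the descriptor described above can be sketched as follows. This is a simplified illustration, not the exact HOFO formulation of [52]: the dense optical flow is assumed to be precomputed, and the block size, stride, number of bins, magnitude weighting and per-block L2 normalization are all assumptions made for the example.

```python
import numpy as np

def block_orientation_histograms(flow: np.ndarray, block: int = 8,
                                 stride: int = 4, bins: int = 8) -> np.ndarray:
    """Concatenate magnitude-weighted flow-orientation histograms of overlapping blocks.

    flow: (H, W, 2) array holding the (dx, dy) displacement of each pixel.
    Returns a single 1D feature vector for the frame.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx) % (2 * np.pi)           # orientation in [0, 2*pi)
    # Quantize orientations; the clamp guards against rounding up to `bins`.
    idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    height, width = mag.shape
    feats = []
    for y in range(0, height - block + 1, stride):   # stride < block gives overlap
        for x in range(0, width - block + 1, stride):
            b_idx = idx[y:y + block, x:x + block].ravel()
            b_mag = mag[y:y + block, x:x + block].ravel()
            hist = np.bincount(b_idx, weights=b_mag, minlength=bins)
            norm = np.linalg.norm(hist)
            feats.append(hist / norm if norm > 0 else hist)
    return np.concatenate(feats)
```

The resulting per-frame vector is the kind of input that a one-class classifier could consume; note that an all-zero block (no motion) yields an all-zero histogram, which is one reason why very small blocks can make the downstream classifier fragile.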

Khan et al. [119] present a solution to detect anomalies in a crowd. The input video sequences are processed frame-by-frame, and each frame is divided into super-pixels. For each resulting super-pixel, a HOF is extracted to model the dominant motion direction. Subsequently, the histograms from consecutive frames are combined to obtain the final features for the video. For the anomaly detection task, the authors propose a univariate Gaussian discriminant analysis with the k-means algorithm, combined with a linear SVM for classification. In the presence of redundant information, as in most real surveillance videos where an event of interest occurs very rarely, some false positives are expected.

4.2.3 Deep features

Ijjina and Chalavadi [54] present a key pose based method to recognize actions by using motion features computed from RGB videos and depth sequences. The authors chose motion features as they are useful to emphasize human activities in different temporal regions, improving the discrimination among different actions. In their work, the authors use a CNN-based model to extract motion features, referred to as convnet features. The temporal templates, used to condense the motion sequence into a single image, are the classical MHI and the motion energy image (MEI), computed as the weighted sum of motion data in a video, where the differences between frames are used to compute the motion among them. In detail, temporal templates are computed separately for both RGB and depth streams and, subsequently, given as input to two different CNNs for motion feature extraction. The features obtained are the inputs of an ELM classifier used to predict the action class for the video. The limit of this method is that the temporal templates are sensitive to the angle of view; therefore, the presented approach can be less effective if used on unconstrained videos.
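The two temporal templates can be sketched with the classical frame-differencing formulation. This is a minimal illustration, not the pipeline of [54]: the difference threshold, the default duration `tau` and the choice of accumulating the MEI over the whole clip are assumptions made for the example, and the function name is hypothetical.

```python
import numpy as np

def temporal_templates(frames, tau=None, diff_threshold=30.0):
    """Compute a motion history image (MHI) and a motion energy image (MEI).

    frames: (T, H, W) grayscale sequence.
    tau:    duration (in frames) for which motion is remembered; defaults to T - 1.
    Returns (mhi, mei): mhi is scaled to [0, 1], mei is a boolean mask.
    """
    frames = np.asarray(frames, dtype=float)
    n_frames = frames.shape[0]
    tau = n_frames - 1 if tau is None else tau
    mhi = np.zeros(frames.shape[1:], dtype=float)
    mei = np.zeros(frames.shape[1:], dtype=bool)
    for t in range(1, n_frames):
        # A pixel is "moving" when consecutive frames differ enough.
        moving = np.abs(frames[t] - frames[t - 1]) > diff_threshold
        # Recent motion gets the highest value; older motion decays by one per frame.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
        # The MEI records where motion occurred at any time in the clip.
        mei |= moving
    return mhi / max(tau, 1), mei
```

In the MHI, brighter pixels correspond to more recent motion, whereas the MEI only records where motion happened. Both are single images, which is what makes them convenient inputs for a 2D CNN.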

Yeung et al. [55] describe a method capable of capturing multiple simultaneous actions within the same video sequence by annotating the video with multiple dense labels. Moreover, the authors handle temporal interactions between consecutive actions with a new multiLSTM model that features temporally-extended input and output connections. The design of this LSTM model allows the predictions to be refined in retrospect, namely after processing more and more input frames from the beginning. Even if the same temporally-extended features can be achieved by using a bi-directional RNN (BRNN) [120], the proposed LSTM improvement can be used in on-line settings with short time lags. Just as for the previous work, the features are convnet features; they are extracted with the 16-layer VGG CNN [121] pre-trained on ImageNet [122] and fine-tuned, at the single-frame level, on the proprietary dataset. The resulting features are the inputs for the multiLSTM network used to annotate the video densely. The limit of this method is related to the soft-attention input-output temporal context mechanism implemented for dense labelling: it does not improve the output predictions significantly, due to the increased fragility when the attention is close to the output, without network layers in between to add robustness.

Girdhar et al. [123], instead, propose a network based on both context information and attention mechanisms for action recognition. Their architecture is similar to Faster R-CNN, with base and head networks. The former extracts features and region proposals for the humans present in the video by using 3D convolution layers. The latter aggregates contextual information, predicts the actions and regresses a bounding box by using the resulting features of each region. Besides, the implemented attention mechanisms learn to emphasize hands and faces, which can be crucial to discriminate an action, without explicit supervision other than boxes and class labels. The limit of the proposed network architecture is related to the number of humans detected, as the authors found that performance decreases when people are added to the scene.

Psychophysical experiments indicate that humans can recognize and predict events from videos by observing hand movements during the preparation and execution of actions (i.e., before and after contact with an object). It seems that the visual information in the initial steps of an action is sufficient for observers to understand it. Based on the idea that humans constantly update their belief concerning observed and future events, Fermüller et al. [56] propose a method capable of classifying dexterous actions in terms of dynamics and forces, also predicting the effects of forces on objects. The proposed method is therefore composed of two tasks: the first one is the action prediction from videos; the second one is the estimation of the tactile signal of the action. As to action prediction, visual features are extracted from the video by using a pre-trained 16-layer VGG CNN and, subsequently, they are given as input to an LSTM model. As to the recognition of an ongoing action, given the video sequence, the classification is updated frame-by-frame, instead of assigning a single action label to the whole video. As to the tactile signal, the authors use an LSTM model modified to estimate the hand forces for each frame; therefore, the input data are video sequences with force measurements as ground truth. Just as for the action prediction task, the visual features from the video are obtained with a pre-trained CNN applied to image patches. The limit of the presented model is that it performs worse with objects that different subjects can move in widely varying ways.

Recently, Soltanian and Ghaemmaghami [39] exploited transfer learning by applying image-trained descriptors at the frame level of video sequences. The objective is to make the event recognition task possible in scenarios with limited computational resources. The authors propose a CNN-based event recognition method that leverages both a hierarchical post-processing of the CNN concept scores and a concept-wise power-law normalization. A CNN, pre-trained on ImageNet, is fine-tuned using a subset of training video frames in order to obtain CNN frame descriptors. First of all, the resulting descriptors are post-processed to improve the mid-level video representation, taking into account the hierarchy and relative shortest distance of concepts in the WordNet concept tree. Secondly, a new concept-wise power-law normalization, in which different normalizations are applied to different features according to their statistics on training data, is used to improve recognition accuracy. Since the method uses only spatial descriptors, its limit emerges when temporal cues become significant for the event recognition.

Lee and Lee [124], focusing on cost efficiency for practical applications, propose a complex human activity recognition method, suitable for partially observed videos, based on pre-trained deep representations. The underlying idea is that, if a video is partially observed, a good representation of the given video is more important than the temporal dynamics of the actual activity. In detail, the complex ongoing action is represented by using publicly available pre-trained CNNs to extract frame-level image features, while considering the pairwise interactions between individuals and their participation ratio in the overall scene. The relationship between objects is constructed by exploiting not only the appearance of the object, but also both the local and global motion activations. Once the descriptor is created, an SVM is used to predict the activity label. Experiments have shown that the proposed method works better on interactions between moving objects than on those involving non-moving ones.

Differently, Shri and Jothilakshmi [168] exploit a CNN for crowd video event classification. They propose the use of a CNN for frame-by-frame video content exploration and deep feature extraction. In this way, the crowd event classification can be performed on key-frames, improving runtime performance at low cost. The CNN was trained on 4,000 frames of only four categories; therefore, the method needs to be tested on larger crowd event datasets.

Pang et al. [175], instead, propose a self-trained deep ordinal regression network for unsupervised anomaly detection in videos, consisting of a CNN for feature extraction and a fully connected neural network for anomaly scoring. Applying the self-trained ordinal regression allows the use of a weakly supervised strategy that leads to end-to-end learning. The method is weakly supervised because the video frames are initially pseudo-classified, as normal or abnormal, by using generic unsupervised anomaly detectors, and the CNN is a ResNet-50 [183] model pre-trained on relevant auxiliary data. The pseudo-classified frames are the input for the proposed network, which iteratively updates and enhances itself by recomputing the anomaly scores and updating the pseudo-classification accordingly, in a self-training fashion. However, this method does not consider motion information, so the set of detectable anomalies is limited.

Doshi and Yilmaz [177] also recently focused on unsupervised anomaly detection, but for traffic monitoring applications. Given a traffic video, they consider the presence of stationary vehicles an anomalous event. The proposed method consists of three main modules: a preprocessing module, a candidate selection module and a backtracking anomaly detection module. The preprocessing stage detects and classifies stationary objects in the video by using the you only look once (YOLO) [184] model pre-trained on the MS-COCO dataset. Because the authors are interested only in vehicles, the candidate selection stage aims to remove the misclassified objects, by using the nearest neighbour approach, and to select potential anomalous regions, by using a clustering-based strategy. Finally, the backtracking anomaly detection stage computes the structural similarity (SSIM) measurement of each region of interest between its onset time and each time instance from the start of the video. If the SSIM is high, then a stationary vehicle appears in the frame. If this measure increases over several frames, then the time instance where the SSIM crosses a certain threshold is declared as the onset time of the anomaly. However, this method is not capable of learning different types of anomalies.
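The backtracking step can be illustrated with the following minimal sketch. It is not the implementation of [177]: the SSIM is computed here globally on the whole region (a single window) rather than with the usual sliding window, the constants follow the standard SSIM definition for a given data range, and the threshold value and function names are assumptions made for the example.

```python
import numpy as np

def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale patches."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast/structure term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def backtrack_onset(region_frames, reference, threshold=0.8):
    """Return the index of the first frame whose region resembles the reference.

    region_frames: sequence of patches cropped at the candidate region,
                   ordered from the start of the video.
    reference:     patch of the same region showing the stationary vehicle.
    Returns None when the similarity never reaches the threshold.
    """
    for t, patch in enumerate(region_frames):
        if global_ssim(patch, reference) >= threshold:
            return t
    return None
```

For instance, if the region shows an empty road in the first two frames and the stopped vehicle from the third frame onwards, `backtrack_onset` returns index 2 as the onset of the anomaly.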

Zhang et al. [179] exploit end-to-end learning and a weakly supervised scheme for collective activity recognition. The method is weakly supervised because it uses only the bounding boxes of the people detected in the videos and the activity labels. The authors propose a fast deep learning architecture that jointly models the person detection and collective activity recognition tasks, so that they reinforce each other. Indeed, such tasks share the feature extraction part, aiming to reduce the computational cost and to remove people not involved in the activity execution. Therefore, the network architecture consists of two main components: person detection and collective activity reasoning. The former is a custom version of the region proposal network (RPN) introduced in Faster R-CNN [185], which returns the regions of interest representing the detected persons and the scene bounding boxes. The latter uses the visual features of the people and the scene as latent variable embeddings, updating them by using a mean-field like procedure to capture richer person interactions, which are classified with a softmax function. This method exploits only visual information to perform the classification, although motion information could also be useful to improve collective activity recognition.

5. Discussion

This section provides observations on the types of features and machine learning algorithms currently used for the automatic event recognition task. Moreover, it contains two summary tables that provide a concise overview of some central aspects (i.e., features, algorithms and datasets) concerning the works in Section 4. The discussion aims to help the reader choose the best combination of features and algorithms to develop a specific ER system.

By analysing the most important and recent works regarding the current state-of-the-art of event recognition, it is possible to observe that trajectories and motion vectors are used to manage a wide range of events. These features can be computed suitably and easily on almost every type of moving subject (e.g., people, vehicles or animals), without high computational costs. Minimizing computational costs is very important for an ER system implementation, especially if it is designed for real-time use, such as monitoring public (e.g., streets or squares) and private (e.g., military zones or warehouses) areas. However, even if the analysis of trajectories or motion features is widely used to understand automatically the events in which rigid subjects are involved (e.g., vehicles), for some applications it could be interesting to analyse such subjects also at a finer granularity (e.g., fault tolerance tests). In general, when an ER system is designed for non-rigid subjects, it is a good idea to consider the information of both the whole subject and the set of its sub-parts. The greater availability of information concerning the movements of non-rigid subjects increases the set of events that can be recognized, but also the complexity of the recognition task. If, for example, an ER system is only capable of processing information about the whole body of one subject, only a small set of simple events can be recognized. Processing all the information concerning the interaction among several subjects, instead, allows different types of events to be recognized, including the type of relationship between two subjects and the attitude in a crowd of people. Increasing the complexity of the ER system, in order to process sub-parts of the human body (e.g., hands or arms), allows a greater number of subject behaviours to be classified, including emotional states (e.g., anxiety, happiness or fear).
Based on the abovementioned observations, in recent years there has been an increasing interest in depth sensor technology, through which it is possible to obtain the skeleton of human subjects easily. The possibility of working with dynamic information (e.g., position, speed or acceleration) of specific points of the human body (i.e., skeleton joints) is a real step forward in the automatic interpretation of human behaviours. From a technical point of view, it is not of particular interest to identify which kind of sensor has been used to obtain the skeleton model of the subjects that act within an area of interest, since currently this task can be achieved by different technologies, including RGB cameras. However, during the implementation of a real ER system, several aspects and related limits have to be considered. For example, a single depth sensor works well in an indoor environment, and the information acquired is sufficient to create the skeleton model of the subjects involved. On the other hand, two (or more) depth sensors cannot be placed facing each other, because this would generate interference. Besides, in general, the spatial resolution of these sensors is lower than that of RGB sensors. Finally, consumer depth sensors commonly perform poorly in outdoor environments due to their sensitivity to solar light. Differently, RGB sensors work well in outdoor environments but, in general, they require calibration phases and the reconstruction of skeletons, which, in real contexts, are neither simple nor guaranteed. However, the selection of the most suitable sensors is a matter that depends on several factors, including the properties of the scene, the types of events to be recognized, the number of subjects, and so on.
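As a simple illustration of the dynamic information that can be derived from skeleton joints, the sketch below estimates velocity, speed and acceleration from a sequence of 3D joint positions by finite differences. The array layout, the frame rate and the function name are assumptions made for the example; real skeleton data are usually noisy and would normally be smoothed before differentiation.

```python
import numpy as np

def joint_kinematics(joints: np.ndarray, fps: float = 30.0):
    """Estimate per-joint velocity, speed and acceleration.

    joints: (T, J, 3) array with the 3D position of J joints over T >= 2 frames.
    fps:    frame rate of the sequence (assumed constant).
    Returns (velocity, speed, acceleration) with shapes (T, J, 3), (T, J), (T, J, 3).
    """
    dt = 1.0 / fps
    # Central differences in the interior, one-sided differences at the borders.
    velocity = np.gradient(joints, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    # Speed is the magnitude of the velocity vector of each joint.
    speed = np.linalg.norm(velocity, axis=2)
    return velocity, speed, acceleration
```

For a joint moving at constant velocity, the estimated speed is constant and the estimated acceleration is approximately zero, which gives a quick sanity check for the input layout.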

Regarding the image descriptors, their use for event recognition has grown considerably in recent years. This is due to the fact that most of the image descriptors on RGB videos are invariant to scale, rotation, translation and, partially, to illumination changes. Another advantage of image descriptors is that, with modern hardware, they can be computed in real-time, allowing the design and implementation of reactive ER systems. Besides, they can be used to recognize any type of subject and to extract several types of features (e.g., trajectories, patterns or salient points). Some drawbacks of image descriptors are related to the quality of the processed images and to the matching algorithm. The image descriptors approach can work correctly only if the source images have high spatial resolution without significant noise or distortions. Moreover, the source images should represent elements, for both foreground and background, with rich details in order to build a robust set of descriptors. Since each descriptor has a very complex structure, the matching process, which aims to identify correctly each pair of corresponding descriptors between two consecutive video frames, could be affected by different errors. In any case, this approach can achieve remarkable results through an accurate pre-processing phase and a proper use of the descriptors, by taking into account the specific goals of the application context.

In the reported works, the most used algorithms for event recognition are clustering, SVM and neural networks, while the most used features are 3D data and deep features. The clustering approach is generally used to associate the actions identified or, in some cases, the estimated poses of subjects, with the ones previously learned by the ER system during the training step. The SVM approach is usually applied both in single-class and in multi-class methods. Besides, it is also routinely used to check whether the movements identified in the scenario trigger one or more events. Moreover, there has been an increase in the use of neural networks, as this type of approach can be applied to binary or multi-class classification tasks, in either a supervised or an unsupervised way. In the supervised learning paradigm, neural networks require a large amount of training data to obtain a complex model that recognizes events well. In the case of skeleton-based methods, neural network models usually seem a natural choice for processing motion dynamics in the temporal domain and joint dependencies in the spatial domain.

Table 2 reports the types of recognizable events with reference works and datasets used for evaluation. As shown, the MSR collection and NTU RGB+D datasets are the most used by the approaches that make extensive use of 3D data. Instead, classical approaches that process RGB videos typically use the UCF50 and HMDB51 datasets. Finally, Table 3 correlates application fields, detectable subjects, used features, and applied machine learning algorithms. This table is of particular interest because it highlights the set of features and algorithms used to treat several application fields and, for each of them, a set of selected works that the reader can consider immediately in order to explore the related topic. Moreover, useful information is given by the connection between the set of features and the related set of machine learning algorithms used.

6. Conclusions

In recent years, the increased use of video sensor networks has promoted the development of smart algorithms capable of understanding the semantic content of videos. These algorithms aim to understand automatically the sequence of actions performed by different subjects within the observed scene. These actions represent simple or complex events, depending on the type of interaction among the subjects involved. This survey provides formal definitions of scene and event, reasoning on event effects, and the description of a logical architecture for a generic ER system. Moreover, it describes the state-of-the-art works that, in our opinion, are cutting-edge in the event recognition task. It also presents two taxonomies used to describe the selected key works concerning the event recognition task. The first taxonomy classifies the types of features used to describe events, while the second one classifies the machine learning algorithms used to recognize events. This survey also highlights the different limits of the works reported and provides the list of datasets used to evaluate the ER systems. Finally, the last two tables summarize some key aspects of the ER methods discussed.

Acknowledgments

This work was partially supported by both the ONRG project N62909-20-1-2075 “Target Re-Association for Autonomous Agents” (TRAAA) and MIUR under grant “Departments of Excellence 2018–2022” of the Department of Computer Science of Sapienza University.

References

1. Hamreras, Boucheham, Molina-Cabello, Benítez-Rochel, López-Rubio. Content based image retrieval by ensembles of deep learning object classifiers. Integrated Computer-Aided Engineering. 2020; 27(3): 317–331.

2. Liang. Image-based post-disaster inspection of reinforced concrete bridge systems using deep learning with Bayesian optimization. Computer-Aided Civil and Infrastructure Engineering. 2019; 34(5): 415–430.

3. Guo, Polanía, Zhu, Boncelet, Barner. Graph Neural Networks for Image Understanding Based on Multiple Cues: Group Emotion Recognition and Event Recognition as Use Cases. In: IEEE Winter Conference on Applications of Computer Vision (WACV); 2020. pp. 2910–2919.

4. Yan, Zhang, Xie. An optimizer ensemble algorithm and its application to image registration. Integrated Computer-Aided Engineering. 2019; 26(4): 311–327.

5. Sovetkin, Steland. Automatic processing and solar cell detection in photovoltaic electroluminescence images. Integrated Computer-Aided Engineering. 2019; 26(2): 123–137.

6. Mishra, Piciarelli, Foresti. A neural network for image anomaly detection with deep pyramidal representations and dynamic routing. International Journal of Neural Systems. 2020; 30(10): 2050060.

7. Thurnhofer-Hemsi, López-Rubio, Roé-Vellvé, Molina-Cabello. Multiobjective optimization of deep neural networks with combinations of lp-norm cost functions for 3D medical image super-resolution. Integrated Computer-Aided Engineering. 2020; 27(3): 233–251.

8. Leming, Górriz, Suckling. Ensemble deep learning on large, mixed-site fMRI datasets in autism and other tasks. International Journal of Neural Systems. 2020; 30(7): 2050012.

9. Hua, Wang, Liu, Khalid. A novel method of building functional brain network using deep learning algorithm with application in proficiency detection. International Journal of Neural Systems. 2019; 29(1): 1850015.

10. Feng, Halm-Lutterodt, Tang, Mecum, Mesregah, et al. Automated MRI-based deep learning model for detection of Alzheimer’s disease process. International Journal of Neural Systems. 2020; 30(6): 2050032. PMID: 32498641.

11. Lozano, Suárez, Soto-Sánchez, Garrigós, Martínez-Alvarez, Ferrández, et al. Neurolight: a deep learning neural interface for cortical visual prostheses. International Journal of Neural Systems. 2020; 30(9): 2050045.

12. Luo, Yang, Cao. Capturing and understanding workers’ activities in far-field surveillance videos with deep action recognition and Bayesian nonparametric learning. Computer-Aided Civil and Infrastructure Engineering. 2019; 34(4): 333–351.

13. Shin, Cho. 3D-convolutional neural network with generative adversarial network and autoencoder for robust anomaly detection in video surveillance. International Journal of Neural Systems. 2020; 30(6): 2050034.

14. Kulkarni, Jadhav, Adhikari. In: A Survey on Human Group Activity Recognition by Analysing Person Action from Video Sequences Using Machine Learning Techniques. Springer Singapore; 2020. pp. 141–153.

15. Shi, Zhang. A novel unsupervised approach to discovering regions of interest in traffic images. Pattern Recognition. 2015; 48(8): 2581–2591.

16. Luo, Zhou, Cao. Combining deep features and activity context to improve recognition of activities of workers in groups. Computer-Aided Civil and Infrastructure Engineering. 2020; 35(9): 965–978.

17. Cai, Xue. Self-adapted optimization-based video magnification for revealing subtle changes. Integrated Computer-Aided Engineering. 2020; 27(2): 173–193.

18. Zhang, Rajan, Story. Concrete crack detection using context-aware deep semantic segmentation network. Computer-Aided Civil and Infrastructure Engineering. 2019; 34(11): 951–971.

19. Zhang, Chen. Zernike-moment measurement of thin-crack width in images enabled by dual-scale deep learning. Computer-Aided Civil and Infrastructure Engineering. 2019; 34(5): 367–384.

20. Benito-Picazo, Domínguez, Palomo, López-Rubio. Deep learning-based video surveillance system managed by low cost hardware and panoramic cameras. Integrated Computer-Aided Engineering. 2020; 27(4): 373–387.

21. Xie, Sundaram, Campbell. Event mining in multimedia streams. Proceedings of the IEEE. 2008; 96(4): 623–647.

22. Jaad, Abdelghany. Modeling urban growth using video prediction technology: a time-dependent convolutional encoder-decoder architecture. Computer-Aided Civil and Infrastructure Engineering. 2020; 35(5): 430–447.

23. Micheloni, Snidaro, Foresti. Exploiting temporal statistics for events analysis and understanding. Image and Vision Computing. 2009; 27(10): 1459–1469.

24. Lai, Liu, Chen, Chang. Recognizing Complex Events in Videos by Learning Key Static-Dynamic Evidences. In: Proceedings of the 13th European Conference on Computer Vision; 2014. pp. 675–688.

25. Nurwidyantoro, Winarko. Event detection in social media: A survey. In: International Conference on ICT for Smart Society; 2013. pp. 1–5.

26. D’Orazio, Guaragnella. A survey of automatic event detection in multi-camera third generation surveillance systems. International Journal of Pattern Recognition and Artificial Intelligence. 2015; 29(1): 1–29.

27. Zhang, Zhong, Lei, Yang, et al. A comprehensive survey of vision-based human action recognition methods. Sensors. 2019; 19(5): 1005.

28. Artikis, Sergot, Paliouras. An event calculus for event recognition. IEEE Transactions on Knowledge and Data Engineering. 2015; 27(4): 895–908.

29. Kowalski, Sergot. A logic-based calculus of events. New Generation Computing. 1986; 4: 67–95.

30. Toyama, Hager. A two level approach for scene recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2005. pp. 688–695.

31. Stauffer, Grimson WEL. Adaptive background mixture models for real-time tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 1999. pp. 246–252.

32. You, Tan, Kawakami, Mukaigawa, Ikeuchi. Adherent raindrop modeling, detection and removal in video. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2016; 38(9): 1721–1733.

33. Tripathi, Mukhopadhyay. Efficient fog removal from video. Signal, Image and Video Processing. 2014; 8: 1431–1439.

34. Wu, Hu. Learning effective event models to recognize a large number of human actions. IEEE Transactions on Multimedia. 2014; 16(1): 147–158.

35. Mahadevan, Bhalodia, Vasconcelos. Anomaly Detection in Crowded Scenes. In: IEEE Conference on Computer Vision and Pattern Recognition; 2010. pp. 1975–1981.

36. Moayedi, Azimifar, Boostani. Structured sparse representation for human action recognition. Neurocomputing. 2015; 161: 38–46.

37. Luo, Wang. Spatio-temporal feature extraction and representation for RGB-D human action recognition. Pattern Recognition Letters. 2014; 50: 139–148.

38. Zhang, Parker. CoDe4D: color-depth local spatio-temporal features for human activity recognition from RGB-D videos. IEEE Transactions on Circuits and Systems for Video Technology. 2016; 26(3): 541–555.

39. Soltanian, Ghaemmaghami. Hierarchical concept score postprocessing and concept-wise normalization in CNN-based video event recognition. IEEE Transactions on Multimedia. 2019; 21(1): 157–172.

40. Fan, Tian, Wang, Ming, Shi, Jin. 3D human behavior recognition based on spatiotemporal texture features. In: Proceedings of the 8th International Conference on Human System Interaction; 2015. pp. 350–356.

41. Munaro, Fossati, Basso, Menegatti, Van Gool. One-Shot Person Re-identification with a Consumer Depth Camera. In: Gong, Cristani, Yan, Loy, eds. Person Re-Identification. London: Springer London; 2014. pp. 161–181.

42. Sun, Zhao, Gao. Modeling and recognizing human trajectories with beta process hidden Markov models. Pattern Recognition. 2015; 48(8): 2407–2417.

43. Denman, Reddy, Fookes, Sridharan. Real-time video event detection in crowded scenes using (MPEG) derived features: a multiple instance learning approach. Pattern Recognition Letters. 2014; 44: 113–125.

44.

Guo

Feng

. Spatio-temporal context analysis within video volumes for anomalous-event detection and localization. Neurocomputing. 2015; 155: 309–319.

45.

Chen

Clarke

Giuliani

Gaschler

Knoll

. Combining unsupervised learning and discrimination for 3D action recognition. Signal Processing. 2015; 110: 67–81.

46.

Slama

Wannous

Daoudi

Srivastava

. Accurate 3D action recognition using learning on the Grassmann manifold. Pattern Recognition. 2015; 48(2): 556–567.

47.

Chen

Zhang

Hou

Jiang

Liu

Yang

. Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features. Multimedia Tools and Applications. 2017; 76: 4651–4669.

48.

Ming

Wang

Fan

. Uniform local binary pattern based texture-edge feature for 3D human behavior recognition. PLoS ONE. 2015; 10(5): 1–19.

49.

Canny

. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1986; 8(6): 679–698.

50.

Chen

Zhang

. Cluster trees of improved trajectories for action recognition. Neurocomputing. 2016; 173: 364–372.

51.

Cheng

Fan

Pankanti

Choudhary

. Temporal Sequence Modeling for Video Event Detection. In: IEEE Conference on Computer Vision and Pattern Recognition; 2014. pp. 2235–2242.

52.

Wang

Snoussi

. Detection of abnormal visual events via global optical flow orientation histogram. IEEE Transactions on Information Forensics and Security. 2014; 9(6): 988–998.

53.

Dalal

Triggs

. Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition; 2005. pp. 886–893.

54.

Ijjina

Chalavadi

. Human action recognition in RGB-D videos using motion sequence information and deep learning. Pattern Recognition. 2017; 72: 504–516.

55.

Yeung

Russakovsky

Jin

Andriluka

Mori

Fei-Fei

. Every moment counts: dense detailed labeling of actions in complex videos. International Journal of Computer Vision. 2018; 126: 375–389.

56.

Fermüller

Wang

Yang

Zampogiannis

Zhang

Barranco

, et al. Prediction of manipulation actions. International Journal of Computer Vision. 2018; 126: 358–374.

57.

Guo

Liu

Oerlemans

Lao

Lew

. Deep learning for visual understanding: a review. Neurocomputing. 2016; 187: 27–48.

58.

Vishwanathan

Sun

Theera-Ampornpunt

Varma

. Multiple Kernel Learning and the SMO Algorithm. In: Proceedings of the 23rd International Conference on Neural Information Processing Systems. Vol. 2; 2010. pp. 2361–2369.

59.

Freund

Schapire

. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences. 1997; 55(1): 119–139.

60.

Maron

Ratan

. Multiple-Instance Learning for Natural Scene Classification. In: Proceedings of the 15th International Conference on Machine Learning; 1998. pp. 341–349.

61.

Attal

Boubezoul

Oukhellou

Espi

. Powered two-wheeler riding pattern recognition using a machine-learning framework. IEEE Transactions on Intelligent Transportation Systems. 2015; 16(1): 475–487.

62.

Cai

Wang

Chen

Jiang

. Trajectory-based anomalous behaviour detection for intelligent traffic surveillance. IET Intelligent Transport Systems. 2015; 9(8): 810–816.

63.

Schmidhuber

. Deep learning in neural networks: an overview. Neural Networks. 2015; 61: 85–117.

64.

Goodfellow

Bengio

Courville

. Deep Learning. MIT Press; 2016.

65.

Reyes

Ventura

. Performing multi-target regression via a parameter sharing-based deep network. International Journal of Neural Systems. 2019; 29(9): 1950014.

66.

Wang

. Hierarchical context modeling for video event recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017; 39(9): 1770–1782.

67.

Wang

. Video event recognition with deep hierarchical context model. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015. pp. 4418–4427.

68.

Hjort

. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics. 1990; 18(3): 1259–1294.

69.

Bunke

. Hidden Markov Models: Applications in Computer Vision. World Scientific; 2001.

70.

Gilks

. Encyclopedia of Biostatistics. John Wiley & Sons Inc.; 2005.

71.

Tudor

. MPEG-2 video compression. Electronics & Communication Engineering Journal. 1995; 7(6): 257–264.

72.

Kanungo

Mount

Netanyahu

Piatko

Silverman

. An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002; 24(7): 881–892.

73.

Cong

Yuan

Liu

. Sparse Reconstruction Cost for Abnormal Event Detection. In: IEEE Conference on Computer Vision and Pattern Recognition; 2011. pp. 3449–3456.

74.

Zen

Ricci

. Earth mover’s prototypes: A convex learning approach for discovering activity patterns in dynamic scenes. In: IEEE Conference on Computer Vision and Pattern Recognition; 2011. pp. 3225–3232.

75.

Hung

Kulkarni

Kuo

. A new weighted fuzzy c-means clustering algorithm for remotely sensed image classification. IEEE Journal of Selected Topics in Signal Processing. 2011; 5(3): 543–553.

76.

Sivic

Zisserman

. Efficient visual search of videos cast as text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2009; 31(4): 591–606.

77.

Mairal

Bach

Ponce

Sapiro

. Online Dictionary Learning for Sparse Coding. In: Proceedings of the 26th International Conference on Machine Learning; 2009. pp. 689–696.

78.

Bobick

Davis

. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2001; 23(3): 257–267.

79.

Braspenning

Thuijsman

Weijters

. Artificial neural networks: An introduction to ANN theory and practice. Springer-Verlag Berlin Heidelberg; 1995.

80.

Hyvärinen

Hurri

Hoyer

. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer-Verlag; 2009.

81.

Hinton

Osindero

Teh

. A fast learning algorithm for deep belief nets. Neural Computation. 2006; 18(7): 1527–1554.

82.

Zou

Yeung

. Learning Hierarchical Invariant Spatio-temporal Features for Action Recognition with Independent Subspace Analysis. In: IEEE Conference on Computer Vision and Pattern Recognition; 2011. pp. 3361–3368.

83.

Doretto

Chiuso

Soatto

. Dynamic textures. International Journal of Computer Vision. 2003; 51: 91–109.

84.

Srivastava

Klassen

Joshi

Jermyn

. Shape analysis of elastic curves in Euclidean spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2011; 33(7): 1415–1428.

85.

Chen

Liu

Kehtarnavaz

. Real-time human action recognition based on depth motion maps. Journal of Real-Time Image Processing. 2016; 12: 155–163.

86.

Kobayashi

Otsu

. Image Feature Extraction Using Gradient Local Auto-Correlations. In: Proceedings of the 10th European Conference on Computer Vision; 2008. pp. 346–358.

87.

Kobayashi

Otsu

. Motion recognition using local auto-correlation of space-time gradients. Pattern Recognition Letters. 2012; 33(9): 1188–1195.

88.

Huang

Zhu

Siew

. Extreme learning machine: theory and applications. Neurocomputing. 2006; 70(1–3): 489–501.

89.

Chen

Zhang

Wang

. Land-use scene classification using multi-scale completed local binary patterns. Signal, Image and Video Processing. 2016; 10: 745–752.

90.

Chen

. Local binary patterns and extreme learning machine for hyperspectral imagery classification. IEEE Transactions on Geoscience and Remote Sensing. 2015; 53(7): 3681–3693.

91.

Chen

Zhou

Guo

. Gabor-Filtering-Based Completed Local Binary Patterns for Land-Use Scene Classification. In: IEEE International Conference on Multimedia Big Data; 2015. pp. 324–329.

92.

Barnich

Droogenbroeck

. ViBe: a universal background subtraction algorithm for video sequences. IEEE Transactions on Image Processing. 2011; 20(6): 1709–1724.

93.

Altman

. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician. 1992; 46(3): 175–185.

94.

Gonzalez

Woods

. Digital Image Processing. Prentice Hall; 2007.

95.

Wang

. Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. In: IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 3633–3642.

96.

Hochreiter

Schmidhuber

. Long short-term memory. Neural Computation. 1997; 9(8): 1735–1780.

97.

Liu

Wang

Duan

Kot

. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 3671–3680.

98.

Yang

Luo

. Action recognition with spatio-temporal visual attention on skeleton image sequences. IEEE Transactions on Circuits and Systems for Video Technology. 2019; 29(8): 2405–2415.

99.

Lecun

Bottou

Bengio

Haffner

. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998; 86(11): 2278–2324.

100.

Liu

Shahroudy

Wang

Duan

Kot Chichung

. Skeleton-based online action prediction using scale selection network. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2020; 42(6): 1453–1467.

101.

Bourouis

Laalaoui

Bouguila

. Bayesian frameworks for traffic scenes monitoring via view-based 3D cars models recognition. Multimedia Tools and Applications. 2019; 78: 18813–18833.

102.

Lowe

. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 2004; 60: 91–110.

103.

Zhang

Lan

Zeng

Xing

Xue

Zheng

. Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.

104.

Kipf

Welling

. Semi-Supervised Classification with Graph Convolutional Networks. In: International Conference on Learning Representations (ICLR); 2017.

105.

Hendaoui

Abdellaoui

Douik

. Synthesis of spatio-temporal interest point detectors: Harris 3D, MoSIFT and SURF-MHI. In: Proceedings of the 1st International Conference on Advanced Technologies for Signal and Image Processing; 2014. pp. 89–94.

106.

Wood

Archambeau

Gasthaus

James

Teh

. A Stochastic Memoizer for Sequence Data. In: Proceedings of the 26th International Conference on Machine Learning; 2009. pp. 1129–1136.

107.

Ribeiro

Lazzaretti

Lopes

. A study of deep convolutional auto-encoders for anomaly detection in videos. Pattern Recognition Letters. 2018; 105: 13–22.

108.

Masci

Meier

Cireşan

Schmidhuber

. Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction. In: Artificial Neural Networks and Machine Learning; 2011. pp. 52–59.

109.

Liu

Shah

. Learning human actions via information maximization. In: IEEE Conference on Computer Vision and Pattern Recognition; 2008. pp. 1–8.

110.

Wang

Kläser

Schmid

Liu

. Action recognition by dense trajectories. In: IEEE Conference on Computer Vision and Pattern Recognition; 2011. pp. 3169–3176.

111.

Dalal

Triggs

Schmid

. Human Detection Using Oriented Histograms of Flow and Appearance. In: Proceedings of the 9th European Conference on Computer Vision; 2006. pp. 428–441.

112.

Yuille

Rangarajan

. The concave-convex procedure. Neural Computation. 2003; 15(4): 915–936.

113.

Wang

Schmid

. Action Recognition with Improved Trajectories. In: IEEE International Conference on Computer Vision; 2013. pp. 3551–3558.

114.

Fischler

Bolles

. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM. 1981; 24(6): 381–395.

115.

Jolliffe

. Principal Component Analysis. Springer-Verlag; 2002.

116.

Sánchez

Perronnin

Mensink

Verbeek

. Image classification with the Fisher vector: theory and practice. International Journal of Computer Vision. 2013; 105: 222–245.

117.

Gaidon

Harchaoui

Schmid

. Activity representation with motion hierarchies. International Journal of Computer Vision. 2014; 107(3): 219–238.

118.

Wang

Snoussi

. Histograms of Optical Flow Orientation for Visual Abnormal Events Detection. In: Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance; 2012. pp. 13–18.

119.

Khan

Park

Kyung

. Rejecting motion outliers for efficient crowd anomaly detection. IEEE Transactions on Information Forensics and Security. 2019; 14(2): 541–556.

120.

Schuster

Paliwal

. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing. 1997; 45(11): 2673–2681.

121.

Simonyan

Zisserman

. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv. 2014.

122.

Deng

Dong

Socher

Kai

Fei-Fei

. ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition; 2009. pp. 248–255.

123.

Girdhar

Carreira

Doersch

Zisserman

. Video Action Transformer Network. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019. pp. 244–253.

124.

Lee

. Prediction of partially observed human activity based on pre-trained deep representation. Pattern Recognition. 2019; 85: 198–206.

125.

Wang

Grimson

. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In: IEEE Conference on Computer Vision and Pattern Recognition; 2008. pp. 1–8.

126.

Nascimento

Figueiredo

Marques

. Trajectory classification using switched dynamical hidden Markov models. IEEE Transactions on Image Processing. 2010; 19(5): 1338–1348.

127.

Wang

Grimson

. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2009; 31(3): 539–555.

128.

Denman

Sridharan

Fookes

. SAIVT-QUT@ TRECVid 2012: Interactive surveillance event detection. In: TREC Video Retrieval Evaluation Workshop Proceedings. National Institute of Standards and Technology (NIST); 2012. pp. 1–8.

129.

Shi

Jia

. Abnormal Event Detection at 150 FPS in MATLAB. In: IEEE International Conference on Computer Vision; 2013. pp. 2720–2727.

130.

Mehran

Oyama

Shah

. Abnormal crowd behavior detection using social force model. In: IEEE Conference on Computer Vision and Pattern Recognition; 2009. pp. 935–942.

131.

Stein

McKenna

. User-adaptive Models for Recognizing Food Preparation Activities. In: Proceedings of the 5th International Workshop on Multimedia for Cooking & Eating Activities; 2013. pp. 39–44.

132.

Stein

McKenna

. Combining Embedded Accelerometers with Computer Vision for Recognizing Food Preparation Activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing; 2013. pp. 729–738.

133.

Stein

McKenna

. Recognising complex activities with histograms of relative tracklets. Computer Vision and Image Understanding. 2017; 154: 82–93.

134.

Gorelick

Blank

Shechtman

Irani

Basri

. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2007; 29(12): 2247–2253.

135.

Schuldt

Laptev

Caputo

. Recognizing Human Actions: A Local SVM Approach. In: Proceedings of the 17th International Conference on Pattern Recognition; 2004. pp. 32–36.

136.

Rodriguez

Ahmed

Shah

. Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2008. pp. 1–8.

137.

Soomro

Zamir

. Action Recognition in Realistic Sports Videos. In: Moeslund

Thomas

Hilton

, eds. Computer Vision in Sports. Springer International Publishing; 2014. pp. 181–208.

138.

Reddy

Shah

. Recognizing 50 human action categories of web videos. Machine Vision and Applications. 2013; 24: 971–981.

139.

Shahroudy

Liu

Wang

. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In: IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 1010–1019.

140.

Yun

Honorio

Chattopadhyay

Berg

Samaras

. Two-person interaction detection using body-pose features and multiple instance learning. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops; 2012. pp. 28–35.

141.

Escalera

Baró

Gonzàlez

Bautista

Madadi

Reyes

, et al. ChaLearn Looking at People Challenge 2014: Dataset and Results. In: European Conference on Computer Vision Workshops; 2015. pp. 459–473.

142.

Escalera

Gonzàlez

Baró

Reyes

Lopes

Guyon

, et al. Multi-modal Gesture Recognition Challenge 2013: Dataset and Results. In: Proceedings of the 15th ACM on International Conference on Multimodal Interaction; 2013. pp. 445–452.

143.

Zhang

Liu

. Action recognition based on a bag of 3D points. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2010. pp. 9–14.

144.

Xia

Chen

Aggarwal

. View invariant human action recognition using histograms of 3D joints. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2012. pp. 20–27.

145.

Ellis

Masood

Tappen

LaViola Jr

Sukthankar

. Exploring the trade-off between accuracy and observational latency in action recognition. International Journal of Computer Vision. 2013; 101: 420–436.

146.

Soomro

Zamir

Shah

. UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. CRCV-TR-12-01. 2012.

147.

Fabian Caba Heilbron

Escorcia

Niebles

. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015. pp. 961–970.

148.

Jiang

Chang

Ellis

Loui

. Consumer video understanding: a benchmark database and an evaluation of human and machine performance. Proceedings of ACM International Conference on Multimedia Retrieval. 2011; 29: 1–8.

149.

Jiang

Yang

Ngo

Hauptmann

. Representations of keypoint-based semantic concept detection: a comprehensive study. IEEE Transactions on Multimedia. 2010; 12(1): 42–53.

150.

Hospedales

Xiang

Gong

. Learning multimodal latent attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2014; 36(2): 303–316.

151.

Kurakin

Zhang

Liu

. A real time system for dynamic hand gesture recognition with a depth sensor. In: Proceedings of the 20th European Signal Processing Conference; 2012. pp. 1975–1979.

152.

Lan

Xing

Zeng

Yuan

Liu

. Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks. In: Proceedings of the 14th European Conference on Computer Vision; 2016. pp. 203–220.

153.

Chunhui

Yueyu

Yanghao

Sijie

Jiaying

. PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding. ACM Multimedia workshop. 2017.

154.

Bloom

Makris

Argyriou

. G3D: A gaming action dataset and real time action recognition evaluation framework. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops; 2012. pp. 7–12.

155.

Kong

Jia

. Learning Human Interaction by Interactive Phrases. In: Proceedings of the 12th European Conference on Computer Vision; 2012. pp. 300–313.

156.

Hoogs

Perera

Cuntoor

Chen

Lee

, et al. A large-scale benchmark dataset for event recognition in surveillance video. In: IEEE Conference on Computer Vision and Pattern Recognition; 2011. pp. 3153–3160.

157.

Ryoo

Aggarwal

. UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA); 2010. http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html.

158.

Ofli

Chaudhry

Kurillo

Vidal

Bajcsy

. Berkeley MHAD: A comprehensive Multimodal Human Action Database. In: IEEE Workshop on Applications of Computer Vision; 2013. pp. 53–60.

159.

Cheng

Qin

Huang

Tian

. Human Daily Action Analysis with Multi-view and Color-Depth Data. In: Proceedings of the 12th European Conference on Computer Vision; 2012. pp. 52–61.

160.

Wang

Liu

Yuan

. Mining actionlet ensemble for action recognition with depth cameras. In: IEEE Conference on Computer Vision and Pattern Recognition; 2012. pp. 1290–1297.

161.

Kristan

Matas

Leonardis

Vojir

Pflugfelder

Fernandez

, et al. A novel performance evaluation methodology for single-target trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2016; 38(11): 2137–2155.

162.

Laptev

Marszalek

Schmid

Rozenfeld

. Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition; 2008. pp. 1–8.

163.

Over

Awad

Michel

Fiscus

Sanders

Kraaij

, et al. TRECVID 2012: an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: Proceedings of the TREC Video Retrieval Evaluation Workshop; 2012. pp. 1–58.

164.

Kuehne

Jhuang

Garrote

Poggio

Serre

. HMDB: A large video database for human motion recognition. In: Proceedings of the 13th IEEE International Conference on Computer Vision; 2011. pp. 2556–2563.

165.

Marszalek

Laptev

Schmid

. Actions in context. In: IEEE Conference on Computer Vision and Pattern Recognition; 2009. pp. 2929–2936.

166.

Niebles

Chen

Fei-Fei

. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. In: Proceedings of the 11th European Conference on Computer Vision; 2010. pp. 392–405.

167.

Ferryman

Shahrokni

. PETS2009: Dataset and challenge. In: IEEE International Workshop on Performance Evaluation of Tracking and Surveillance; 2009. pp. 1–6.

168.

Shri

Jothilakshmi

. Crowd video event classification using convolutional neural network. Computer Communications. 2019; 147: 35–39.

169.

Carletti

Foggia

Percannella

Saggese

Vento

. Recognition of human actions from RGB-D videos using a reject option. In: New Trends in Image Analysis and Processing; 2013. pp. 436–445.

170.

Foggia

Percannella

Saggese

Vento

. Recognizing Human Actions by a bag of visual words. In: IEEE International Conference on Systems, Man and Cybernetics; 2013. pp. 2910–2915.

171.

Song

Demirdjian

Davis

. Tracking body and hands for gesture recognition: NATOPS aircraft handling signals database. In: Face and Gesture; 2011. pp. 500–506.

172.

Sun

Ross

Vondrick

Pantofaru

, et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018. pp. 6047–6056.

173.

Liu

Shahroudy

Perez

Wang

Duan

Kot

. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2020; 42(10): 2684–2701.

174.

Zheng

Lai

Zhang

. Jointly Learning Heterogeneous Features for RGB-D Activity Recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015.

175.

Pang

Yan

Shen

Hengel

Bai

. Self-Trained Deep Ordinal Regression for End-to-End Video Anomaly Detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020.

176.

Adam

Rivlin

Shimshoni

Reinitz

. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2008; 30(3): 555–560.

177.

Doshi

Yilmaz

. Fast Unsupervised Anomaly Detection in Traffic Videos. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; 2020. pp. 2658–2664.

178.

Naphade

Wang

Anastasiu

Tang

Chang

Yang

, et al. The 4th AI City Challenge. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; 2020.

179.

Zhang

Tang

Zheng

. Fast collective activity recognition under weak supervision. IEEE Transactions on Image Processing. 2020; 29: 29–43.

180.

Wongun

Shahid

Savarese

. What are they doing: Collective activity classification using spatio-temporal relationship among people. In: IEEE 12th International Conference on Computer Vision Workshops; 2009. pp. 1282–1289.

181.

Choi

Shahid

Savarese

. Learning context for collective activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2011. pp. 3273–3280.

182.

Ibrahim

Muralidharan

Deng

Vahdat

Mori

. A Hierarchical Deep Temporal Model for Group Activity Recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 1971–1980.

183.

Zhang

Ren

Sun

. Deep Residual Learning for Image Recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 770–778.

184.

Redmon

Divvala

Girshick

Farhadi

. You Only Look Once: Unified, Real-Time Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 779–788.

185.

Ren

Girshick

Sun

. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017; 39(6): 1137–1149.

Table 2
Summary table of recognizable events with reference works and datasets used