Abstract
In recent years, the spread of video sensor networks both in public and private areas has grown considerably. Smart algorithms for video semantic content understanding are increasingly developed to support human operators in monitoring different activities, by recognizing events that occur in the observed scene. With the term event, we refer to one or more actions performed by one or more subjects (e.g., people or vehicles) acting within the same observed area. When these actions are performed by subjects that do not interact with each other, the events are usually classified as simple. Instead, when any kind of interaction occurs among subjects, the involved events are typically classified as complex. This survey starts by providing the formal definitions of both scene and event, and the logical architecture for a generic event recognition system. Subsequently, it presents two taxonomies based on features and machine learning algorithms, respectively, which are used to describe the different approaches for the recognition of events within a video sequence. This paper also discusses key works of the current state-of-the-art of event recognition, providing the list of datasets used to evaluate the performance of reported methods for video content understanding.
Introduction
Nowadays, video sequences are commonly available from various sensors and are frequently used in different domains. In the field of computer vision, visual-based event recognition systems are attracting great interest in a wide range of application areas, including image processing [1, 2, 3, 4], change detection [5, 6], health care [7, 8, 9, 10, 11], video surveillance [12, 13], and many others. Some of the most important and useful applications are crowd and abnormal behaviour analysis, traffic monitoring, and crime prevention [14, 15]. Video sensor networks are typically used as tools that allow human operators to observe areas of interest remotely. The technological advancement of video cameras in terms of spatial resolution (e.g., 2K or 4K), temporal resolution (e.g., 30 fps or 60 fps), type of sensor (e.g., RGB or RGB-D) and movement (i.e., pan, tilt, and zoom) has led the computer vision community to design smart algorithms capable of understanding the semantic content of videos [16, 17] and of actively assisting human operators in scene monitoring [18, 19, 20]. Understanding video content means discerning the behaviour of moving subjects that interact with each other or with motionless elements in the observed scene. The analysis of unmoving objects is relevant because the same gesture can correspond to different actions depending on such elements. In any case, automatic event recognition remains an ambitious and complex task, due to both the challenge of correctly interpreting the actions performed by the observed subjects and the presence of disturbing factors (e.g., occlusions or illumination changes). The latter are not a focus of the present survey and are therefore not addressed in this work. However, the limits of the state-of-the-art methods discussed are reported.
The analysis of action sequences performed by subjects within an observed scene is the first step towards automatically understanding the different events that may occur. Formally, in the literature, events are defined as real-world occurrences unfolding over space and time, yielded by subjects in the scene [21, 22]. With regard to event recognition in videos, events can be generally defined as one or more actions performed by subjects within the scene during a certain time interval. Some examples of events are: a person opening a door, two persons shaking hands, a car driving along a road, and so on. Depending on the number of subjects involved and the type of interactions, events can be classified as simple or complex [23, 24]. Simple events are defined as a set of primitive actions (e.g., walking or sitting) performed by one subject. Complex events are defined as a series of simple events, spatially and temporally correlated, that lead two or more subjects to interact with each other. The recognition of events is generally performed by using two well-known branches of computer science: computer vision and machine learning. The former makes it possible to extract relevant information (i.e., features) from the video frames analysed, while the latter makes it possible to learn and, subsequently, infer the semantic meaning of the video by relying on the previously extracted features. Notice that, once trained to understand a specific set of events, a recognition system should always work in real-time. This aspect is a key factor in many application fields, including traffic monitoring and public security.
In the literature, several surveys have explored human behaviour in different contexts, such as interaction with social media (e.g., Twitter or Facebook) [25]. In recent years, methods based on the visual recognition of events triggered by moving subjects have been proposed as well [26, 27]. Nonetheless, the current state-of-the-art lacks several key factors, such as formalizations for both scene and event, a practical key element for those wanting to quickly understand the main aspects of an event recognition system (hereinafter ER system), and good and complete guidelines on the development of smart algorithms for this type of task. Therefore, unlike previous works, the proposed survey introduces two taxonomies based on features and machine learning algorithms, respectively, that classify the current approaches used to handle events in videos. The paper also provides a careful analysis of selected works, summarizing their main characteristics (i.e., features, algorithms, type of events and datasets) in concise tables. Based on a bottom-up approach, the proposed survey starts by formally defining scene and event, continues by reasoning on event effects, and then describes the logical architecture of a generic ER system. In particular, the paper is structured as follows: Section 2 provides formalizations of both scene and event, reasons on event effects, and presents the aforementioned logical architecture of a generic ER system; Section 3 introduces the two taxonomies defined and highlights the role of the considered features and machine learning algorithms; Section 4 provides a set of selected key works in the current state-of-the-art, also summarized with some observations in concise tables; Section 5 discusses the features and machine learning algorithms used for the automatic event recognition task. Finally, Section 6 concludes the paper by providing observations on the contributions of the proposed survey.
Key elements of an event recognition system
This section provides the formal definitions of a scene and of simple and complex events. Moreover, it introduces the formalism to represent events and reason about them and their effects by using the dialect of event calculus for run-time reasoning (RTEC) [28], based on event calculus [29]. With such tools, every ER method can be translated into a formal framework. Finally, it describes a logical reference architecture highlighting the main components of a generic ER system.
Scene definition
Events are triggered by different subjects that act in an environment defined as a scene. A scene can be classified as indoor scene or outdoor scene [30]:
an indoor scene is defined as a covered area of any shape and type. Typical examples of an indoor scene are rooms, hangars, stations, airports, and others, as well as the interior of a bus, a train, an airplane, and so on;
an outdoor scene is defined as an open area of any shape and type. Typical examples of an outdoor scene are port areas, airstrips, outdoor parking lots, runways, border areas, city centres (e.g., squares, streets or parks), and others.
Both classes of scene are described by a set of main properties, including the environment (e.g., natural or artificial) and noise (e.g., intrinsic or extrinsic). Concerning the environment, any type of scene is usually characterized by different static objects that are independent of the moving ones acting in it. In outdoor scenes, some examples are parking booths, street lamps, trees, and others. In indoor scenes, instead, some examples are walls and furniture. In any case, these objects have several attributes (e.g., dimension or shape) that should be taken into account during the design process of an ER system. Concerning noise, in any type of scene, intrinsic noises are defined as any changes that may generally occur. Some examples in outdoor scenes are the movement of a flying flag, the opening and closing of an electric gate, illumination changes, shadows, and others. Similar examples can be reported in indoor scenes. However, this type of noise is usually well handled and does not influence the capability of a system to recognize events. In fact, intrinsic noises are characterized by repetitive and well-known patterns that can be removed by using popular algorithms (e.g., Gaussian models [31]). Extrinsic noises are related to elements that do not strictly belong to the scenario, but which can occur occasionally. Typical examples are uncommon weather conditions such as snow, rain, and fog. Modern computer vision algorithms handle this kind of noise properly [32, 33]. Small moving objects (e.g., insects) that occlude the sensor are also an example of extrinsic noise; this case is still considered an open issue. At each time instant, each scenario can present one or more noise elements related to the types previously introduced.
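The removal of intrinsic noise with a Gaussian model can be illustrated with a minimal sketch. It assumes a single running Gaussian per pixel, which is a simplification and not necessarily the model used in [31]; the class name and the parameters `alpha`, `k`, and `init_var` are hypothetical choices made only for illustration.

```python
import math


class RunningGaussianPixel:
    """Background model for one pixel: a single Gaussian with running mean and variance."""

    def __init__(self, first_value, init_var=25.0, alpha=0.05, k=2.5):
        self.mean = float(first_value)
        self.var = float(init_var)  # hypothetical initial variance
        self.alpha = alpha          # hypothetical learning rate
        self.k = k                  # foreground threshold, in standard deviations

    def update(self, value):
        """Return True if `value` looks like foreground; otherwise adapt the model."""
        diff = value - self.mean
        is_foreground = abs(diff) > self.k * math.sqrt(self.var)
        if not is_foreground:
            # Only background observations update the Gaussian parameters.
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * self.var + self.alpha * diff * diff
        return is_foreground


# Small, repetitive intensity changes (e.g., flicker) are absorbed by the model,
# while a large, sudden change is flagged as foreground.
pixel = RunningGaussianPixel(first_value=100)
flicker = [pixel.update(v) for v in (101, 99, 102, 98, 100)]
intruder = pixel.update(180)
```

In this sketch the small oscillations around the mean are treated as background, whereas the jump to 180 exceeds the threshold and is reported as foreground.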
From a formal point of view, a scene can be defined by three layers that represent three different states of the environment observed. The first layer is the reference scenario (
where
where
The second layer is the background scenario (
where
Besides the static elements, at each time instant, the scenario can include subjects that may interact with the latter or with each other. Formally, the set of subjects can be defined as:
where
where
These concepts make it possible to define the last layer, specifically, the dynamic scenario (
RTEC dialect
Following the formal definition of the three scenarios, the next paragraph provides the formal definition of events. In the state-of-the-art, events are usually divided into simple events and complex events. The reported examples of simple and complex events were extracted from public datasets often used to support the evaluation step of ER systems [34, 35, 36].
Simple events describe a set of primitive actions performed by subjects. In this type of events, subjects perform actions by interacting with the environment, but not with each other. The actions performed by a subject, if temporally and spatially correlated, trigger a simple event. Formally, let
where
Examples of (a)(b) simple events and (c)(d) complex events.
The RTEC dialect, summarized in Table 1, can be used to represent simple events and to model their effects. Considering the riding horse event in Fig. 1b, such event can be formalized as follows:
In the rule Eq. (2.2), the fluent
Complex events can be defined as a set of simple events that are temporally and spatially correlated among each other. Taking into account the time interval
where
Just as for simple events, the RTEC dialect can be used to represent complex events and to model their effects. Considering the hugging event in Fig. 1d, such event can be formalized as follows:
In rule Eq. (2.3), it is possible to notice that, in antecedent blocks of rules, predicates
According to rule Eq. (2.3), two subjects are fencing as long as at least one of them is fighting, the other is inactive, and they are close. The interval manipulation constructs specified in RTEC are useful to define that: for all time-points
Logical architecture of a generic ER system.
Figure 2 shows an overview of the different modules that make up the logical architecture of a generic ER system. The scene acquisition module uses one or more video sensors to acquire the scene. Usually, these sensors are RGB cameras but, depending on the scene, they can also be RGB cameras with depth information (i.e., RGB-D), stereo cameras, infrared (IR) cameras, or a mixture of them all [37, 38]. To manage the recognition of events properly, an ER system should acquire all three scenarios
Subsequently, the feature extraction module is used to extract relevant information frame-by-frame, in order to uniquely describe the different subjects appearing in the scene. Such features can be of different types (e.g., trajectories or histograms) and can be used separately or combined. The extracted features are used in two different modules, namely event learning and event recognition. The former trains the event recognition module, which is updated on the basis of the newest features extracted from the observed scene. This step allows the ER system to recognize an increasingly large set of events. The latter, once trained, classifies the recognized events according to the extracted features. Any additional information about each classified event (e.g., kind of event, level of dangerousness, or number of subjects involved) can be provided to an operator through a human-computer interaction interface. Notice that the dialogue among different modules and, specifically, among those designed to learn and classify events, depends on several factors, including the complexity of the ER system, the number of recognizable classes, the type of video sensors involved, and so on [39, 40]. The in-depth analysis of such factors is not a focus of this paper. However, the architecture highlights that it is necessary to consider two key factors in the implementation of an ER system: the features by which each scene is analysed, and the algorithms through which these features are processed in order to learn and classify the events. The next section provides two taxonomies based on features and machine learning algorithms, respectively.
Taxonomies for event recognition in videos
This section starts by proposing a first taxonomy based on the most used features for event recognition in video sequences. Subsequently, it continues by proposing a second taxonomy based on the most used machine learning algorithms for the same task. These criteria were chosen because features and algorithms are cornerstones of machine learning. The two taxonomies provide the keys to understanding how to describe and recognize actions or interactions performed by subjects in a video. We built such background maps by studying a broad set of selected works in the current literature to summarize the main aspects of the state-of-the-art methods, useful for the reader as a quick overview during the development of smart algorithms for event recognition in videos.
Taxonomy of features
The taxonomy shown in Fig. 4 depicts the different types of features used to describe actions or interactions performed by subjects in a video. Such features are often chosen by taking into account the algorithm used to learn and recognize events and the type of raw data acquired by the specific video sensor. However, the choice mostly depends on the type of event that needs to be classified. Features can be organized into two main sets: spatio-temporal features (STF) and image description features (IDF). Figure 3 shows examples of the different types of features discussed in this section, starting from the original image in Fig. 3a and proposed in [41].
Examples of the different types of features for event recognition: (a) original image, (b) trajectory, (c) depth map, (d) skeleton joints, (e) Canny ED, (f) SURF, (g) SIFT, (h) HOG, (i) deep features.
Taxonomy of the most used features for event recognition in videos.
Spatio-temporal features
The STF set includes all those features describing spatial and temporal relations either among subjects or between subjects and a scene. A feature frequently used is the trajectory of each subject shown in Fig. 3b. This type of feature can be extracted by using the pixel information [42, 34] acquired with a standard RGB sensor, or by using the skeleton information acquired with an RGB-D sensor [37].
A different approach is the use of description vectors. Such vectors can contain any type of data, including axis velocities [43, 42], spatio-temporal volumes [44], and a sequence of poses [36]. Besides, there are approaches based also on 3D data, making use of either local or global spatio-temporal features extracted from depth maps (Fig. 3c), or specific joints and measures of the skeletons (Fig. 3d). In several cases, the approaches mentioned above are also mixed by using standard 2D RGB images [45, 46, 47, 38, 40, 48].
Image description features
The IDF set includes all those features containing information about some interesting points of subjects and, in some cases, of a scene (e.g., edges, corners or others). In the current literature, several works are supported by image descriptors, such as the Canny edge detector (Canny ED) [49] in Fig. 3e, speeded-up robust features (SURF) [50] in Fig. 3f, the scale-invariant feature transform (SIFT) [51] in Fig. 3g, or related modified versions. More specifically, SIFT and SURF feature extractors are used to detect a set of distinguishable local features (i.e., key points) from the images. Such extractors are designed to be reasonably invariant under scale, rotation, and translation of local features. Besides, with the aim of making these extractors robust enough under noise and illumination changes, different approaches adopt them only in regions of images having high contrast levels. Among low-level features, several state-of-the-art works use the histogram of optical flow (HOF) [52], the histogram of oriented gradients (HOG) [53] shown in Fig. 3h, and spatio-temporal gradients [36]. In general, such algorithms provide features similar to SIFT and SURF but, rather than being extracted from high-contrast regions of the image, they are computed on a grid of uniformly spaced cells. To improve accuracy, an overlapping local contrast normalization is usually adopted.
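The grid-based computation behind HOG-like descriptors can be sketched for a single cell as follows. This is a simplified illustration, assuming central-difference gradients, unsigned orientations, hard assignment to bins (no interpolation), and plain L2 normalization; the function names and the 9-bin default are assumptions, not the exact formulation of [53].

```python
import math


def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    # Central differences are only defined for interior pixels of the cell.
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


def l2_normalize(vector, eps=1e-6):
    """Contrast normalization of a (block of) histogram(s)."""
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]


# An 8x8 cell containing a vertical edge: all gradient energy is horizontal,
# so every vote falls into the first orientation bin.
cell = [[0] * 4 + [255] * 4 for _ in range(8)]
hist = cell_orientation_histogram(cell)
normalized = l2_normalize(hist)
```

In a full descriptor, the histograms of neighbouring cells would be concatenated into overlapping blocks before normalization; here a single cell is normalized only to keep the example short.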
Recently, deep learning techniques have often been used to learn discriminative image-based features, as shown in Fig. 3i. These are referred to as deep features [54, 55, 56], and each of them is the consistent output of a unit, within a hierarchical model, for the given input. The depth of these features depends on their position within the hierarchical structure. Moreover, they are non-linear, discriminant and invariant, performing well in image classification and target detection problems [57].
Taxonomy of machine learning algorithms
The taxonomy in Fig. 5 shows the different types of machine learning algorithms usually applied for event recognition in video sequences. Such algorithms can be divided into two main sets: supervised and unsupervised. An additional set concerns the approaches that can be classified either in the supervised or unsupervised set. For completeness, Section 3.2.3 describes some of these methods.
Taxonomy of the most used machine learning algorithms for event recognition in videos.
Supervised approaches
The algorithms belonging to this set require labelled training data or, specifically, the desired output of dataset instances. The support vector machine (SVM) [58, 34, 50, 36] is widely used in event detection and recognition. This type of approach aims to classify the observed data by finding the maximum-margin hyperplane, in the feature space, separating elements belonging to different classes.
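The notion of margin that an SVM maximizes can be made concrete with a small sketch. It does not train an SVM; it only compares the geometric margin of two candidate separating hyperplanes on toy data, all of which (points, labels, and hyperplane parameters) are hypothetical.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance of the labelled points to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Toy 2D data: two linearly separable classes labelled -1 and +1.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

# Both hyperplanes separate the classes, but x + y = 6 lies midway between them,
# while x + y = 5 passes closer to the negative class.
margin_a = geometric_margin((1.0, 1.0), -6.0, points, labels)
margin_b = geometric_margin((1.0, 1.0), -5.0, points, labels)
```

An SVM would prefer the first hyperplane, since its minimum distance to the training points is larger.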
Another supervised method is boosting [59, 45], which combines several weak learners, i.e., learners whose classification result is only slightly better than a random guess, to create a stronger classifier whose classification result is strongly correlated with the ground truth.
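How weak learners are combined into a stronger classifier can be sketched with a minimal AdaBoost implementation over one-dimensional threshold stumps. The stump form, the function names, and the toy dataset are assumptions chosen for illustration; the surveyed works use richer features and learners.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak learner: predicts `polarity` on the left of the threshold, its opposite otherwise."""
    return polarity if x < threshold else -polarity


def fit_adaboost(xs, ys, n_rounds):
    """Train AdaBoost with threshold stumps; returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    ensemble = []
    for _ in range(n_rounds):
        best = None
        # Pick the stump with the lowest weighted training error.
        for thr in thresholds:
            for pol in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, thr, pol) != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Increase the weight of misclassified samples, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single threshold can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
single = fit_adaboost(xs, ys, 1)
ensemble = fit_adaboost(xs, ys, 3)
```

On this toy set a single stump misclassifies three samples, whereas the weighted vote of three stumps classifies all training samples correctly.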
Finally, the multiple instance learning (MIL) [60] can be considered a variation of the supervised learning paradigm. In this case, the algorithms receive a set of labelled instance bags during the training step, rather than a set of instances individually labelled.
Unsupervised approaches
The algorithms belonging to this set do not require any labelled training data since data labels are inferred by the algorithms themselves. A common unsupervised approach, generally used in event recognition, is clustering [44, 51, 43]. Clustering algorithms aim to generate well-defined and homogeneous partitions of a feature set, usually composed of the most salient features extracted from the observed scene.
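As an illustration of how a clustering algorithm partitions a feature set, the following minimal sketch implements Lloyd's k-means iteration. The initial centroids are passed explicitly to keep the example deterministic; the function name and the toy feature vectors are hypothetical.

```python
def kmeans(points, centroids, n_iters=20):
    """Lloyd's algorithm; `centroids` holds the initial guesses."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
            for point in points
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its previous centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, assignment


# Two well-separated groups of 2D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, centroids=[points[0], points[3]])
```

The returned assignment groups the first three and the last three points, without any label being provided.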
Other types of unsupervised algorithms are the probabilistic approaches [61, 62], which can predict the class to which the observed data belong, given an input and a probability distribution for each instance.
Supervised and unsupervised approaches
As mentioned, another set of methods involves the algorithms which can be used either in a supervised or unsupervised way. Some well-known approaches adopted for event recognition, falling within this set, are artificial neural networks (ANNs) [63] and deep learning strategies [64].
ANNs are algorithms inspired by biological neural networks and their hierarchical structure. They consist of layers of connected artificial neurons and are used to learn the relationships between input/output data and patterns [65]. Such neurons are the computational units of the hierarchical model, and each layer represents a level in the hierarchy. There is no limit on the number of layers. Owing to the recent improvement of hardware computational capacity, the addition of more and more layers has led to increasingly deep ANNs.
Deep learning involves the construction of machine learning models that learn hierarchical representations of data within a set of images, including a set of edges, a set of regions with particular properties, and so on. Notice that, in order to solve a given task (e.g., object segmentation or face recognition), some representations are better than others. The capability of deep learning to obtain salient features has also been explored for event recognition, achieving remarkable results [66, 67].
In the following section, the large set of state-of-the-art works that allowed us to define the reported taxonomies is examined in detail.
Research contributions
This section analyses the features and machine learning algorithms taxonomies and discusses the most relevant and recent works for event recognition organized on the basis of the features used. For each work, limits are reported, along with the combination of features and algorithms exploited. Moreover, the reported works are summarized in concise tables showing features and algorithms, recognizable events, datasets and application fields (Tables 2 and 3).
Summary table of recognizable events with reference works and datasets used
Summary table of application fields, detectable subjects, used features and machine learning algorithms
This section describes some recent key works that use the spatio-temporal features to detect and recognize events.
Trajectory
Sun et al. [42] consider the trajectories generated by the sequences of movements of human subjects in the observed scenario for activity recognition. In detail, the authors propose a method for modelling the subjects’ trajectories by combining both beta process (BP) [68] and hidden Markov models (HMM) [69], thus generating a novel approach: the beta process hidden Markov models (BP-HMM). The motions are shared among trajectories using a set of movement labels, which correspond to the movements observed in the scenario, and a binary vector of features for each trajectory. The elements of such vectors are 1, if the trajectory contains the corresponding movement label, and 0 otherwise. The weights of state transition of the HMM are computed by using a Gamma distribution and, subsequently, the same weights are combined with the trajectory feature vector to obtain the transition distribution of the trajectory nodes. The sampling of each trajectory (i.e., the binary vector of features, the state transition of the HMM and the transition distribution) is performed with Markov chain Monte Carlo (MCMC) [70]. Finally, each trajectory is classified by taking the class that maximizes the log-likelihood of the trajectory probability. The limit of such method is given by the misclassification of events composed of flexible and different activities, or the misclassification of events that share motions.
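Two of the ingredients described above, the binary vector of movement labels and the classification by maximum log-likelihood, can be illustrated with a deliberately simplified sketch. It replaces the BP-HMM of [42] with independent Bernoulli probabilities per movement label, which is an assumption made only to keep the example short; the vocabulary, class names, and probabilities are hypothetical.

```python
import math


def binary_movement_vector(trajectory_labels, vocabulary):
    """1 if the trajectory contains the movement label, 0 otherwise."""
    present = set(trajectory_labels)
    return [1 if label in present else 0 for label in vocabulary]


def classify_by_log_likelihood(features, class_models):
    """Pick the class maximizing the log-likelihood of the binary features.

    `class_models` maps a class name to per-label Bernoulli probabilities
    (a stand-in for the trajectory model, not the BP-HMM itself).
    """
    def log_likelihood(probs):
        return sum(math.log(p if f else 1.0 - p) for f, p in zip(features, probs))

    return max(class_models, key=lambda name: log_likelihood(class_models[name]))


vocabulary = ["walk", "bend", "raise_arm", "sit"]
features = binary_movement_vector(["walk", "bend", "walk"], vocabulary)

class_models = {
    "picking_up": [0.9, 0.8, 0.2, 0.1],
    "waving": [0.3, 0.1, 0.9, 0.1],
}
predicted = classify_by_log_likelihood(features, class_models)
```

The sketch also hints at the limit noted above: two classes whose models assign similar probabilities to the same movement labels yield similar log-likelihoods and are easily confused.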
Xu et al. [43] propose a method to detect events in crowded scenes by exploiting MPEG codec [71] to build motion vectors. The MPEG discrete cosine transform (DCT) coefficients are then used to compute a foreground map in order to remove the background trajectories. The following steps consist in transforming the trajectories into the Fourier domain and in quantizing the Fourier representations into visual words by using the k-means algorithm [72]. Besides, the authors propose a MIL algorithm that models the observed scenario as a linear sparse combination of independent events [73, 74], where each event is a distribution of visual words. The limit of this method is that the event of interest must be independent of the background activities.
Description vectors
Li et al. [44] propose a three-step framework for anomalous event detection and localization. The first step consists in dividing the considered video into spatio-temporal volumes of fixed size. From each volume, the spatio-temporal gradients are computed by using a first order differential approximation. Then, two histograms, for each gradient, are computed to be used as features. The extracted features are used as input for the fuzzy weighted c-means clustering algorithm [75], in which the elements are not assigned definitively to a class, but have different degrees of membership for each class. The second step consists in dividing the scenario into cubes, which are spatio-temporal windows of dimensions greater than volumes. In turn, the latter are divided into eight blocks. By using the bag of words (BoW) [76] approach, the membership grades belonging to the same cluster are grouped into the related blocks, thus obtaining a set of local behaviour patterns represented as histograms. Each histogram presenting different values with respect to the non-anomalous local behaviour patterns previously computed can be considered as an anomalous event. In particular, in the last step, to detect the anomalous events, the authors use the sparse coding approach shown in [77]. To quantify the difference between an anomalous and a non-anomalous event, the authors designed a cost criterion of sparse reconstruction. A typical event is characterized by a low reconstruction cost, while an anomalous event has a high reconstruction cost. The limit of this method is the dimension of volumes, which plays a key role in obtaining a satisfactory result in this type of event detection.
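The idea of degrees of membership, as opposed to a definitive assignment to one class, can be sketched with the membership formula of standard fuzzy c-means. This is an assumption: the weighted variant in [75] is not reproduced here, and the function name, the fuzzifier value `m`, and the toy centroids are hypothetical.

```python
import math


def fuzzy_memberships(point, centroids, m=2.0):
    """Standard fuzzy c-means memberships of `point` with respect to each centroid."""
    dists = [math.dist(point, c) for c in centroids]
    if 0.0 in dists:
        # The point coincides with one or more centroids: share membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


centroids = [(0.0, 0.0), (4.0, 0.0)]
near_first = fuzzy_memberships((1.0, 0.0), centroids)
midway = fuzzy_memberships((2.0, 0.0), centroids)
```

A feature close to the first centroid still receives a small, non-zero membership in the second cluster, and a feature equidistant from both is shared equally.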
Moayedi et al. [36] propose an action recognition method in which the actions are represented as a sequence of poses. In particular, each single image (i.e., a frame) belonging to an action sequence is considered as a pose, on which the motion history image (MHI) [78] algorithm is computed. The MHI is a template in which the pixel intensity is a function of how recent the movement is in the sequence. The MHI is computed by taking into consideration the frames immediately to the left and to the right of the one considered. Since the human body shape and the orientation of its parts provide discriminant information in many actions, the set of MHI images is represented as HOG descriptors and stored as a matrix. Next, a dictionary is created by using the BoW model. This dictionary is learned from a weighted linear combination of base vectors, which are obtained from a set of example videos. At this point, the matrix of HOG descriptors is encoded by using a sparse representation of the input data structure. Finally, the sparse matrix is transformed into a histogram set, which represents the weight of each dictionary base in a video sequence. These histograms are the input for an SVM classifier. The limit of this approach is that it loses its real-time property if a descriptor more complex than HOG is used.
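The statement that MHI pixel intensity encodes how recent the movement is can be illustrated with the commonly used update rule, in which moving pixels are set to a maximum value and all others decay by one. This sketch assumes simple differencing of consecutive frames and does not reproduce the variant of [36] based on the neighbouring frames on both sides; `tau` and `diff_threshold` are hypothetical parameters.

```python
def update_mhi(mhi, prev_frame, frame, tau=5, diff_threshold=30):
    """One MHI update: moving pixels are set to tau, the others decay towards zero."""
    return [
        [
            tau if abs(cur - prev) > diff_threshold else max(0, old - 1)
            for old, prev, cur in zip(mhi_row, prev_row, cur_row)
        ]
        for mhi_row, prev_row, cur_row in zip(mhi, prev_frame, frame)
    ]


# A bright blob moving from left to right across a 1x4 image.
frames = [
    [[255, 0, 0, 0]],
    [[0, 255, 0, 0]],
    [[0, 0, 255, 0]],
    [[0, 0, 0, 255]],
]
mhi = [[0, 0, 0, 0]]
for prev_frame, frame in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev_frame, frame)
```

After the last update, the pixels where motion occurred most recently hold the highest values, while older motion has partially faded.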
3D data
Regarding the use of depth data, Chen et al. [45] propose a 3D action recognition method which combines unsupervised feature learning with the extraction of spatio-temporal features from unlabelled video data. In detail, two types of features are used: the local independent subspace analysis feature (LISA) and the global independent subspace analysis feature (GISA). The advantage of using these features is that they are both invariant to the translation of the human body and robust to noise. The features are computed by extracting the depth subvolumes in a neighbourhood of the skeleton joints. Each subvolume is then normalized and whitened to reduce the input dimensionality. The next step consists in treating a subvolume as a sequence of depth images that are flattened and arranged in a vector. The latter represents the input of an independent subspace analysis (ISA) neural network [79, 80], which is trained similarly to the other algorithms described in [81, 82]. After the learning phase, two histograms (i.e., one for LISA features and one for GISA features) are associated with each joint by adopting a BoW approach. Since some joints are more important than others in recognizing specific actions, the authors used the AdaBoost method [59] to understand which joints are more relevant in the different types of actions. In detail, the proposed algorithm, named EnMkl (ensemble multi-kernel learning), boosts a set of multiple kernel learning support vector machines (MKLSVM) [58], and provides as output the combination between the movements of the acquired subject and the related actions, such as jogging, object throwing, and hand clapping. The limit of this method is the difficulty in recognizing actions that share similar movements.
Slama et al. [46] address the problem of action recognition from a purely geometrical point of view, where an action observability matrix is characterized as an element of a Grassmann manifold. Precisely, a two-step approach is proposed. As for [45], the joints of the subjects are the focal point to understand actions in a 3D environment. In the first step, the 3D spatial coordinates of the joints are extracted, and the temporal sequences reproducing the movements are built. Each movement sequence (i.e., an action) is represented as a matrix composed of
Chen et al. [47] present a framework for human action recognition, in which the depth feature representation is obtained with the fusion of 2D and 3D auto-correlation of gradient features. More specifically, three depth motion maps (DMM) representing front, side, and top views of a subject are created by using the projection-based algorithm reported in [85]. After the generation of the DMM, the gradient local auto-correlation (GLAC) [86] features are extracted. The next step consists in extracting local relationships among space-time gradients using the STACOG [87] algorithm. The last step consists in performing a weighted fusion, at decision-level, to combine the 2D DMM-based GLAC with the 3D STACOG auto-correlation of gradient features. The machine learning technique adopted is the extreme learning machine (ELM) [88], which is an efficient algorithm for single hidden layer feed-forward neural networks (SLFNs), applied in various application fields [89, 90, 91]. The GLAC and STACOG features, individually, are given as input to two ELM classifiers, and the probability outputs, from each classifier, are combined to generate the outcome (non-negativity and sum-to-one constraints are imposed). Although the result obtained is good, classifier weights are determined and fixed on the basis of training and testing samples, respectively.
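The decision-level fusion step can be sketched as a weighted combination of the class probabilities produced by the two classifiers. The linear form of the combination is an assumption (the rule used in [47] may differ); the weights are kept non-negative and summing to one, and the function name and example probabilities are hypothetical.

```python
def fuse_posteriors(p_glac, p_stacog, weight_glac=0.5):
    """Weighted fusion of two class-probability vectors with weights w and 1 - w."""
    if not 0.0 <= weight_glac <= 1.0:
        raise ValueError("weight_glac must lie in [0, 1]")
    weight_stacog = 1.0 - weight_glac  # non-negativity and sum-to-one constraints
    fused = [weight_glac * a + weight_stacog * b for a, b in zip(p_glac, p_stacog)]
    total = sum(fused)
    return [f / total for f in fused]


# Hypothetical probability outputs of the two classifiers over three action classes.
p_glac = [0.6, 0.3, 0.1]
p_stacog = [0.2, 0.7, 0.1]
fused = fuse_posteriors(p_glac, p_stacog, weight_glac=0.5)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

With equal weights the second class is selected, although the first classifier alone would have chosen the first class.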
Furthermore, the work proposed by Fan et al. [40], merges the RGB and depth data to recognize the human behaviour. A first step consists in a video pre-processing to correctly derive the max outline of the history behavior binary image (MOHBBI) characteristics. For each piece of RGB video and depth video, the visual background extractor (ViBe) algorithm [92] is applied to differentiate the foreground and the background, thus generating binary images. Next, the union operation on each binary image is performed to obtain the correspondent MOHBBI for the RGB and depth image sequences. In order to retain the maximum information of the RGB-MOHBBI and Depth-MOHBBI, and with the aim to remove noise, an intersection operation is performed on the two MOHBBI, thus creating the mixed MOHBBI. A uniform local binary pattern (ULBP) [48] is applied to the latter to extract the local texture features. The same background subtraction and binarization methods are used to process the RGB and depth image sequences and to get the spatial-temporal local texture features. Finally, the 3D image volume is projected on 2D planes in order to extract features that can represent the human activity features in the spatio-temporal domain. Concerning machine learning algorithms, both K-nearest neighbour [93] and HMM are used to detect the activities. Unfortunately, this approach is suitable only for upper limb activities.
Zhang and Parker [38], for human activity recognition, propose a new local spatio-temporal feature, called 4D color-depth (CoDe4D), which incorporates both the intensity and depth information acquired by RGB-D cameras. The CoDe4D multichannel feature detector is based on a saliency map, which allows to extract interest points in the xyzt space (i.e., local pose, shape, and texture variations in the 3D spatial dimension xyz and 1D temporal dimension
Wang and Wang [95] propose a two-stream recurrent neural network (RNN) model, based on long short-term memory (LSTM) units [96], in order to classify different actions by processing both temporal dynamics and spatial configurations of 3D skeleton joints. The temporal stream, consisting of joints coordinates at different time steps, tracks the movements of such joints. The spatial stream, instead, by casting the spatial graph of articulated skeletons into a sequence of joints, displays a spot of the visual form of skeletal data. To model the temporal dynamics, the authors examined two different LSTM architectures: a stacked one and a hierarchical one. The hierarchical LSTM structure – in which the skeleton given as input is divided into five sub-parts – was found to be more efficient and suitable to process the motions of the different sub-parts as well as the whole body. The simple stacked LSTM was used to model the spatial configurations. Because of the two-stream structure of the whole model, the combination of softmax class posteriors from the two streams results in the final action class. The limit of this method is the lack of an informative joints selection mechanism. Not all joints are meaningful in the action analysis process, and useless ones can introduce noise leading to a worse performance for the action recognition system.
In order to overcome the limits of previous works, Liu et al. [97] describe a method to recognize actions based on 3D skeletal data and a global context-aware attention LSTM (GCA-LSTM) network. The GCA-LSTM is a new two-layer LSTM model designed to identify informative skeleton joints for a specific action class frame-by-frame, by using global contextual information. Given the skeleton sequence as input, the first layer produces the initial global context information memory. Such memory is used by the second layer, for each frame, to select informative skeleton joints and to refine the global contextual information. Therefore, the context information is fed to the network at all steps and refined progressively. In this way, if a new input is important for the action analysed, the network saves more information; otherwise, the network blocks it. Finally, when the skeleton sequence is completed, the global contextual information is given as input to a softmax classifier for action class prediction. The limit of this method is related to the number of attention mechanism iterations. In fact, the authors experienced that too many iterations led to performance degradation.
Yang et al. [98] describe a novel convolutional neural network (CNN) [99] based method with an attention mechanism for human action recognition in videos. Given 3D skeletal data, the attention mechanism is used to select important joints according to the specific action. The skeleton is represented by using a depth-first tree traversal order rather than by chaining joints in a fixed order. This should allow the semantic meaning of skeleton images and structural information to be better preserved. The authors propose a global long sequence attention network (GLAN) to model spatio-temporal key stages and filter out unreliable joint predictions. Moreover, they introduce a sub-sequence attention network (SSAN) to adjust spatio-temporal aspect ratios and better learn long-term dependencies. This method is limited in handling incomplete or inaccurately estimated poses.
Focusing on the on-line action recognition problem, Liu et al. [100] introduce a framework based on 3D skeleton sequence streams combined with deep learning strategy. A dilated version of a CNN, called scale selection network (SSNet), is used to model the motion dynamics in temporal dimension by exploiting a sliding window. This type of network learns the temporal window scale for each time step dynamically, rather than using a fixed scale window. Besides recognizing the action class, in order to identify the performed part of the ongoing action, the proposed method regresses the temporal distance to the starting point of the current action instance. In this way, at the next temporal step (i.e., next frame), this value can be used as the temporal window scale for action label prediction, trying to suppress the possible incoming interference from previous actions. The convolution layers in the temporal dimension are useful to model the motion dynamics within each perception window, such that in SSNet different layers correspond to different temporal scales. Therefore, at each time step, the convolutional layer is selected, covering the most similar window scale regressed by its previous step. The activated layer is then used for action recognition. The limit of this method is that the average action prediction accuracy decreases in ending stage of some action instances.
In their recent work, Bourouis et al. [101] propose a traffic monitoring system based on a 3D car model recognition. They developed a Bayesian inference-based framework by exploiting scaled Dirichlet mixture models. The authors chose the Bayesian learning to deal with uncertainty by introducing prior information on parameters. Generally, the methods implemented for traffic monitoring require useful approaches for data clustering and modelling. Moreover, they should have good generalization capabilities and maintain a large amount of information about the data. Relying on the idea that the detection and tracking of vehicles can lead to achieve better performance in traffic control, the authors – for the proposed method – firstly focus on car detection and tracking and, secondly, on traffic scene monitoring. For car detection, initially, SIFT vectors are extracted from images representing cars using the difference-of-Gaussians (DOG) interest point detector [102]. Subsequently, such vectors are quantized through the k-means clustering in order to construct a visual vocabulary, then each object representing a car is represented by a frequency histogram over the constructed visual words. Finally, a probabilistic latent semantic analysis (pLSA) is applied in order to represent all images, and Bayesian classifiers are applied to identify cars by following Bayes’s decision rule. For the tracking process, the authors apply the weighted scaled Dirichlet mixture models after normalizing the pixels values. Since this approach is based on car detection, tracking and recognition, it can be limited as it may miss useful visual features.
More recently, Zhang et al. [103] propose a method for skeleton-based human action recognition in videos, enhancing the feature representation capability. This enhancement is obtained by introducing a semantic-guided neural network (SGN) that exploits high-level semantics of joints. Given a skeleton sequence, each joint is described by its type and frame index together with 3D coordinates and velocity. All the information is processed by the proposed network, which consists of a joint-level module and a frame-level module. The former, based on a graph convolutional network (GCN) [104], is used to model the correlations among joints in the same frame. The latter, based on CNNs, is used to handle correlations across frames by merging the information of all joints in a frame. Finally, the action classification is performed with the last softmax layer. Essentially, the cooperation of different body parts is required to perform an action, but only some of them play key roles. At the joint-level stage, this method enables adaptive graph construction to model the skeleton, but it lacks indicators of centrality to identify key information for each action.
Methods based on image description features
This section presents some recent key works that use image description features for event recognition.
SIFT/SURF and Canny ED
Cheng et al. [51] propose an event recognition method to model temporal dependencies in the data, at a sub-event level, without using event annotations. In detail, an input video is divided into segments with a fixed length. From each video segment, several features are extracted by using the BoW method on the motion SIFT (MoSIFT) [105] features. The latter differs from the original SIFT in the descriptor, which combines the standard SIFT descriptor with the HOF approach. Subsequently, the video segments are grouped with the k-means algorithm in visual words, and a word is assigned to each one. In this way, a video is represented as a visual words sequence. Since a visual word of an event can be statistically correlated to a visual word of another event distant in time, a sequence memorizer (SM) [106] – that is a hierarchical non-parametric Bayesian model without depth constraints – is used to model these interactions. In particular, the SM is applied to each visual word in order to obtain a more robust and efficient approach. As to the classification of events, several multi-class SVMs are trained for each type of event. Finally, to identify the fine-detailed temporal structures in data, the number of clusters used to generate the visual sequences increases with the complexity of the scene. Unfortunately, this value must be determined empirically in the experiments performed.
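The two quantization steps described above can be sketched as follows. The sketch assumes that both vocabularies (one for local descriptors, one for segment histograms) have already been learned with k-means and are passed in as lists of centroids; the function names and the toy data are hypothetical, and the sequence memoizer and SVM stages are not covered.

```python
def nearest_word(vector, vocabulary):
    """Index of the vocabulary centroid closest to the vector (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda k: sq_dist(vector, vocabulary[k]))


def bow_histogram(descriptors, vocabulary):
    """Normalised bag-of-words histogram of one video segment."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)


def video_to_word_sequence(segment_histograms, segment_vocabulary):
    """Represent a video as the sequence of visual words of its segments."""
    return [nearest_word(h, segment_vocabulary) for h in segment_histograms]


# Toy example: 2D descriptors and two-word vocabularies (assumed values).
descriptor_vocabulary = [[0.0, 0.0], [10.0, 10.0]]
segment = [[1.0, 1.0], [9.0, 9.0], [0.0, 2.0], [11.0, 10.0]]
histogram = bow_histogram(segment, descriptor_vocabulary)
sequence = video_to_word_sequence([[0.9, 0.1], [0.1, 0.9]], [[1.0, 0.0], [0.0, 1.0]])
```

The resulting word sequence is the kind of symbolic input on which a sequence model can capture dependencies between temporally distant sub-events.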
Wang and Ji [66, 67] propose a deep hierarchical context model for event recognition. In particular, the model learns the features, semantics and prior aspects of a context. With regard to features, the appearance and interaction features are extracted from the event neighbourhood (i.e., area around the event bounding box). In particular, the appearance features capture the appearance of non-target objects within the event neighbourhood. This capture is performed by using the SIFT key points and the related descriptors, which are then encoded with the BoW method. The interaction features capture the interactions between the event objects and the contextual objects. The SIFT key points are also used to detect the event objects. Then, the k-means algorithm is used with the key points in both the event bounding box and the event neighbourhood to generate a dictionary. The dictionary is represented by a 2D histogram, used to capture the co-occurrence frequency of its words, which are both inside and outside the event bounding box among frames. The semantic level captures the interactions among events, persons and objects. These interactions are modelled as a network composed of three layers. The top layer is a label vector that provides information on which event belongs to which class. The bottom layer consists of two vectors, one for objects and one for persons. The middle layer connects the object, person and event units to capture the interaction between them. Finally, the prior level uses two types of prior contexts: the scene priming and the dynamic cueing. The former refers to scene information, such as location (e.g., parking lot or shop) and time (e.g., noon or dark), used to dictate whether certain events occur. Instead, the dynamic cueing provides temporal information for the current event prediction, given a previous event. 
The events are identified by using a six-layer model, which contains units relating to target and contextual measurements, person and object observation, event and context features, interactions, event labels, scene states and scene observation. A limit of this model is that it is designed for two interacting entities, typically one person and one object, and it cannot be directly applied to other scenarios where more than two entities are involved.
Ribeiro et al. [107] describe the use of a deep convolutional autoencoder (CAE) [108] for the anomaly detection task, with the aim to capture the 2D structure of image sequences during the learning stage. The working idea is that a CAE is capable of learning regular events in videos, and the reconstruction error of each frame can be used as an anomaly score. The anomaly recognition problem is presented as a binary classification task: one class is human-defined, whereas the other one is an anomaly class. The proposed method uses high-level appearance or motion features combined with input frames and, subsequently, the authors evaluate what type of features, aggregated to the input data, performs better. Appearance features are extracted by using the Canny ED to detect discontinuities (i.e., set of points), where image brightness changes sharply, organized as edges. Motion features, extracted through the optical flow, are used to describe displacement or speed related to the distance that a pixel covers between two subsequent frames. Features and frames are mixed to generate different scenes as input data for the CAE. The events therein are classified by using the reconstruction error. A limit of this approach is that the more spatially complex is the video, the harder it is for the CAE to classify anomalies. According to the study, appearance features are more efficient for the task, rather than the motion ones, when combined with input data.
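The use of the reconstruction error as an anomaly score can be sketched as follows. The autoencoder itself is not implemented: the reconstructions are assumed to be given, frames are flattened into lists of pixel values, and the min-max normalisation and the threshold of 0.5 are illustrative assumptions rather than choices reported in the source.

```python
def reconstruction_error(frame, reconstruction):
    """Mean squared error between a frame and its reconstruction."""
    if len(frame) != len(reconstruction):
        raise ValueError("frame and reconstruction must have the same size")
    return sum((a - b) ** 2 for a, b in zip(frame, reconstruction)) / len(frame)


def anomaly_scores(errors):
    """Min-max normalise per-frame errors to [0, 1]; higher means more anomalous."""
    lowest, highest = min(errors), max(errors)
    if highest == lowest:
        return [0.0] * len(errors)
    return [(e - lowest) / (highest - lowest) for e in errors]


def flag_anomalies(scores, threshold=0.5):
    """Binary decision per frame (assumed thresholding rule)."""
    return [s > threshold for s in scores]


# Toy frames of two pixels each, with hypothetical autoencoder outputs.
frames = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
reconstructions = [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
errors = [reconstruction_error(f, r) for f, r in zip(frames, reconstructions)]
flags = flag_anomalies(anomaly_scores(errors))
```

Frames that the model reconstructs poorly, because they differ from the regular events seen during training, receive the highest scores.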
HOG and HOF
The work proposed by Wu and Hu [34] is focused on capturing crucial motion patterns by employing latent models for human action recognition. In detail, an event is defined for each class of action. These events are modelled from a temporal segment of the video by using the temporal pyramid model (TPM) [109], which allows low-level features to be extracted from the video as dense trajectories [110]. These are obtained by tracking and sampling the points of the trajectories on multiple spatial scales. The trajectories are represented with five descriptors: the shape of the trajectory, the HOG, the HOF and the motion boundary histogram (MBH) [111]. In order to detect the crucial motion patterns, the authors used a multi-class SVM with latent variables, solved iteratively by using the concave-convex procedure [112]. An issue of this method is that the classification performance strongly depends on the optimization parameters.
Chen and Zhang [50] use another type of low-level features: the improved trajectories features (ITF) [113], which are an improved version of the dense trajectories for action recognition. The feature descriptors remain the same as in [34], but the difference lies in isolating and removing the background trajectories and camera movements. First of all, SURF features and RANSAC [114] are used to estimate the homography between two consecutive frames, on which the computation of the five descriptors relies. Then, the optical flow is warped with the estimated homography, and motion descriptors are computed on the warped optical flow. Once the feature descriptors are extracted, their dimensionality is reduced by principal component analysis (PCA) [115] to speed up the computation, and then they are encoded with the Fisher vector (FV) [116] to improve their representation. The next step consists in constructing the clusters of the trajectories that could be part of an action. This is done by adopting the hierarchical divisive clustering algorithm [117]. Also in this work, an SVM is used to recognize events. A limit of this method is that it does not allow the real-time processing of the video.
Wang and Snoussi [52] introduced the histogram of optical flow orientation (HOFO) [118] as a feature descriptor for abnormal event recognition in videos. First of all, the HOF is used to extract the low-level features. Next, the HOFO descriptor is computed over dense and overlapping grids of spatial blocks, with optical flow orientation features extracted at fixed resolution and gathered into a high dimensional feature vector to represent the movement information of the frame. The descriptor is calculated at each block, and then it is accumulated into one global vector for each frame. The analysis of the HOFO blocks allows the authors to model the interaction among the motions of the local blocks. The classification of events is performed with a one-class SVM, which takes the HOFO descriptors as input to obtain the support vector used for the on-line frame classification. An event is identified as abnormal if it is detected over a fixed number of frames. However, if the size of the block is too small, the SVM is not robust and can classify an abnormal situation when a movement occurs in an empty environment.
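The construction of an orientation histogram per block, and its concatenation into one global vector per frame, can be sketched as follows. The optical flow is assumed to be precomputed and given as (dx, dy) pairs per block; the magnitude-weighted voting, the number of bins and the function names are illustrative assumptions and are not taken from the original descriptor definition.

```python
import math


def orientation_histogram(flow_vectors, n_bins=8):
    """Magnitude-weighted histogram of flow orientations for one block."""
    histogram = [0.0] * n_bins
    bin_width = 2.0 * math.pi / n_bins
    for dx, dy in flow_vectors:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue  # a static pixel carries no orientation information
        angle = math.atan2(dy, dx) % (2.0 * math.pi)
        index = min(int(angle / bin_width), n_bins - 1)
        histogram[index] += magnitude
    total = sum(histogram)
    return [h / total for h in histogram] if total else histogram


def frame_descriptor(blocks, n_bins=8):
    """Concatenate the block histograms into one global vector per frame."""
    descriptor = []
    for block in blocks:
        descriptor.extend(orientation_histogram(block, n_bins))
    return descriptor


# Two toy blocks: one with motion, one completely static.
moving_block = [(1.0, 1.0), (1.0, 1.0), (-1.0, -1.0)]
static_block = [(0.0, 0.0), (0.0, 0.0)]
descriptor = frame_descriptor([moving_block, static_block], n_bins=4)
```

A vector of this kind, computed for every frame, is what a one-class classifier would receive as input in a pipeline of the type described above.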
Khan et al. [119] present a solution to detect anomalies in a crowd. The video sequences in the input are processed frame-by-frame, and each frame is divided into super-pixels. For each resulting super-pixel, a HOF is extracted to model the dominant motion direction. Subsequently, the histograms from consecutive frames are combined to obtain the final features for the video. For the anomaly detection task, the authors propose a univariate Gaussian discriminant analysis with the k-means algorithm combined with a linear SVM for classification. In the presence of redundant information, like in most real surveillance videos with an event of interest occurring very rarely, some false positives are expected.
Deep features
Ijjina and Chalavadi [54] present a key pose based method to recognize actions by using motion features computed from RGB videos and depth sequences. The authors chose motion features as they are useful to emphasize human activities in different temporal regions, improving discrimination among different actions. In their work, the authors use a CNN-based model to extract motion features, referred to as convnet features. The temporal templates, used to take the motion sequence in a single image, are the classical MHI and motion energy image (MEI) computed as the weighted sum of motion data in a video, where the differences between frames are used to compute the motion among them. In detail, temporal templates are computed separately for both RGB and depth streams and, subsequently, given as input to two different CNNs for motion feature extraction. The features obtained are the inputs of an ELM classifier used to predict the action class for the video. The limit of this method is that the temporal templates are sensitive to the angle of view; therefore, the presented approach can be less efficient if used on unconstrained videos.
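The temporal templates can be illustrated with a minimal sketch of the classical MHI update rule, in which a pixel is set to a maximum value tau when motion is detected and otherwise decays by one, and of the MEI as the binary support of the MHI. Frames are assumed to be greyscale images stored as nested lists, and the difference threshold and function names are illustrative assumptions.

```python
def motion_mask(previous_frame, frame, threshold):
    """Binary mask of pixels whose intensity changed by more than threshold."""
    return [
        [1 if abs(b - a) > threshold else 0 for a, b in zip(row_prev, row_cur)]
        for row_prev, row_cur in zip(previous_frame, frame)
    ]


def motion_history_image(frames, tau, threshold=10):
    """Recent motion is bright (tau); older motion fades by one per frame."""
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for previous_frame, frame in zip(frames, frames[1:]):
        mask = motion_mask(previous_frame, frame, threshold)
        for y in range(height):
            for x in range(width):
                mhi[y][x] = tau if mask[y][x] else max(0, mhi[y][x] - 1)
    return mhi


def motion_energy_image(mhi):
    """Binary image marking every pixel where motion occurred recently."""
    return [[1 if value > 0 else 0 for value in row] for row in mhi]


# Toy sequence of three 1x3 frames in which motion moves left to right.
frames = [[[0, 0, 0]], [[50, 0, 0]], [[50, 50, 0]]]
mhi = motion_history_image(frames, tau=2)
mei = motion_energy_image(mhi)
```

In this toy sequence the most recently moving pixel holds the highest MHI value, which is how a single image can encode the temporal order of the motion.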
Yeung et al. [55] describe a method capable of capturing multiple simultaneous actions, within the same video sequence, by annotating the video with multiple dense labels. Moreover, the authors handle temporal interactions between consecutive actions with a new multiLSTM model that features temporally-extended input and output connections. The design of their innovative LSTM model allows to refine the predictions in retrospect, namely after processing more and more frames in input from the beginning. Even if the same temporally-extended features can be achieved using a bi-directional RNN (BRNN) [120], the proposed LSTM improvement can be used in on-line settings with short-time lags. Just as for the previous work, the features are convnet features, and they are extracted with the VGG 16-layers CNN [121] pre-trained on ImageNet [122] and fine-tuned, on a single frame level, on the proprietary dataset. The resulting features are the inputs for the multiLSTM network used to annotate the video densely. The limit of this method is related to the soft attention input-output temporal context mechanism implemented for dense labelling. It does not improve the output predictions significantly, due to the increased fragility when the attention is close to the output, without network layers in the middle to add robustness.
Girdhar et al. [123], instead, propose a network based on both context information and attention mechanisms for action recognition. Their architecture is similar to Faster R-CNN, with base and head networks. The former extracts features and region proposals for the humans present in the video, by using 3D convolution layers. The latter aggregates contextual information, predicts the actions and regresses a bounding box, by using the resulting features of each region. Besides, the implemented attention mechanisms learn to emphasize hands and faces, which can be crucial to discriminate an action, without explicit supervision other than boxes and class labels. The limit of the proposed network architecture is related to the number of humans detected, as the authors found that performance decreases when people are added to the scene.
As resulting from psychophysical experiments, humans can recognize and predict events from videos by observing hand movements during the preparation and execution of actions (i.e., before and after contact with an object). It seems that the visual information in the initial steps of an action is sufficient for the observers to understand it. Based on the idea that humans constantly update their belief concerning observed and future events, Fermüller et al. [56] propose a method capable of classifying dexterous actions in terms of dynamics and forces, also predicting the effects of forces on objects. Therefore, the proposed method is composed of two tasks: the first one is the action prediction from videos; the second one is the estimation of the tactile signal of the action. As to action prediction, visual features are extracted from the video by using a pre-trained VGG 16-layers CNN and, subsequently, they are given as input to an LSTM model. As to the recognition of an ongoing action, given the video sequence, the classification is updated frame-by-frame, instead of an action label to the whole video. As to the tactile signal, the authors use an LSTM model modified to estimate the hand forces for each frame; therefore the given input data are video sequences with force measurements as ground truth. Just as for the action prediction task, the visual features from the video are obtained with a pre-trained CNN applied to image patches. The limit of the presented model is that it performs worse with objects having large variations in the way different subjects can move them.
Recently, Soltanian and Ghaemmaghami [39] exploited transfer learning by applying image trained descriptors at the frame level of video sequences. The objective is to make the event recognition task possible in scenarios with limited computational resources. The authors propose a CNN-based event recognition method leveraged on both the hierarchical CNN concept scores post-processing and the concept-wise power-law normalization. A CNN, pre-trained on ImageNet, is fine-tuned using a subset of training video frames in order to obtain CNN frame descriptors. First of all, the resulting descriptors are post-processed to improve the mid-level video representation, taking into account the hierarchy and relative shortest distance of concepts in the WordNet concept tree. Secondly, a new concept-wise power-law normalization – in which different normalizations are applied to different features according to their statistics on training data – is used to improve recognition accuracy. Since the method uses only spatial descriptors, the limit of the proposed method comes up when temporal cues become significant for the event recognition.
Lee and Lee [124], focusing on cost efficiency for practical applications, propose a complex human activity recognition method, suitable for partially observed videos, based on pre-trained deep representation. The underlying idea is that if a video is partially observed, then a good representation of the given video is more important than the temporal dynamics of the actual activity. In detail, the complex ongoing action is represented using publicly pre-trained CNNs to extract frame-level image features, while considering the pairwise interactions between individuals and their participation ratio in the overall scene. The relationship between objects is constructed by exploiting not only the appearance of the object, but also both the local and global motion activations. Once the descriptor is created, an SVM is used to predict the activity label. Experiments have shown that the proposed method works better on the interaction between moving objects rather than with non-moving ones.
Differently, Shri and Jothilakshmi [168] exploit a CNN for crowd video event classification. They propose the use of a CNN for frame-by-frame video content exploration and deep feature extraction. In this way, the crowd event classification can be performed based on key-frames, improving runtime performance at low cost. The CNN was trained on 4,000 frames belonging to only four categories; therefore, the method needs to be tested on larger crowd event datasets.
Pang et al. [175], instead, propose a self-trained deep ordinal regression network for unsupervised anomaly detection in videos, consisting of a CNN for feature extraction and a fully connected neural network for anomaly scoring. Applying the self-trained ordinal regression allows the use of a weakly supervised strategy leading to end-to-end learning. The method is weakly supervised because the video frames are initially pseudo-classified, as normal or abnormal, by using generic unsupervised anomaly detectors, and the CNN is a ResNet-50 [183] model pre-trained on relevant auxiliary data. The pseudo-classified frames are the input for the proposed network, which iteratively updates and enhances itself by recomputing anomaly scores and updating the pseudo-classification accordingly, in a self-training fashion. However, this method does not consider motion information, so the set of detectable anomalies is limited.
Doshi and Yilmaz [177] also recently focused on unsupervised anomaly detection, but for traffic monitoring applications. Given a traffic video, they consider the presence of stationary vehicles an anomalous event. The proposed method consists of three main modules: a preprocessing module, a candidate selection module, and a backtracking anomaly detection module. The preprocessing stage detects and classifies stationary objects in the video by using the you only look once (YOLO) [184] model pre-trained on the MS-COCO dataset. Because the authors are interested only in vehicles, the candidate selection stage aims to remove the misclassified objects, by using the nearest neighbour approach, and to select potential anomalous regions, by using a clustering-based strategy. Finally, the backtracking anomaly detection stage computes the structural similarity (SSIM) measurement of each region of interest between its onset time and each time instance from the start of the video. If the SSIM is high, then a stationary vehicle appears in the frame. If this measure increases over several frames, then the time instance where the SSIM crosses a certain threshold is declared as the onset time of the anomaly. However, this method is not capable of learning different types of anomalies.
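The backtracking step can be sketched with a simplified SSIM and a threshold-crossing search. The sketch assumes that each region of interest is a flattened list of greyscale values and computes a single global SSIM per region instead of the usual sliding-window average; the constants follow the common defaults (0.01 and 0.03 times the dynamic range), while the threshold value and function names are illustrative assumptions.

```python
def ssim(patch_a, patch_b, dynamic_range=255.0):
    """Global structural similarity between two equally sized patches."""
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    var_a = sum((a - mean_a) ** 2 for a in patch_a) / n
    var_b = sum((b - mean_b) ** 2 for b in patch_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(patch_a, patch_b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilises the contrast term
    numerator = (2.0 * mean_a * mean_b + c1) * (2.0 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def anomaly_onset(ssim_series, threshold):
    """First time index at which the similarity reaches the threshold."""
    for time_index, value in enumerate(ssim_series):
        if value >= threshold:
            return time_index
    return None


# Reference region containing the stationary vehicle, compared with the
# same region at earlier time instances (toy values).
reference = [0.0, 255.0, 0.0, 255.0]
history = [[255.0, 0.0, 255.0, 0.0], [0.0, 200.0, 0.0, 200.0], reference]
series = [ssim(reference, region) for region in history]
onset = anomaly_onset(series, threshold=0.8)
```

The similarity is low while the region still shows the empty road and rises once the vehicle has stopped, so the first crossing of the threshold approximates the onset time.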
Zhang et al. [179] exploit end-to-end learning and a weakly supervised scheme for collective activity recognition. The method is weakly supervised because it uses only the bounding boxes of people detected in the videos and activity labels. The authors propose a fast deep learning architecture that jointly models the person detection and collective activity recognition tasks, reinforcing each other. Indeed, such tasks share the feature extraction part, aiming to reduce the computational cost and remove people not involved in the activity execution. Therefore, the network architecture consists of two main components: person detection and collective activity reasoning. The former is a custom version of the region proposal network (RPN) introduced in Faster R-CNN [185] that returns regions of interest representing the persons detected and scene bounding boxes. The latter uses the visual features of the people and scene as the latent variable embeddings, updating them by using a mean-field like procedure to capture more abundant person interactions classified with a softmax function. This method exploits only visual information to perform the classification, although motion information could also be useful to improve collective activity recognition.
Discussion
This section provides observations on types of features and machine learning algorithms currently used for the automatic event recognition task. Moreover, it contains two summary tables that provide a concise overview of some central aspects (i.e., features, algorithms and datasets) concerning the works in Section 4. The discussion aims to help the reader choose the best combination between features and algorithms to develop a specific ER system.
By analysing the most important and recent works regarding the current state-of-the-art in event recognition, it is possible to observe that trajectories and motion vectors are used to manage a wide range of events. These features can be computed suitably and easily on almost every type of moving subject (e.g., people, vehicles or animals), without high computational costs. Minimizing computational costs is very important for an ER system implementation, especially if it is designed for real-time use, such as to monitor public (e.g., streets or squares) and private (e.g., military zones or warehouses) areas. However, even if the analysis of trajectories or motion features is widely used to automatically understand events in which rigid subjects are involved (e.g., vehicles), for some applications it could be interesting to analyse such subjects also granularly (e.g., fault tolerance test). In general, when an ER system is designed for non-rigid subjects, it is a good idea to consider the information of both the whole subject and the set of its sub-parts. The greater availability of information, concerning the movements of non-rigid subjects, increases the set of events that can be recognized, but also the complexity of the recognition task. If, for example, an ER system is only capable of processing information about the whole body of one subject, only a small set of simple events can be recognized. Processing all the information concerning the interaction among several subjects, instead, allows different types of events to be recognized, including the type of relationship between two subjects and the attitude in a crowd of people. The increasing complexity of the ER system, aiming to process sub-parts of the human body (e.g., hands or arms), allows a greater number of subject behaviours to be classified, including emotional states (e.g., anxiety, happiness or fear).
Based on the above observations, in recent years there has been an increasing interest in depth sensor technology, through which it is possible to obtain the skeleton of human subjects easily. The possibility of working with dynamic information (e.g., position, speed or acceleration) of specific points of the human body (i.e., skeleton joints) is a real step forward in the automatic interpretation of human behaviours. From a technical point of view, it is not of particular interest to identify which kind of sensor has been used to obtain the skeleton model of the subjects that act within an area of interest, since currently this task can be achieved by different technologies, including RGB cameras. However, during the implementation of a real ER system, several aspects and related limits have to be considered. For example, a single depth sensor works well in an indoor environment, and the information acquired is sufficient to create the skeleton model of the subjects involved. On the other hand, two (or more) depth sensors cannot be placed facing each other, because this would generate interference. Besides, in general, the spatial resolution of these sensors is lower than that of RGB sensors. Finally, consumer depth sensors commonly perform poorly in outdoor environments due to their sensitivity to sunlight. By contrast, RGB sensors work well in outdoor environments but, in general, they require calibration phases and the reconstruction of skeletons, which, in real contexts, are neither simple nor guaranteed. In any case, the selection of the most suitable sensors depends on several factors, including the properties of the scene, the types of events to be recognized, the number of subjects, and so on.
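As an illustration of the dynamic information mentioned above, the sketch below estimates the velocity, speed and acceleration of a single skeleton joint from its 3D positions by simple forward differences. The names `finite_difference` and `joint_dynamics`, the constant frame rate and the absence of any noise filtering are assumptions made for this example; real skeleton streams are usually smoothed first.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def finite_difference(samples: List[Vec3], dt: float) -> List[Vec3]:
    """Forward difference between consecutive 3D samples, divided by dt."""
    return [
        tuple((b[i] - a[i]) / dt for i in range(3))
        for a, b in zip(samples, samples[1:])
    ]


def joint_dynamics(positions: List[Vec3], fps: float):
    """Return (velocities, speeds, accelerations) for one joint.

    positions: 3D positions of the joint, one per frame.
    fps:       frame rate of the sensor (assumed constant).
    """
    dt = 1.0 / fps
    # First derivative of position: one vector per pair of frames.
    velocities = finite_difference(positions, dt)
    # Second derivative: difference of consecutive velocities.
    accelerations = finite_difference(velocities, dt)
    # Scalar speed is the Euclidean norm of each velocity vector.
    speeds = [sum(c * c for c in v) ** 0.5 for v in velocities]
    return velocities, speeds, accelerations
```

For N frames this yields N-1 velocity samples and N-2 acceleration samples per joint. Repeating the computation for every joint gives a per-frame feature vector that a temporal model can consume.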
Regarding image descriptors, their use for event recognition has grown considerably in recent years. This is due to the fact that most of the image descriptors on RGB videos are invariant to scale, rotation, translation and, partially, to illumination changes. Another advantage of image descriptors is that, with modern hardware, they can be computed in real-time, allowing the design and implementation of reactive ER systems. Besides, they can be used to recognize any type of subject and extract several types of features (e.g., trajectories, patterns or salient points). Some drawbacks of image descriptors are related to the quality of the processed images and the matching algorithm. In particular, the image descriptor approach can work correctly only if the source images have high spatial resolution without significant noise or distortion. Moreover, the source images should represent elements, in both foreground and background, with rich details in order to build a robust set of descriptors. Since each descriptor has a very complex structure, the matching process, which aims to identify correctly each pair of corresponding descriptors between two consecutive video frames, can be affected by different errors. In any case, this approach can achieve remarkable results through an accurate pre-processing phase and proper use of the descriptors, by taking into account the specific goals of the application context.
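The matching step between consecutive frames, and one common way of limiting its errors, can be sketched as follows. The example uses brute-force nearest-neighbour matching with a distance-ratio test, a widely used heuristic that is chosen here for illustration and is not attributed to any specific surveyed work. The function name `match_descriptors`, the threshold value 0.75 and the toy two-dimensional descriptors are assumptions; real descriptors have far more dimensions.

```python
from math import dist
from typing import List, Sequence, Tuple


def match_descriptors(
    prev_desc: List[Sequence[float]],
    curr_desc: List[Sequence[float]],
    ratio: float = 0.75,
) -> List[Tuple[int, int]]:
    """Match descriptors of the previous frame to those of the current one.

    A match (i, j) is kept only when the best candidate j is clearly
    closer to descriptor i than the second-best candidate.
    """
    matches = []
    for i, d in enumerate(prev_desc):
        # Rank all candidates in the current frame by Euclidean distance.
        ranked = sorted((dist(d, c), j) for j, c in enumerate(curr_desc))
        if not ranked:
            continue
        best_dist, best_j = ranked[0]
        # With a single candidate there is nothing to compare against.
        if len(ranked) == 1 or best_dist < ratio * ranked[1][0]:
            matches.append((i, best_j))
    return matches
```

Ambiguous descriptors, for which two candidates are almost equally close, are discarded rather than matched. This trades a lower number of matches for fewer wrong ones, which matters in scenes with repetitive or poorly detailed texture.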
Regarding the recognition step, in the reported works the most used algorithms are clustering, SVM and neural networks, while the most used features are 3D data and deep features. The clustering approach is generally used to associate the actions identified or, in some cases, the estimated poses of subjects with the ones previously learned by the ER system during the training step. The SVM approach is usually applied in both single-class and multi-class methods. Besides, it is also routinely used to check whether the movements identified in the scenario trigger one or more events. There has also been an increase in the use of neural networks, as this type of approach can be applied to binary or multi-class classification tasks, in a supervised or an unsupervised way. In the supervised learning paradigm, neural networks require a large amount of training data to obtain a complex model that recognizes events well. In the case of skeleton-based methods, neural network models usually seem a natural choice for processing motion dynamics in the temporal domain and joint dependencies in the spatial domain.
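The association between an observed action (or pose) and the ones learned during training can be illustrated with a deliberately simplified sketch. Instead of a full clustering algorithm, it computes one centroid per labelled group of training feature vectors and assigns a new observation to the nearest centroid, rejecting it when it is too far from all of them. This is a nearest-centroid scheme used as a stand-in for the clustering-based methods discussed above; the names `learn_centroids` and `associate`, the labels and the distance threshold are all hypothetical.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Tuple[float, ...]


def learn_centroids(samples: List[Tuple[str, Sequence[float]]]) -> Dict[str, Vector]:
    """Compute the mean feature vector of each labelled group."""
    grouped = defaultdict(list)
    for label, vec in samples:
        grouped[label].append(vec)
    return {
        label: tuple(sum(col) / len(vecs) for col in zip(*vecs))
        for label, vecs in grouped.items()
    }


def associate(
    vec: Sequence[float], centroids: Dict[str, Vector], max_distance: float
) -> Optional[str]:
    """Return the label of the nearest centroid, or None if none is close enough."""
    label, d = min(
        ((lbl, dist(vec, c)) for lbl, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if d <= max_distance else None
```

The rejection threshold lets the system report that an observation matches no learned action, instead of forcing it into the closest group.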
Table 2 reports the types of recognizable events with reference works and datasets used for evaluation. As shown, the MSR collection and NTU RGB+D datasets are the most used by the approaches that make extensive use of 3D data. Instead, classical approaches that process RGB videos typically use the UCF50 and HMDB51 datasets. Finally, Table 3 correlates application fields, detectable subjects, used features, and applied machine learning algorithms. This table is of particular interest because it highlights the set of features and algorithms used to address several application fields and, for each of them, a set of selected works that the reader can consult immediately in order to explore the related topic. Moreover, useful information is given by the connection between the set of features and the related set of machine learning algorithms used.
Conclusions
In recent years, the increased use of video sensor networks has promoted the development of smart algorithms capable of understanding the semantic content of videos. These algorithms aim to understand automatically the sequence of actions performed by different subjects within the observed scene. These actions represent simple or complex events, depending on the type of interaction among the subjects involved. This survey provides formal definitions of scene and event, reasoning on event effects, and the description of a logical architecture for a generic ER system. Moreover, it describes the state-of-the-art works that, in our opinion, are cutting-edge in the event recognition task. It also presents two taxonomies used to describe the selected key works concerning the event recognition task. The first taxonomy classifies the types of features used to describe events, while the second one classifies the machine learning algorithms used to recognize events. This survey also highlights the different limits of the works reported and provides the list of datasets used to evaluate the ER systems. Finally, the last two tables summarize some key aspects of the ER methods discussed.
Acknowledgments
This work was partially supported by both the ONRG project N62909-20-1-2075 “Target Re-Association for Autonomous Agents” (TRAAA) and MIUR under grant “Departments of Excellence 2018–2022” of the Department of Computer Science of Sapienza University.
