Abstract
This work proposes a key pose based intelligent system for recognizing human interactions in video streams. Beyond interaction recognition, the approach is useful for other applications such as content-based video retrieval. The main idea is to use the shape of the bilateral silhouette between the persons and analyze it with the shape context descriptor, one of the popular shape descriptors in object recognition and matching tasks. First, a dictionary of randomly selected samples from all classes is collected, and the bilateral silhouette image is extracted for all samples and classes to train the low-level classifier, named the frame classifier. Then, the frames of a test sequence are compared with these samples and each frame is labeled with one class by the frame classifier. Finally, a high-level classifier, which we call the sequence classifier, categorizes the interaction as a function of the predefined labels of the frame sequence. Because of probable errors in foreground extraction, some faults may occur in frame classification. Moreover, each interaction sequence is composed of two types of frames, which contain information that is either related or unrelated to the interaction. To tackle these problems, a normalized histogram of the frame labels is used as the action descriptor, which is robust against misclassification of some frames. This histogram is fed to a sequence classifier such as random decision forests (RDF), a probabilistic neural network (PNN) or a support vector machine (SVM) to perform interaction recognition. Experimental results on the SBU and UT-interaction datasets demonstrate the strong performance of the proposed method.
Keywords
Introduction
Human action recognition (HAR) is an important computer vision task because of its many applications, such as video content analysis and retrieval [9,12,16], human computer interaction (HCI) [41], healthcare systems [10] and autonomous video surveillance [13,24,44]. This is why action recognition has attracted a great deal of research over the last decades. Human interaction recognition (HIR) is a type of HAR in which the artificial intelligence system focuses on the interactions between two or more persons. Recognizing human interactions is more difficult than recognizing single-person actions because of the occlusion of body parts during the interaction and articulation ambiguity. In most action recognition datasets, the sequences are captured from the front view of a person, so both hands are visible and there is no occlusion. In interaction recognition scenes, by contrast, we usually lose half of each person’s body because only one side of each person is visible during the interaction. Moreover, the variety of action types and the number of typical errors are larger in interaction recognition than in action recognition. Another challenging problem in HIR is the shortage of large training databases. Since most research has concentrated on human action recognition, there are not enough datasets available for HIR. Most interaction datasets have been recorded with only two actors, which is not enough for a comprehensive dataset. Each person has his/her own motion characteristics. Therefore, to develop an all-inclusive dataset, more than two actors are needed to play the same role. One useful dataset is the SBU Kinect interaction dataset, which is composed of 21 sets of two actors, drawn from seven participants, with 8 interactions: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands [43]. Three important aspects make SBU a challenging dataset.
The first is that the actions performed by the actors are non-periodic, which makes them harder to recognize than periodic actions. The other aspects are the similarity between body movements in different action categories and the diversity of the actors, who perform the specified actions with different execution speeds. This dataset is recorded with a

An example of bilateral silhouette for a boxing frame. To extract this shape from the foreground image, the actors are connected to each other by drawing a straight line from first actor’s head to the second one’s head. The space between the actors is highlighted as bilateral silhouette.
Afterward, shape context is used to compare the shapes with pre-defined templates [3]. Each selected frame is labeled as one of the classes, and this procedure is applied to all consecutive frames. Then, a sequence of labels is extracted and the histogram of the labels in the sequence is computed. These histograms are classified using standard classifiers. To increase the recognition rate, classification is completed in two stages, using a frame classifier and a sequence classifier. The sequence classifier is needed because the frame classifier alone cannot overcome problems such as bilateral silhouette failures and frame-level classification errors, so we cannot rely on frame classification only. The key advantage of bilateral silhouette analysis is its low computational cost, because the system analyzes a 2D binary shape instead of 3D skeletons or color images. The speed and accuracy of the system depend on the dictionary size, the image resolution and shape context matching parameters such as the number of points sampled from the boundary image. To speed up the algorithm, we need a smaller dictionary of poses or fewer sample points for shape matching. On the other hand, to obtain a more accurate system, we should use a larger dictionary and more sample points. There is therefore a trade-off between real-time performance and accuracy. Unlike some state-of-the-art approaches, no pre-processing stages such as motion segmentation or person tracking are needed. Removing such pre-processing stages avoids their probable failures and errors, and feature extraction is done in a single stage. Common techniques such as optical flow, oriented gradients and spatio-temporal interest points have been frequently applied in action and interaction recognition tasks, and some studies have reported near-perfect results.
However, these methods suffer from high computational complexity. Finally, we perform a number of experiments to evaluate the accuracy of the shape context descriptor, testing the algorithm on the SBU and UT-interaction datasets. We have tested three different types of classifiers in the sequence classifier stage, and the results are illustrated in confusion matrices. Various cross-validation protocols, such as random subsampling, K-fold and leave-M-out, are used to split action/interaction datasets into training and testing subsets. We choose the leave-one-out (LOO) method, in which each experiment uses one example for testing and the remaining examples for training. The LOO evaluation scheme is well suited to small datasets because it is important to train on as many samples as possible in such datasets.
The remainder of this paper is organized as follows: Section 2 briefly reviews related work on interaction recognition and the main idea of each state-of-the-art work. Section 3 presents the proposed method in detail, including the feature extraction and classification methods. Our experimental results on the SBU and UT-interaction datasets are presented in Section 4. Section 5 concludes the paper and gives a summary of this study.
Many works on human action/activity recognition focus on realistic datasets collected from the Internet or movies, while others use artificial databases prepared for specific purposes. Accordingly, the features extracted in existing studies can be divided into three major categories based on the nature of the database: audio features, textual features and visual features. Audio features are often used in movie-based databases, where the sequences are parts of a movie. Textual features are exploited in realistic databases, i.e. YouTube-based databases, which contain tags and titles, or movies with subtitles. Visual features are the most popular and can be used with any kind of database. Since HIR is a subset of HAR, we can categorize HIR methods by their image representation into two main classes: local representation and global representation [32]. Local representation is a bottom-up approach in which the observed image is represented as a collection of independent patches. In this method, interest points are extracted first and local patches are computed around them. Global representations, on the other hand, work top-down and capture the visual observation as a whole. In a global representation, the actor(s) must be located in the first step, using background subtraction or person tracking methods. Then, the region of interest is encoded by image descriptors. It is important to locate the actor(s) accurately. As an example of a global representation, recognition of two-person interactions is performed using a hierarchical Bayesian network (BN) in [27]. The low level of the BN estimates the tracked body parts such as the head, arms and legs, and the high level of the BN estimates the overall body pose. Afterward, the evolution of the poses of the multiple body parts is processed by a dynamic BN.
In another study, the visual motion patterns are first segmented and a set of middle-level components is generated by clustering keypoint-based trajectories extracted from the video sequence. Secondly, spatio-temporal relationships between pairwise components are defined. These pairwise components and middle-level components are the final features, which describe the motion characteristics of a video sequence [42]. A two-level analysis framework is proposed in another work to perform human activity recognition with distributed camera sensors in different fields of view. In track-level analysis, the gross-level activity patterns are analyzed, and in body-level analysis, a person’s activity is analyzed in terms of the coordination of the individual body parts [28]. Another model uses natural language to define the relation between two persons at a detailed semantic level and divides human interaction into single-person actions composed of torso and arm/leg motions. The person’s action is explained in terms of
Several approaches combine local and global representations to improve recognition accuracy, e.g. [14]. That method presents a model that fuses large-scale global features and local spatio-temporal features. Optical flow is used as the global feature after pedestrian detection. To describe the local features, spatio-temporal interest points are detected and gradient descriptors are computed at these points.
There are also works that take advantage of dictionaries and key frames to help classify video-based datasets, e.g. the action recognition methodology introduced in [7]. Most of these key frame and key pose based approaches tackle key frame extraction with automatic methods. Some papers address key frame extraction using visual features, e.g. [37], which proposes a method based on intra-frame and inter-frame motion histogram analysis. In other words, the key frames are selected based on the complexity of motion. In another study, a saliency-based visual attention model is introduced to extract key frames [20]. In a different category, key frames are extracted using dynamic Delaunay graph clustering with an iterative edge pruning strategy, evaluated on YouTube videos and 100 videos from the Open Video database [19]. An earlier work extracts key frames using the color histogram, moments of inertia and visual features extracted from the correlation of the RGB color channels [15]. The results reveal that the extracted key frames are close to the key frames chosen by humans. We use a key pose extraction algorithm inspired by the work of [1]. This key pose extraction method is based on the k-medoids clustering algorithm. The cluster medoids represent the common poses in each action and are therefore candidates for the key poses.
In summary, many studies concentrate on audio, visual and textual features, proposed for different applications of interaction recognition. The employed algorithms differ according to the application. For example, realistic video categorization is quite different from surveillance video classification or Kinect video categorization. Since we use a top-down scheme, our work belongs to the global representation group, so major differences are expected from the studies in the local representation category such as [21] and [39]. The main workflow of the proposed methodology is close to the one introduced in [7], but there are significant differences in the databases and in feature selection. They use a simple distance-based feature to perform action recognition, while we use shape context as a powerful descriptor on interaction recognition datasets. They use K-means clustering with Euclidean distance to extract key poses, whereas we use the K-medoids key pose extraction method, which is more accurate and robust because it can rank the potential of each key frame candidate [1].

An overview of our proposed approach.
Overview of methodology
In this part, the proposed algorithm is introduced briefly in three main phases, which are discussed in detail in the next sections. Our proposed system is illustrated as a graphical abstract in Fig. 2.
In the first step, a dictionary of bilateral silhouettes is created (see Fig. 1). To this end, important key frames, which contain distinctive motion, are extracted. These key frames are extracted in two ways: automatically and manually. In the frame classification stage, the extracted frames are compared with the dictionary using shape context matching and each is classified as one of the dataset classes (e.g. for SBU): approaching (close), departing (far), pushing, kicking, punching, exchanging objects, hugging, and shaking hands, where each class label is specified by a number. In the sequence classification stage, the histograms of these frame labels are classified by the second classifier, which recognizes the label of the whole video. The sequence classifier helps us to classify variable-length videos in addition to overcoming probable errors.
Bilateral silhouette extraction
The bilateral silhouette is a novel representation, which is robust to partial occlusions like one hand occlusion. The bilateral silhouette captures the space between the actors while the occluded body parts like left hand will be removed automatically (see Fig. 3).

Bilateral Silhouette is robust against partial occlusions, because the most important part of the scene is the space between the persons, not their occluded body parts.

Foreground extraction process using GMM clustering. There are some post-processing stages to extract the pure foreground image.
The method is also robust to small shape deformations. The shape context descriptor is insensitive to small perturbations, as shown in [3]. Since we use the shape context descriptor as the basis of the proposed approach, the algorithm inherits this robustness against deformations. For example, small deformations of the head do not affect the recognition of a kicking interaction. To extract the bilateral silhouette, the background is removed and the actors are connected by a straight line between the two persons’ heads. To draw the line, the topmost points of the two persons’ silhouettes are connected. Then, the spatial gap between the actors is highlighted as the bilateral silhouette, which aims to capture the relation between their actions. The foreground is extracted using a Gaussian Mixture Model (GMM) clustering method, which gives good results for slowly moving foreground objects. After the clustering stage, some morphological operations are needed to obtain a clean foreground image. Then, the XOR of the foreground image and its convex hull is computed as the bilateral silhouette (see Fig. 4).
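The XOR step described above can be sketched as follows. This is a minimal illustration, assuming the foreground is already available as a clean binary mask; the function names `convex_hull_mask` and `bilateral_silhouette` are hypothetical, and the hull is rasterized with plain NumPy rather than with whatever image processing toolbox the original experiments used.

```python
import numpy as np

def convex_hull_mask(mask):
    """Rasterized convex hull of the True pixels of a 2D boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    rows, cols = np.nonzero(mask)
    pts = sorted(set(zip(cols.tolist(), rows.tolist())))  # (x, y), sorted
    if len(pts) < 3:
        return mask.copy()

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(points):
        # One pass of Andrew's monotone chain algorithm.
        chain = []
        for p in points:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    hull = half_hull(pts) + half_hull(pts[::-1])  # consistently oriented
    yy, xx = np.indices(mask.shape)
    inside = np.ones(mask.shape, dtype=bool)
    # A pixel is inside the hull if it lies on the inner side of every edge.
    for a, b in zip(hull, hull[1:] + hull[:1]):
        inside &= ((b[0] - a[0]) * (yy - a[1]) - (b[1] - a[1]) * (xx - a[0])) >= 0
    return inside

def bilateral_silhouette(foreground):
    """XOR of the foreground mask and its convex hull."""
    foreground = np.asarray(foreground, dtype=bool)
    return convex_hull_mask(foreground) ^ foreground

# Toy example: two vertical bars stand in for two actors.
fg = np.zeros((5, 7), dtype=bool)
fg[:, 1] = True
fg[:, 5] = True
print(bilateral_silhouette(fg).astype(int))
```

In this toy case the result is exactly the gap between the two bars, which is the region the descriptor is meant to analyze.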
Finding a particular item in a summary is easier than searching a huge database, so we summarize the video to save computation time in data retrieval. Key frame extraction is the basis of video summarization and retrieval; it selects some of the most informative frames in a video to reduce the number of records required for video indexing. Video summarization can produce still images (key frames) or moving images (video skims) [40]. Key frames are selected salient frames that carry semantic importance, whereas video skims are selected meaningful video shots. Since organizing still images for further browsing is easier than organizing video skims, we select key frames to form our dictionary. To achieve good recognition performance, the selected frames must convey useful information and add a new shape of the desired action to the dictionary. Picking duplicated frames may decrease speed without any gain in accuracy. There are two ways to extract a meaningful dictionary from samples of all classes: automatically selected frames and manually selected frames.
In this work, we extract the key poses using both methods and compare the results. The manual method has lower computational complexity than the automatic methods. The selected frames should differ from each other, and the number of frames per sample depends on the diversity of human poses during the interaction. To illustrate the efficiency of the shape context based frame classifier, we separated the training data from the test data, which means that the dictionary does not contain key frames from all of the samples. The SBU dataset consists of 21 subsets, each performed by a different pair of actors. We selected about 40% of the subsets to learn the dictionary, and no key frames are extracted from the remaining subsets of the dataset. Eight subsets for eight classes are selected to learn the dictionary (see Table 1). Similarly, in the UT-interaction dataset, there are ten sequences for each class performed by different actors. We selected seq1 to seq4 of each class to learn our dictionary (see Table 1).
Subsets used for training and testing dictionary

Some of important frames are selected for each class and the collection of these frames will form the dictionary. All of the key frames will participate in the classification stage.
The size of the dictionary used in the frame classifier has been experimentally set to 273 key frames for the SBU dataset. For the UT-interaction dataset, we selected seq1 to seq4 from each set, which gives 149 key frames. An example of the dictionary for three samples is shown in Fig. 5.
Since the UT-interaction dataset has no approaching and receding classes, we have modeled these classes ourselves. In effect, we have separated out the frames that occur before and after the desired interaction (see Fig. 6). Using a distance criterion, we can recognize whether the actors are standing far apart or close together. We can also determine whether they are getting closer or moving away from each other.
In automatic key frame extraction, the per-class key poses are obtained by k-medoids clustering with Euclidean distance, similar to [1]. These key poses represent important parts of the interaction sequence. Automatic key frame extraction is usually applied to action databases, which have just one action per sample. An interaction, on the other hand, needs more than one actor, and the actors have to get close, interact and probably depart. In an interaction database, each interaction therefore consists of three kinds of actions: approaching, performing the interaction, and departing. As a result, in order to obtain key frames from interaction databases (e.g. SBU), we need to eliminate the approaching and departing key frames from each category. To overcome this problem, we use shape context descriptor to compare the key frames of the categories

Finding key poses
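The k-medoids clustering used for automatic key pose extraction can be sketched as follows. This is a minimal illustration under stated assumptions: each frame is assumed to be described by a fixed-length feature vector, the function name `k_medoids`, the evenly spaced initialization and the iteration cap are hypothetical choices, and the simple alternating update shown here is not necessarily the variant used in [1].

```python
import numpy as np

def k_medoids(X, k, n_iter=100):
    """Alternating k-medoids with Euclidean distance.

    Returns the indices of the medoid samples (candidate key poses)
    and the cluster index assigned to every sample.
    """
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = np.linspace(0, len(X) - 1, k).astype(int)  # assumed init
    for _ in range(n_iter):
        assign = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.nonzero(assign == c)[0]
            if members.size:
                # The medoid minimizes the summed distance to its cluster.
                within = dist[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(dist[:, medoids], axis=1)

# Toy example: six "frames" forming two well separated pose groups.
frames = np.array([[0, 0], [1, 0], [2, 0], [100, 0], [101, 0], [102, 0]])
key_idx, clusters = k_medoids(frames, k=2)
print(key_idx, clusters)
```

Unlike k-means centroids, the medoids are actual frames of the sequence, so they can be placed directly into the dictionary.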
The development of shape matching and recognition algorithms has gained great interest in recent years. An early study introduced a 2D shape matching method to measure the similarity between shapes, which is used in many applications such as object recognition [3]. Because of its efficiency, the method has been extended to a 3D shape context in [18]. The descriptor is invariant to translation, scaling and occlusions, as indicated in [3]. It is also robust to small geometrical distortions and the presence of outliers, and it can be made rotation invariant using the local tangent orientation. Here is a brief review of shape context matching:

Approaching and departing classes are defined in UT-interaction dataset.

Shape context descriptor (down) is computed on the selected points from boundary image (up). The descriptor is a
In the first step, the bilateral silhouette is represented as a discrete set of points, which are sampled from the external or internal edges on the binary image:
Therefore, after foreground extraction, the contour of the bilateral silhouette is extracted using a robust edge detector such as the Canny algorithm [6]. The sampled contour points should always be selected in the same order. For example, the topmost point is taken as the first sample and the following points are selected in clockwise order. Then, a coarse histogram for each point
This histogram is called shape context of pi, which is computed in
After extracting shape context histogram, the cost of matching point
Where,
Subject to the constraint that the matching is one to one, i.e., π is a permutation [2]. In low level classification where each query frame is compared with the samples of dictionary, a simple voting scheme will be applied to select the most relevant action label to the query frame.
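The descriptor and the matching cost can be sketched as follows, assuming the standard definitions from [3]: for every sampled point, a log-polar histogram counts the positions of the remaining points relative to it, and two points are compared with the chi-square distance between their normalized histograms. The bin counts (5 radial, 12 angular), the radial range and the function names are assumptions made for illustration; the one-to-one assignment step itself is not shown.

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar shape context histogram for every point (one row each)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]          # diff[i, j] = p_j - p_i
    dist = np.hypot(diff[..., 0], diff[..., 1])
    off_diag = ~np.eye(n, dtype=bool)
    dist = dist / dist[off_diag].mean()               # scale normalization
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    r_bin = np.digitize(dist, r_edges) - 1            # -1 or n_r: out of range
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    valid = off_diag & (r_bin >= 0) & (r_bin < n_r)
    hist = np.zeros((n, n_r * n_theta))
    rows = np.nonzero(valid)[0]
    np.add.at(hist, (rows, (r_bin * n_theta + t_bin)[valid]), 1)
    return hist

def chi2_cost(h1, h2):
    """Pairwise chi-square cost between two sets of histograms."""
    h1 = h1 / np.maximum(h1.sum(axis=1, keepdims=True), 1)
    h2 = h2 / np.maximum(h2.sum(axis=1, keepdims=True), 1)
    num = (h1[:, None, :] - h2[None, :, :]) ** 2
    den = h1[:, None, :] + h2[None, :, :]
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return 0.5 * ratio.sum(axis=2)

contour = np.array([[0, 0], [4, 0], [4, 3], [0, 3], [2, 5]])
cost = chi2_cost(shape_context(contour), shape_context(contour + [10, -7]))
print(np.round(np.diag(cost), 6))
```

Because only relative positions enter the histograms, shifting the contour leaves the descriptor unchanged. A one-to-one matching π could then be obtained by solving the assignment problem on the cost matrix, for example with the Hungarian algorithm.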
We sample 100 points uniformly from the boundary. The output of shape context matching is the number of matched points. To perform classification, each query frame is compared with the key frames of each of the eight classes (Fig. 5) and the number of matched points is saved. For each interaction class in the dictionary, the ten greatest of the saved values are selected and averaged. The output label is the class with the maximum of these eight averages. As a result, this step converts a video sequence into a sequence of labels over the eight classes (see Algorithm 2). One significant aspect of our proposal is the computation of the output sequence histogram. The normalized histogram of the labels is computed to show the distribution of action labels in a video sequence. Figure 8 illustrates the histograms of the labels for the actions performed in Fig. 9. The samples in Fig. 9 are ideal samples and there will obviously be errors in the low-level classification step, but the second stage of classification compensates for this noise and these misclassifications. For example, in the boxing class, some of the frames may be classified as the giving something or punching class, and the misclassification percentage depends on the richness of the dictionary.
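The voting rule and the histogram feature described above can be sketched as follows. This is a minimal illustration, assuming the matched-point counts against all key frames are already available; the function names and the zero-based class indices are hypothetical.

```python
import numpy as np

def classify_frame(match_counts, key_labels, n_classes=8, top_k=10):
    """Label one query frame from its match counts against the dictionary.

    match_counts[i] is the number of matched points between the query
    frame and key frame i; key_labels[i] is the class of key frame i.
    """
    match_counts = np.asarray(match_counts, dtype=float)
    key_labels = np.asarray(key_labels)
    class_scores = np.full(n_classes, -np.inf)
    for c in range(n_classes):
        # Average the top_k greatest match counts of this class.
        best = np.sort(match_counts[key_labels == c])[::-1][:top_k]
        if best.size:
            class_scores[c] = best.mean()
    return int(np.argmax(class_scores))

def label_histogram(frame_labels, n_classes=8):
    """Normalized histogram of the frame labels of one video sequence."""
    counts = np.bincount(np.asarray(frame_labels), minlength=n_classes)
    return counts / counts.sum()

# Toy example with two classes and two key frames per class.
print(classify_frame([90, 10, 60, 80], [0, 0, 1, 1], n_classes=2, top_k=2))
print(label_histogram([0, 0, 1, 3], n_classes=4))
```

Averaging several top matches instead of taking the single best one makes the frame label less sensitive to one accidental good match, and the normalized histogram has a fixed length regardless of the number of frames in the video.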

Frame classification

Histogram of classified output labels for the samples in Fig. 9. Non-normalized histograms are represented to show number of labels.

Low level classification results for one sample per class. After extraction the output sequence, histogram of the labels will be computed as a feature.
Every sample in the interaction datasets is composed of frames from several classes. For example, a kicking sequence is composed of approaching, kicking and receding frames with a partly specified distribution. So, we need another classifier that uses these data and recognizes the interaction label from the histograms. In fact, recognizing an interaction in a video sequence from just one frame is ineffective. Moreover, there may be errors in the frame classification block between similar classes such as hand shaking and giving an object. To overcome this noise and these misclassifications, a sequence classifier is used in the next stage. To do this, three kinds of common classifiers are trained: random decision forests (RDF), probabilistic neural networks (PNN) and support vector machines (SVM).
Different kernel functions can be defined to form a more flexible classifier, which can handle problems that are not linearly separable. SVM works well with fewer training samples in comparison with other classifiers. This is an important feature in video processing applications such as HIR, because after training the classifier is applied to many further sequences, so the test samples far outnumber the training samples.
Experimental results
In this section, we present the experimental results of our approach on the SBU human interaction dataset and the UT-interaction dataset. We do not compare our classification results on the SBU dataset with state-of-the-art studies other than [43], which uses the same dataset, because two algorithms can only be compared fairly when they are tested on the same dataset. The experiments demonstrate the effectiveness of our proposed method for the two-person interaction categories in the SBU dataset. To the best of our knowledge, the SBU dataset was only used in [43]. That research used the depth images of the dataset as the main data. Since the SBU dataset provides RGB and depth video sequences separately, we used the RGB data as the main information. We have also compared the results of our experiments on the UT-interaction dataset with recent results published on this dataset.
SBU interaction dataset
In a PNN classifier, a key parameter that controls the learning process is the spread value (σ). Picking a large spread value may cause loss of detail, and the model will then not be able to fit the function accurately. On the other hand, if the selected spread value is too small, overfitting will occur. To find an optimal σ, a set of values from 0.1 to 4 was examined. From Fig. 10, one can see that small spread values give better classification accuracy and that the success rate starts to fall as the spread value increases.
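The role of the spread value can be illustrated with a minimal PNN sketch, assuming the usual Parzen-window formulation with a Gaussian kernel: each class score is the average kernel response of its training patterns, and the predicted class is the one with the largest score. The function name `pnn_predict` and the toy histograms are hypothetical.

```python
import numpy as np

def pnn_predict(X_train, y_train, X_test, sigma=0.1):
    """Probabilistic neural network prediction with a Gaussian kernel.

    sigma is the spread value: small values make each training pattern
    influence only its close neighborhood, large values smooth the
    class densities.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))          # pattern layer
    scores = np.stack(
        [kernel[:, y_train == c].mean(axis=1) for c in classes], axis=1
    )                                                  # summation layer
    return classes[np.argmax(scores, axis=1)]          # decision layer

# Toy label histograms for two interaction classes.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(pnn_predict(X, y, [[0.8, 0.2], [0.2, 0.8]], sigma=0.1))
```

Sweeping `sigma` over a grid and recording the cross-validated accuracy for each value reproduces the kind of curve that a spread analysis reports.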

This graph shows the change in the recognition accuracy on the SBU dataset with respect to spread value in RBF kernel of PNN classifier. The highest accuracy is obtained when we set the spread value to 0.1 and the performance decreases fast when the value is larger than 0.1.
A PNN with spread value (σ) of about 0.1 had a good performance. The testing process was repeated 30 times for each value of σ. We selected 0.1 as σ step because with smaller steps (e.g. 0.01), there was not a significant change in recognition rate and it was quite stable.
Figure 11 contains the results of a LOO application for SBU dataset using PNN classification.

Confusion matrices of PNN classification using LOO cross-validation. The average of all categories we achieve in SBU dataset is 93.09% with manual dictionary extraction (left) and 80.86% with automatic dictionary extraction (right).

Confusion matrices of SVM classification using LOO cross-validation. The average of all categories we achieve in SBU dataset is 95.05% (left) using a third degree polynomial kernel in manual dictionary extraction, which is the highest recognition rate in our approach. Recognition rate in automatic dictionary extraction is 87.23% (right).
In each cross-validation round, one video sequence containing two particular actors is used as the test set, while the rest of the video sequences are used to train the classifier. The results show an average performance (success rate) of 93.09% and 80.86% for the manual and automatic key frame extraction methods, respectively. Accuracy is defined as the ratio of correctly classified samples to all samples. The results are obtained after about 150 training and testing trials.
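The LOO protocol and the accuracy definition above can be sketched as follows. This is a minimal illustration: `loo_accuracy` is a hypothetical helper that accepts any classifier as a function, and the nearest-centroid rule is only a simple stand-in for the PNN, SVM and RDF classifiers used in the experiments.

```python
import numpy as np

def loo_accuracy(X, y, fit_predict):
    """Leave-one-out accuracy: correctly classified samples / all samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    correct = 0
    for i in range(len(X)):
        train = np.arange(len(X)) != i        # hold out sample i
        pred = fit_predict(X[train], y[train], X[i:i + 1])[0]
        correct += int(pred == y[i])
    return correct / len(X)

def nearest_centroid(X_train, y_train, X_test):
    """Stand-in classifier: assign each test sample to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

# Toy label histograms for two interaction classes.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(loo_accuracy(X, y, nearest_centroid))
```

Each sample is tested exactly once by a classifier that never saw it during training, which uses the limited data as fully as possible.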
We also train a one-versus-one multiclass SVM classifier using the LibSVM toolbox, which supports well-known kernel functions [8]. Best results were obtained by a third degree polynomial kernel. Figure 12 illustrates the results of SVM classification using a third order polynomial kernel.
The average interaction recognition accuracy for the eight classes of interactions is 95.05% with manually selected key frames and 87.23% with the automatically extracted dictionary. As illustrated in Fig. 12, three classes are classified with a success rate of 100% and the other categories are classified with a correct recognition rate of over 91%, except the boxing interaction. The worst performance was observed in the last category, at about 87%.

Confusion matrices of SVM classification using LOO cross-validation. The average of all categories we achieve in SBU dataset is 94.11% using a RBF kernel in manual dictionary learning mode (left) and 82.66% in automatic dictionary learning technique (right).
The average recognition accuracy of the RBF kernel with manually selected key frames is 94.11%, which is close to the result obtained with the polynomial kernel (see Fig. 13). In the SVM with RBF kernel, receding is confused with approaching (the lowest recognition rate) since they have similar bilateral silhouettes. It can be observed from Figs 12 and 13 that the SVM classifier with the polynomial kernel performs better than the SVM classifier with the RBF kernel (95.05% versus 94.11% with the manual dictionary and 87.23% versus 82.66% with the automatic dictionary). As explained before, various kernel functions have been developed for SVMs, such as neural network kernels, spline functions, polynomial functions with different degrees, RBF kernels, and Dynamic Time Warping (DTW) kernels. We have tested the effects of different kernels on the dataset using SVM classification.
Using RDF as the classifier, the method achieved an accuracy of 93.17% with the manually selected dictionary. The same setting as for the PNN and SVM classifiers (LOO cross-validation) is used to test the proposed methodology on the SBU interaction dataset. Most of the confusions are due to the similarity of the performed interactions, and the confused classes are similar to those of the SVM and PNN classifiers. Figure 14 shows the results obtained by an RDF with 500 trees.

Confusion matrices of RDF classification using LOO cross-validation. The average of all categories we achieve in SBU dataset is 93.17% using 500 trees in manual dictionary learning mode (left) and 87.77% in automatic dictionary learning technique (right).
Our method outperforms the algorithm introduced by [43], which has an average recognition rate of 91.1% using a MILBoost classifier, whereas our proposed method achieved a rate of 95.05% on the same dataset. Note that they use depth information and extract an articulated skeleton for each person, which is a more complex and time-consuming feature than our 2D shape context feature. They then define joint, plane and velocity features based on the extracted skeleton in the feature extraction stage, and model a MILBoost classifier for each action with true positive and true negative instances. MILBoost is a well-known model in which each instance is classified by a linear combination of weak classifiers. This model is more complex and exhaustive than our two-stage classifier; we have instead concentrated on the feature extraction process to obtain a faster classification procedure, which may help us to perform real-time classification.
In the classification procedure, some errors occur when the receding action is confused with the approaching action (1st and 2nd classes). This is because of the similarity between some of the transition frames while switching from far to close, or vice versa. In other words, there is no decisive boundary that separates the far and close states. Since the confusion arises in the feature extraction block and does not depend on the classifier type, we expect similar errors with any type of classifier. In some cases, confusion occurs between hugging and receding/approaching due to the similarity of the bilateral silhouette. In other cases, there are misclassifications between punching and boxing actions (4th and 8th classes). One possible reason is that some frames in these classes may not be visually separable: the base movement in both interactions is the first actor's hand moving toward the second actor. The same problem also occurs between the shaking hands and giving something actions (5th and 7th classes).
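The quantities read off the confusion matrices (the average recognition rate over categories and the most frequently confused class pairs) can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments; the function names and the 3-class toy counts are hypothetical.

```python
def row_normalize(confusion):
    """Turn raw counts into per-class rates so that each non-empty row sums to 1."""
    normalized = []
    for row in confusion:
        total = sum(row)
        # A row of zeros means the class never appeared; keep it as zeros.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def mean_per_class_rate(confusion):
    """Average of the diagonal of the row-normalized confusion matrix."""
    normalized = row_normalize(confusion)
    n = len(normalized)
    return sum(normalized[i][i] for i in range(n)) / n


def most_confused_pairs(confusion, top=3):
    """Largest off-diagonal entries as (true class, predicted class, count)."""
    n = len(confusion)
    pairs = [
        (i, j, confusion[i][j])
        for i in range(n)
        for j in range(n)
        if i != j and confusion[i][j] > 0
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:top]


# Hypothetical counts (rows: true class, columns: predicted class).
toy_counts = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]

print(mean_per_class_rate(toy_counts))
print(most_confused_pairs(toy_counts, top=1))
```

Row normalization is used here so that classes with different numbers of test sequences contribute equally to the average; whether the reported rates were computed this way is an assumption.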

Confusion matrix for foreground silhouette images with manual dictionary extraction method.

Confusion matrices of set 1 and set 2 for UT-interaction dataset using RDF classifier.
To illustrate the effectiveness of the bilateral silhouette in comparison with the foreground silhouettes of the persons, we tested the shape context matching algorithm on pure foreground silhouette images. A new dictionary is defined using the foreground images, and the low-level classification is done by matching the query foreground images with the dictionary samples. The process is the same as the method explained before, except that the foreground binary image (left part of Fig. 1) is used instead of the bilateral silhouette. The best recognition rate of the sequence classifier, about 77%, was obtained by an SVM with a 2nd order polynomial kernel. The confusion matrix is reported in Fig. 15.
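In both the bilateral and the foreground variants, the sequence classifier receives the normalized histogram of frame labels as the action descriptor. A minimal sketch of that descriptor is given below; the 0-based integer class labels, the function name and the toy label sequence are assumptions made for illustration.

```python
from collections import Counter


def label_histogram(frame_labels, num_classes):
    """Normalized histogram of per-frame class labels for one sequence.

    frame_labels: iterable of integer labels in [0, num_classes).
    Returns a list of length num_classes that sums to 1 (or all zeros
    for an empty sequence).
    """
    labels = list(frame_labels)
    if not labels:
        return [0.0] * num_classes
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(num_classes)]


# Hypothetical frame labels: most frames vote for class 3, a few are
# mislabeled or carry no interaction-specific information.
toy_labels = [3, 3, 3, 0, 3, 1, 3, 3]
print(label_histogram(toy_labels, num_classes=4))
```

Because the descriptor only records label proportions, a small number of misclassified frames shifts the histogram slightly rather than changing the dominant bin.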
We have tested our proposed algorithm on the UT-interaction dataset using PNN, SVM and RDF classifiers, and report confusion matrices only for the best results, which are obtained with the random forests classifier. All stages are the same as in the SBU dataset evaluation, and the validation details are those discussed in the SBU dataset section. The average recognition accuracy for the UT-interaction dataset using PNN classification in a LOO setup with the manually selected dictionary is 88.33% (90.25% for set 1 and 86.41% for set 2). Using the SVM classifier, the average recognition accuracy with the RBF kernel (which offers the highest results among the SVM kernels) is about 81.67% (84.2% for set 1 and 79.14% for set 2). These recognition rates are averages over ten runs. Figure 16 illustrates the results of the RDF classifier for set 1 and set 2, separately. In set 1 (left), confusion occurs between classes 4 (boxing) and 5 (punching). In set 2, interaction 7 (shaking hands) is added to these confusions. This is most likely because of the visual similarity between the performed interactions. The results demonstrate that the method can achieve 95% accuracy for set 1 and 90% accuracy for set 2.
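The LOO protocol and the averaging over the two sets can be sketched as follows. This is only an illustration under stated assumptions: one sequence is held out at a time, each sequence is represented by its label histogram, and a 1-nearest-neighbour rule stands in for the PNN, SVM and RDF classifiers actually used. All names and the toy data are hypothetical.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in classifier: label of the closest training vector (squared L2)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], query))
    return train_y[best]


def leave_one_out_accuracy(features, labels, predict=nearest_neighbour_predict):
    """Hold out each sequence once, train on the rest, and count correct predictions."""
    correct = 0
    for held_out in range(len(features)):
        train_x = features[:held_out] + features[held_out + 1:]
        train_y = labels[:held_out] + labels[held_out + 1:]
        if predict(train_x, train_y, features[held_out]) == labels[held_out]:
            correct += 1
    return correct / len(features)


def mean_over_sets(rates):
    """Plain average of per-set recognition rates (in %)."""
    return sum(rates) / len(rates)


# Hypothetical two-class histograms for four sequences.
toy_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_labels = [0, 0, 1, 1]

print(leave_one_out_accuracy(toy_features, toy_labels))
# RDF rates reported above for set 1 and set 2 of UT-interaction.
print(mean_over_sets([95.0, 90.0]))
```

The `predict` argument is left pluggable so that any sequence classifier with the same call shape could replace the stand-in rule.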
Comparison of the correct recognition rate in (%) for different methods using two datasets
Table 2 summarizes the results achieved by some of the state-of-the-art studies on the UT-interaction dataset. Some of these studies are person-centric methods, which may need complex pre-processing stages, e.g. person detection and tracking, that make the interaction recognition process too slow [25,30,36]. In contrast, we need only a simple foreground extraction as the pre-processing stage. Some of the proposed methods use a bag of words algorithm combined with other features, such as spatio-temporal features, to represent the video sequences [23,34,42]. Such features are fast enough, but they need a post-processing stage to refine the features, while our proposed shape context feature does not need any post-processing. These three methods have also used SVM-based classifiers, which is similar to our approach. The results show that our proposed algorithm, which gives the highest recognition rate on the SBU dataset (95.05%), is also among the best results obtained by the state-of-the-art studies on the UT-interaction dataset. The most common misclassifications are between the boxing, pushing and shaking categories. The probable reason is the visual similarity of these interactions, which has been discussed for the SBU dataset.
Most of the action/interaction recognition applications require real time or at least a fast processing procedure. In terms of time complexity, in the proposed method, the frames will be compared to the dictionary using a point matching algorithm, which takes about 0.007 s to get the matching cost parameter (
Conclusion and future work
In this paper, we introduced a bilateral silhouette based interaction representation and explored its capability in two-person interaction recognition. A binary shape called the bilateral silhouette is extracted, which captures the essential information of each actor's body pose. A label is then assigned to each frame using a shape context matching method. The problem has been formulated as a supervised learning task using PNN, RDF and SVM. The bilateral silhouette based interaction recognition method achieved average rates of 95.05% on the SBU dataset and 92.5% on the UT-interaction dataset. Experiments demonstrate the discriminative power of the new methodology for interaction recognition and show that our method can be a reliable solution for such video processing applications. The main strength of the proposed method is its speed, since it uses RGB data instead of depth or motion capture information. RGB images require only cheap and simple hardware, namely an ordinary camera, and in the proposed method the resolution and quality of the camera are not critical. Using RGB images also has great potential for substantial computational savings compared with depth images. In future work, we intend to explore the possibility of using similar features to speed up the process so that the action can be predicted early. We can also use the depth data from the SBU interaction dataset to improve our classification accuracy.
