Abstract
The most widely used two-stream architectures and building blocks for human action recognition in videos generally consist of 2D or 3D convolutional neural networks. 3D convolution can capture motion information between video frames, which is essential for video classification. 3D convolutional neural networks usually perform better than their 2D counterparts, but they also increase the computational cost. In this paper, we propose a heterogeneous two-stream architecture that incorporates two convolutional networks. One is a mixed convolution network (MCN), which inserts 3D convolutions in the middle of 2D convolutions and is trained on RGB frames; the other adopts the BN-Inception network and is trained on optical flow frames. Considering the redundancy of neighboring video frames, we adopt a sparse sampling strategy to decrease the computational cost. Our architecture is trained and evaluated on the standard video action benchmarks HMDB51 and UCF101. Experimental results show that our approach obtains state-of-the-art performance on HMDB51 (73.04%) and UCF101 (95.27%).
Introduction
Human action recognition [17,32] is an active research topic in computer vision, which is widely used in video surveillance, virtual reality and behavior analysis, etc. Significant progress has been made in the study of human action recognition with the emergence of deep convolutional neural networks (CNNs) [26,30] and large labeled datasets [24,26,35]. The challenge for human action recognition in videos is to capture the inherent temporal dynamics exhibited among continuous video frames.
There are three main ideas proposed to extract temporal information from videos. 1) Simonyan and Zisserman [33] introduced a two-stream network architecture in which a spatial stream processes RGB frames to model appearance and a temporal stream learns features from optical flow image sequences to capture motion information. 2) Du Tran and Lubomir Bourdev [38] proposed the C3D architecture, which directly extracts appearance and motion information from RGB frames. 3) Donahue et al. [10] presented a Long-term Recurrent Convolutional Network (LRCN) that leverages CNNs and LSTM to learn long-term dependencies. Despite the significant efforts made in the last decade, the major challenge remains finding an effective network architecture that captures spatio-temporal information between neighboring video frames. Deep learning based methods have recently achieved remarkable success in many applications, including human action recognition in videos. Most of the recent work [2,31] is based on a combination of two-stream methods with 3D convolutions. However, using 3D CNN architectures for both the spatial stream and the temporal stream is computationally expensive, whereas combining two-stream methods with 2D convolutions loses some spatio-temporal information between video frames.
In this paper, we propose a novel heterogeneous two-stream network called HTSNet, which uses a mixed convolution network (MCN) to train on RGB frames and a BN-Inception [16] architecture to process optical flow sequences. Inspired by Temporal Segment Networks (TSN) [43], we divide each video into N segments. Considering the redundancy of video frames in each segment, we use a sparse sampling strategy and randomly select several frames in each segment as inputs to HTSNet to decrease the computation cost. TSN chooses BN-Inception as its building block; its spatial stream and temporal stream have identical structures, the two streams are trained independently, and the class scores of different segments are fused to produce the final prediction. Unlike TSN, our proposed architecture uses two different building blocks to train on RGB and optical flow frames, respectively. Moreover, we construct a network structure with mixed 2D and 3D convolutions to learn spatio-temporal features across RGB frames. Our network model is pre-trained on the Kinetics dataset [24], and the proposed approach achieves state-of-the-art performance on the standard video datasets UCF101 [35] and HMDB51 [26].
The contributions of this work are as follows:
We proposed a mixed convolution network called MCN. MCN and BN-Inception are used to build a novel heterogeneous two-stream architecture called HTSNet which can capture joint spatio-temporal features for human action recognition in videos. The strategy of sparse sampling is used in our proposed HTSNet to capture the long-term context and decrease computation cost. Experiments on two standard benchmark datasets show that the proposed heterogeneous two-stream network model is effective and outperforms existing methods.
Related works
Two-stream 2D CNNs
Two-stream CNNs were first presented by Simonyan and Zisserman [33] for human action recognition. The two-stream architecture consists of one spatial stream and one temporal stream, both of which are 2D CNNs. The spatial stream takes RGB video frames as inputs and the temporal stream takes stacked optical flow frames as inputs; the prediction scores from both streams are fused to produce the final prediction. However, 2D convolution is inherently limited in capturing joint spatial and temporal information, which limits its prediction accuracy.
To capture long-range temporal dynamics, Wang et al. [16] proposed the Temporal Segment Network (TSN), an improved two-stream 2D CNN whose building block is BN-Inception. TSN uses a sparse sampling strategy to learn long-range temporal features over the whole video. We also adopt sparse temporal sampling to randomly sample video frames, which models long-range temporal structure effectively. Other methods also belong to the family of two-stream 2D CNNs [5,25].
3D CNNs
3D CNNs can directly learn joint spatial and temporal features when applied to consecutive video frames. Shuiwang and Wei et al. [21] developed a 3D convolution and designed a seven-layer 3D CNN architecture for human action recognition; their method achieved competitive performance on the TRECVID and KTH datasets. Du and Lubomir [38] proposed the C3D network, which can extract appearance and motion information simultaneously. Moreover, they empirically found that
CNNs with LSTM
Since CNNs have achieved great success in image recognition [1,13,27,30] and Recurrent Neural Networks (RNNs) such as LSTMs can learn long-range temporal dynamics, it is natural to combine CNNs with RNNs for action recognition in videos. A basic approach is to add the LSTM cells in CNNs, which can encode hidden states, capture temporal sequences and long-range dependencies. Donahue et al. [10] proposed a Long-term Recurrent Convolutional Network (LRCN), which fed spatial features learned from a CNN into a stack of recurrent sequence models to learn long-term dependencies, the model has the ability of learning temporal dynamics and convolutional perceptual representations simultaneously. To further exploit spatio-temporal dynamics within video sequences, Chih-Yao Ma et al. proposed a model called Temporal-ConvNets for extracting spatio-temporal information [29] and compared the differences and the distinguishing factors between various models using RNNs in CNNs. Dhiman and Vishwakarma proposed a view-invariant two-stream deep human action recognition framework, which is a fusion of shape temporal dynamic stream and motion stream [8]. Their analysis validated that these methods require proper care to obtain state-of-the-art performance and have specific limitations.
Two-stream with 3D CNNs
Most recent works on human action recognition [4,22] are based on two-stream methods with 3D CNNs. The network architectures of these methods also include two streams, i.e., a spatial stream and a temporal stream. However, these works choose 3D convolutions as building blocks. Andrew Zisserman et al. [2] presented the Two-Stream Inflated 3D ConvNet (I3D), which uses 3D Inception-V1 as its building block and considerably improves the state of the art in action recognition in videos. Zhaofan et al. [31] proposed the Pseudo-3D Residual Net (P3D ResNet), which exploits various bottleneck structures to make the network deeper with fewer parameters, and the method achieves superior performance over several state-of-the-art techniques. In this paper, we present a novel two-stream architecture with 3D CNNs, which combines 2D convolution and 3D convolution to learn spatio-temporal features; our proposed network model has fewer trainable parameters and largely reduces the computational cost.
TDN and beyond
At present, researchers are also actively exploring other types of networks to extract temporal and spatial information from videos. Liu et al. [28] designed the TANet architecture to learn the complex temporal dynamics of video data in an adaptive way by introducing a temporal adaptive module with video-specific kernels. The kernel parameters are divided into position-sensitive adaptive weights and a position-independent adaptive convolution kernel so that the network captures motion more accurately. Wang et al. [42] proposed the TDN (Temporal Difference Networks) model, in which a short-term network and a long-term network are designed to capture local and global dynamic information of actions. This design further improves the accuracy of the model. Islam et al. [19] devised a novel SNSP descriptor to obtain complex spatial information among all skeletal joints by using the prominent joint. Ronghao et al. [32] constructed a spatial module, which uses a residual GCN network with a channel attention block to extract high-level spatial features, and a temporal module, which uses a multi-scale neural network to extract temporal features at different scales. Vishwakarma and Dhiman proposed a new hybrid technique for the description of human action and activity in video sequences, which integrates both global and local details of the action [40]. In the works [6,7], some feature descriptors were presented to detect abnormal actions. Essentially, action recognition descriptors and spatio-temporal modules are also designed to capture spatio-temporal features, which can be used for human action classification [20,37].
Proposed model
The two-stream architecture is one of the most popular and powerful methods for human action recognition in videos. In this paper, we presented a novel heterogeneous two-stream network for human action recognition in videos, as shown in Fig. 1.

Heterogeneous two-stream architecture for human action recognition in videos.
Our proposed heterogeneous two-stream network consists of two components, i.e., the spatial part and the temporal part. Each video is divided into N segments of equal size, and RGB and optical flow images are randomly sampled from those segments for the two streams; all inputs follow the weight sharing principle. We apply global average pooling in each stream to fuse the frames sampled from the segments into a video-level prediction, and the final prediction is produced by fusing the two streams with an adaptive weighted average strategy.
Our network architecture is similar to Two-Stream ConvNets [33]: RGB frames and TV-L1 optical flow [3] are fed into the spatial stream network and the temporal stream network, respectively. However, the two components are completely different. For the spatial stream, we devise the MCN, which combines 2D CNNs and 3D CNNs, to learn more spatio-temporal features. Compared with 2D CNNs, 3D CNNs can capture spatio-temporal information. However, 3D CNNs have more parameters, which increases model complexity and training overhead. The MCN obtains better results by fusing 2D CNNs and 3D CNNs.
For the temporal stream, the inputs are optical flows, which already contain motion information, so we use only 2D CNNs to capture the motion features and decrease the computational cost. The spatial and temporal streams are trained separately, and the prediction results are obtained by taking a weighted average of the two streams.
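The parameter overhead of 3D convolution can be illustrated with a small count. The sketch below compares one 3×3 2D convolution with one 3×3×3 3D convolution of the same channel width; the helper names and the width of 64 channels are illustrative assumptions, not values taken from the MCN.

```python
def conv2d_params(c_in, c_out, k=3, bias=True):
    """Trainable parameters of a k x k 2D convolution layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def conv3d_params(c_in, c_out, k=3, bias=True):
    """Trainable parameters of a k x k x k 3D convolution layer."""
    return c_in * c_out * k * k * k + (c_out if bias else 0)


# Hypothetical layer width, chosen only for illustration.
p2d = conv2d_params(64, 64)
p3d = conv3d_params(64, 64)
print(p2d, p3d, p3d / p2d)
```

With a kernel size of 3, the 3D layer carries roughly three times the weights of the 2D layer, which is one reason to restrict 3D convolutions to part of the network.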
Spatial stream
The MCN is designed for the spatial stream to process the RGB frames, and it consists of a 2DConvNet and a 3DConvNet, as shown in Fig. 2. We choose the BN-Inception [43] architecture as our 2DConvNet because it offers a good balance between accuracy and efficiency. ResNet introduces shortcut connections that simply perform identity mapping, and their outputs are added to the outputs of the stacked layers; the degradation problem is well addressed in this setting [14]. We therefore adopt the 3D ResNet architecture as our 3DConvNet. The architectures of the 2DConvNet and 3DConvNet presented in this paper are shown in Fig. 3 and Fig. 4.

The mixed convolution network (MCN). The upper half is a 2DConvNet which consists of two convolutional modules, ten inception modules and one average pool module. The lower half is a 3DConvNet which contains one convolutional module, five 3D-Resnet18 modules and one average pool module. There is a shortcut between the 2DConvNet and the 3DConvNet, i.e., the output of the Inc-3c of the 2DConvNet was fed into the 3DConvNet. Inc-3c is a BN-Inception module which outputs 576 feature maps.


The 3DConvNet architecture.
In our proposed MCN, the RGB frames are fed into the 2D ConvNet, and a shortcut is added between the 2D ConvNet and the 3D ConvNet. The 2D ConvNet is used to extract static image features, and the 3D ConvNet takes those static features as inputs through the shortcut to learn spatio-temporal features. The outputs of the 2D ConvNet and the 3D ConvNet are fused after a global average pooling layer, and the final output is a probability vector over the class labels.
2DConvNet: The 2DConvNet is used to extract the features of each RGB frame of inputs. The architecture of our proposed 2DConvNet is summarized in Fig. 3. It mainly contains two convolutional layers and ten Inception modules as shown in Fig. 3a. From Fig. 2 we know that the output of Inception-3c layer shall be fed into Inception-4a and our proposed 3DConvNet. For each frame, the output of inception-3c layer has 576 feature maps and their sizes are
3DConvNet: We adopt several modules of 3D-Resnet18 [14], which is an efficient architecture in some video classification works [14,23]. The architecture of the 3DConvNet proposed in the paper is shown in Fig. 4. All the convolution kernels are set to size
The 3DConvNet takes the stacked representations produced by 2DConvNet as input. The number of frames fed into the network is N. After convolution, each frame will obtain C channels, and the size of each channel is
In order to feed the output produced by the 2DConvNet into the 3DConvNet, we need to expand it along the time dimension. The feature maps obtained by the 2DConvNet will first be expanded along the time dimension from

The expansion of the feature map outputted from 2D convolution along the time dimension.
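The expansion along the time dimension can be sketched with plain nested lists. This is a minimal illustration under an assumed layout: the N per-frame feature maps of shape [C][H][W] are stacked into one channel-first volume of shape [C][N][H][W], the ordering that 3D convolution layers in common frameworks expect. The function names and toy sizes are hypothetical, and the exact shapes used in the paper are not reproduced here.

```python
def stack_along_time(frame_features):
    """Rearrange N per-frame maps of shape [C][H][W] into one [C][N][H][W] volume."""
    n_frames = len(frame_features)
    n_channels = len(frame_features[0])
    return [
        [frame_features[t][c] for t in range(n_frames)]
        for c in range(n_channels)
    ]


def shape(x):
    """Shape of a regular nested list, read from the first element at each depth."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# Toy input: N=4 frames, C=3 channels, 2x2 spatial maps (illustrative sizes only).
frames = [[[[t * 10 + c] * 2 for _ in range(2)] for c in range(3)] for t in range(4)]
volume = stack_along_time(frames)
print(shape(frames), shape(volume))
```

Each channel of the resulting volume holds the same channel of every sampled frame in temporal order, so a 3D kernel can slide across frames as well as across space.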
Our temporal stream ConvNet is a 2D convolutional neural network whose architecture is the same as the 2DConvNet used in the MCN. The temporal stream operates on multiple-frame optical flow, similar to two-stream convolutional networks [33]. The input of the temporal stream is formed by stacking optical flow displacement fields between several consecutive frames. Optical flow and RGB frames are different kinds of data. The optical flow data contain motion information (the motion trend of pixels) that is not available in RGB frames, which can improve the recognition of actions in videos. Figure 6 presents an optical flow displacement between two consecutive frames. The top of Fig. 6 shows two consecutive video frames, and the bottom shows the horizontal optical flow and the vertical optical flow calculated from the top frames.
Using 3D CNNs in the temporal stream ConvNet would require a long training time. To balance computation overhead and performance, we choose the BN-Inception [43] used in the spatial stream to train and test on optical flow stacks; the temporal stream ConvNet is shown in Fig. 3.

Optical flow data.
We employ a sparse sampling strategy to sample frames from the entire video. First, we divide the video into N segments, each with the same number of frames. Then, we randomly sample one RGB frame and 10 stacked optical flows (horizontal and vertical directions) from each segment. This ensures that the sampled frames come from different parts of the video. Moreover, we can capture long-distance relationships among frames because the sampled frames come from different segments and are not consecutive. Singh et al. [34] proposed a sparse coded composite descriptor for human activity recognition; however, their method blends state-of-the-art handcrafted features with the discriminative nature of the sparse representation of visual information.
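The segment-based sampling described above can be sketched in a few lines. The function name, the handling of leftover frames, and the example sizes (160 frames, 16 segments) are assumptions made for illustration.

```python
import random


def sparse_sample(num_frames, num_segments, rng=random):
    """Pick one random frame index from each of num_segments equal-length segments."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames // num_segments  # trailing remainder frames are ignored
    return [s * seg_len + rng.randrange(seg_len) for s in range(num_segments)]


# Hypothetical video length and segment count; a seeded generator keeps runs repeatable.
rng = random.Random(0)
indices = sparse_sample(num_frames=160, num_segments=16, rng=rng)
print(indices)
```

Because exactly one index is drawn per segment, the sampled frames always span the whole video, while the cost per video stays proportional to the number of segments rather than the number of frames.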
In our proposed method, a video V will be divided into N segments
The final loss function of category consensus
According to Eq. (3), when we use the mini-batch stochastic gradient descent algorithm to optimize the model parameters W, all segments participate in the gradient update. Therefore, our model parameters are learned from the whole video V rather than from a single segment. Randomly sampling from each segment greatly reduces the computational complexity compared with methods that sample densely from the whole video [39]. With the proposed sampling method, video-level model parameters can be learned that capture long-range temporal information.
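A minimal sketch of a video-level loss of this kind follows. It assumes an averaging consensus over per-segment class scores followed by the standard softmax cross-entropy, in the spirit of TSN; the function names and toy scores are hypothetical, and the exact form of the paper's equations is not reproduced here.

```python
import math


def segment_consensus(segment_scores):
    """Average per-segment class scores into one video-level score vector."""
    n_segments = len(segment_scores)
    n_classes = len(segment_scores[0])
    return [sum(s[c] for s in segment_scores) / n_segments for c in range(n_classes)]


def cross_entropy(scores, label):
    """Softmax cross-entropy of a score vector against an integer class label."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in scores))
    return log_sum_exp - scores[label]


# Toy scores for 3 segments and 3 classes (illustrative values only).
segment_scores = [[2.0, 0.5, 0.1], [1.0, 1.5, 0.2], [3.0, 0.0, 0.3]]
video_scores = segment_consensus(segment_scores)
loss = cross_entropy(video_scores, label=0)
print(video_scores, loss)
```

Since the loss is computed on the averaged scores, its gradient flows back to every segment with an equal share, which is why all segments contribute to each parameter update.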
Experiment
In this section, we first briefly introduce the datasets used in our experiments. Then, the data augmentation policy is described in detail. Finally, we elaborate our training details and test details, respectively.
Dataset
To validate our proposed heterogeneous two-stream network, we train our network on the HMDB51 [26] and UCF101 [35] datasets. HMDB51 contains 6849 video clips in 51 action classes. UCF101 is a larger human action classification dataset composed of 13320 video clips, each of which belongs to one of 101 action categories. Compared with HMDB51, UCF101 has more classes and more video clips.
Data augmentation
Data augmentation plays an important role in the performance of deep neural networks. Generally speaking, data augmentation can increase training samples and it is very helpful when data is limited. In our experiment, we adopt random cropping, horizontal flipping, corner cropping and scale jittering technique introduced in [16]. First, the RGB and optical flow frames are resized to
Train detail
All experiments are performed on split 1 of the two datasets. We use the SGD algorithm with momentum to train our network; the momentum is set to 0.9, and the weight decay is
Test detail
For the spatial stream, we do not use any data augmentation; we simply feed the 16 RGB frames into the network, which produces a score file containing the video labels and the network outputs. For the temporal stream, we first crop the 4 corners and the center of each sampled frame and then flip the crops horizontally, which produces ten times as much data; this stream also produces a score file like the spatial stream. To fuse the spatial stream and the temporal stream, we take a weighted average, as in TSN. In TSN, the predictions from the spatial stream and the temporal stream are fused with an averaging strategy to produce the final prediction. In our method, however, the weights are not fixed; we select the best weights automatically in an adaptive way. After training the spatial stream and the temporal stream, we fix their parameters and then train the fusion weights to obtain the best weights.
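The fusion step can be illustrated with a small sketch. The procedure used to train the fusion weights is not detailed in the text, so the example below substitutes a simple grid search over a single spatial weight w, with the temporal weight set to 1 - w; the function names and toy scores are hypothetical.

```python
def fuse(spatial, temporal, w):
    """Weighted average of two per-class score vectors; w is the spatial weight."""
    return [w * s + (1.0 - w) * t for s, t in zip(spatial, temporal)]


def accuracy(spatial_scores, temporal_scores, labels, w):
    """Top-1 accuracy of the fused scores for a given spatial weight."""
    correct = 0
    for s, t, y in zip(spatial_scores, temporal_scores, labels):
        fused = fuse(s, t, w)
        predicted = max(range(len(fused)), key=fused.__getitem__)
        correct += int(predicted == y)
    return correct / len(labels)


def search_weight(spatial_scores, temporal_scores, labels, steps=20):
    """Grid-search the spatial weight in [0, 1] that maximises accuracy."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda w: accuracy(spatial_scores, temporal_scores, labels, w))


# Toy held-out scores from the two frozen streams (illustrative values only).
spatial = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
temporal = [[0.2, 0.8], [0.1, 0.9], [0.6, 0.4]]
labels = [0, 1, 0]
best_w = search_weight(spatial, temporal, labels)
print(best_w, accuracy(spatial, temporal, labels, best_w))
```

In practice the weight would be selected on held-out data rather than on the test split, and a gradient-based fit of the weights is an equally plausible reading of the text.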
Result
In this section, we first discuss the performance of MCN with different numbers of RGB frames and evaluate our proposed heterogeneous two-stream network on the UCF101 and HMDB51 datasets. Then, the effect of the 3DConvNet introduced in Section 3.2 is verified through an experimental study. Finally, we compare the proposed method with the state of the art to validate it.
Number of sampled frames
Table 1 shows the classification accuracy and runtime of MCN when training and testing with different numbers of sampled frames on split 1 of the UCF101 dataset. As expected, the classification accuracy improves as the number of sampled frames increases, possibly because more of the important parts of an action are captured with more sampled data.
We also measure the training time of the network for different numbers of input frames. Taking both accuracy and efficiency into account, we choose 16 RGB frames as the final input to the spatial stream network MCN.
The Top-1 classification accuracy on split 1 of UCF101 dataset with different number of sampled RGB frames
Top-1 classification accuracy on the UCF101 dataset and HMDB51 dataset
The performance of our proposed heterogeneous two-stream network is summarized in Table 2. The results in Table 2 follow the evaluation protocol provided by the HMDB51 and UCF101 datasets and use the standard training/testing splits. The spatial stream is based on MCN pretrained on Kinetics, and the temporal stream uses the BN-Inception network pretrained on ImageNet. There are several noteworthy observations in Table 2.
First, the performance of the spatial stream and the temporal stream on the HMDB51 dataset is very close, which implies that the spatial stream model can also achieve good results with a carefully designed network or learning strategy.
Second, spatial stream achieves better performance than temporal stream on UCF101 dataset. This indicates that larger datasets are more conducive to extracting appearance information.
Third, the fusion of the spatial stream and the temporal stream achieves average accuracies of 95.27% and 73.04% on the UCF101 and HMDB51 datasets, respectively, which shows that the two-stream method is a powerful approach and that the accuracy of human action recognition can be greatly improved by fusing spatial stream and temporal stream information.

The ROC curves of HTSN on dataset UCF101.
Finally, the performance on all splits of the HMDB51 dataset is far lower than that obtained on the UCF101 dataset, which indicates that the difficulty levels of the two datasets differ; another reason may be the lack of training data in HMDB51.
The receiver operating characteristic (ROC) curves, which plot the true positive rate (TPR) against the false positive rate (FPR), in Fig. 7 and Fig. 8 also support the claim that our proposed heterogeneous two-stream architecture yields a better action representation for the UCF101 and HMDB51 datasets. We have also computed the area under the ROC curve (AUC). As shown in Fig. 7 and Fig. 8, the obtained AUC values are 0.99, 0.99, 0.97 and 0.96 for the micro-average and macro-average ROC curves on UCF101 and HMDB51, respectively. The ROC curves show that our proposed HTSN performs well, which supports the claim that fusing the scores of the two heterogeneous streams increases the number of correctly selected true samples, i.e., improves the true positive rate (TPR).
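The difference between micro-average and macro-average AUC can be made concrete with a small sketch. It computes a binary AUC as the probability that a positive sample is ranked above a negative one (ties count one half), then averages it in the two ways for a one-vs-rest multi-class setting; the function names and toy scores are hypothetical and unrelated to the reported results.

```python
def binary_auc(scores, targets):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, targets) if t == 1]
    neg = [s for s, t in zip(scores, targets) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(score_rows, labels, n_classes):
    """Unweighted mean of the one-vs-rest AUCs of all classes."""
    aucs = [
        binary_auc([row[c] for row in score_rows], [int(y == c) for y in labels])
        for c in range(n_classes)
    ]
    return sum(aucs) / n_classes


def micro_auc(score_rows, labels, n_classes):
    """AUC over all (sample, class) pairs pooled into one binary problem."""
    scores = [row[c] for row in score_rows for c in range(n_classes)]
    targets = [int(y == c) for y in labels for c in range(n_classes)]
    return binary_auc(scores, targets)


# Toy class scores for 4 samples and 3 classes (illustrative values only).
score_rows = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.4, 0.5, 0.1]]
labels = [0, 1, 2, 0]
print(micro_auc(score_rows, labels, 3), macro_auc(score_rows, labels, 3))
```

The macro average treats every class equally, whereas the micro average pools all decisions, so the two can differ when scores are not calibrated consistently across classes.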

The ROC curves of HTSN on dataset HMDB51.

Visualization of the first convolution kernels of our proposed two-stream model.
To further illustrate the learned network, we visualized the first convolution kernels of two streams as shown in Fig. 9. In Fig. 9, 64 filters are from the first convolution layer of our two stream network, each filter size is
An ablation study is provided in this section to observe the effect of our proposed MCN. The MCN is a hybrid of 2D and 3D convolutions. To demonstrate the effectiveness of the 3DConvNet, we compare the following two settings: MCN with 3DConvNet and MCN without 3DConvNet. The results are given in Table 3. From Table 3, we can see that the performance achieved by MCN with 3DConvNet is better than that obtained by MCN without 3DConvNet. Since the 3DConvNet can learn relationships between the sampled frames, the MCN with 3DConvNet improves on the MCN without 3DConvNet by 1.8% to 5.3% in terms of top-1 video-level accuracy. This result supports the effectiveness of our proposed MCN with the 3DConvNet and validates that combining the spatio-temporal features learned by the 3DConvNet with the appearance information learned by the 2DConvNet is advantageous for recognizing video actions.
Moreover, we found that the training time of the model without 3DConvNet was about 10 hours, whereas the training time of the model with 3DConvNet was about 5 hours. The MCN with 3DConvNet thus trains in about half the time of the MCN without 3DConvNet; the main reason is that the model with 3D convolutions converges faster and requires only a small number of epochs.
An ablation study to observe the effect of 3DConvNet in designing the proposed MCN on the HMDB51 and UCF101 dataset
We now compare the performance of our proposed heterogeneous two-stream network with state-of-the-art approaches on the UCF101 and HMDB51 datasets in Table 4. The experiments are performed on the three splits of the two datasets, and the results in Table 4 are the average values. Our proposed method achieves 95.27% and 73.04% accuracy on the UCF101 and HMDB51 datasets, respectively. From Table 4, we can see that our network is superior to all existing methods except I3D [2], which uses Inflated Inception-V1 pretrained on the ImageNet and Kinetics datasets. For the HMDB51 dataset, compared with the classic two-stream network [33] and TSN [16], our heterogeneous two-stream network improves Top-1 accuracy by 22% and 5.2%, respectively. Our method also achieves a 5.8% improvement over TS-LSTM [29], which operates an LSTM over high-level representations generated from the spatial and temporal streams.
Moreover, our result using only the spatial stream model is 0.1% higher than that of C3D [38], which uses only 3D convolution to learn spatio-temporal features. This once again demonstrates the value of the mixed convolution and the effectiveness of our MCN.
The experimental results in Table 4 show that our proposed heterogeneous two-stream network achieves good performance. More importantly, our proposed spatial stream model MCN, which uses 2D convolution plus 3D convolution to learn features of video actions, performs better than models that use only 3D convolution networks. Our proposed spatial stream architecture of mixing 2D and 3D convolutions has advantages in processing video data. In addition, the fusion of the spatial and temporal streams consistently achieves good performance; to some extent, fusing multiple kinds of data can significantly improve the classification results.
Comparison with state-of-the-art on the UCF101 and HMDB51 datasets
Human action recognition is currently a research hotspot in image analysis and computer vision. In the present study, we designed a mixed convolution network to construct a heterogeneous two-stream network for human action recognition in videos. Our approach uses the mixed convolution network to train on RGB frames for learning spatial information and the BN-Inception network to train on optical flow frames for capturing the changes in motion over time. As demonstrated on the two challenging datasets UCF101 and HMDB51, our work achieves good performance and outperforms some of the state-of-the-art methods.
Human action recognition approach based on two-stream network has been greatly developed as one of mainstream methods. The two-stream network architecture is usually composed of a spatial stream network and a temporal stream network, which extract spatial and temporal features respectively. The fused spatio-temporal features are then used for human action classification. However, our proposed approach has two key differences from conventional approaches. Our method does not focus on 3D convolutions, but instead builds in a mixed convolution. There is a shortcut between the 2D and 3D convolutions. Moreover, sparse sampling is used to capture the long distance relationship among frames as the location of sampled frames are from different segments and are not consecutive, and to decrease the calculation cost. The experiments validate the effectiveness of our method and provide a new insight into the spatio-temporal feature extraction.
In our spatial stream model MCN, the mixed convolution and the sparse sampling strategy are very important for learning spatio-temporal features between frames that are far apart. The experimental results show that the spatial stream model performs better than the temporal stream model, which once again verifies the validity of our proposed spatial stream model MCN. The optical flow data not only contain motion information of human actions but also contain the feature difference between two frames, which makes them valid and comprehensive in representing motion information for human action recognition [36]. Therefore, we select optical flow data as the input of our temporal stream model to extract temporal features for video action recognition. Experiments have shown that 2D convolutional networks perform better than 3D convolutional networks when the input is optical flow data [37]. Therefore, our temporal stream model uses a 2D convolutional network.
Some approaches use other data modalities, such as skeleton data and event streams, for human action recognition [18,44]. The experimental results show that our temporal stream model performs worse than our spatial stream model, so there may be opportunities for future research on designing powerful architectures that fully utilize the hidden information in optical flow frames or on using other data modalities for human action recognition.
Conclusions
In this paper, we improved on the traditional two-stream and TSN approaches and developed a heterogeneous two-stream network, which uses a mixed convolution network to train on RGB frames for learning spatial information and a BN-Inception network to train on optical flow frames for capturing the changes in motion over time. As demonstrated on the two challenging datasets UCF101 and HMDB51, our work achieves good performance and outperforms some of the state-of-the-art methods. In our spatial stream model MCN, the mixed convolution and the sparse sampling strategy are very important for learning spatio-temporal features between frames that are far apart. Experiments have validated the feasibility and effectiveness of our proposed method. The experimental results also show that the spatial stream model performs better than the temporal stream model, so there may be opportunities for future research on designing powerful architectures that fully utilize the hidden information in optical flow frames.
Acknowledgements
This work was supported by the National Key R&D Program of China titled the Large-Scale Longitudinal and Cross-Sectional Study of Student Development (grant number: 2021YFC3340800) and the National Natural Science Foundation of China (grant numbers: 62077023 and 61937001).
