Abstract
Most recent approaches for action recognition from video leverage deep architectures to encode the video clip into a fixed length representation vector that is then used for classification. For this to be successful, the network must be capable of suppressing irrelevant scene background and extracting the representation from the most discriminative part of the video. Our contribution builds on the observation that spatio-temporal patterns characterizing actions in videos are highly correlated with objects and their location in the video. We propose Top-down Attention Recurrent VLAD Encoder (TA-VLAD), a deep recurrent neural architecture with built-in spatial attention that performs temporally aggregated VLAD encoding for action recognition from videos. We adopt a top-down approach to attention, using class specific activation maps obtained from a deep Convolutional Neural Network pre-trained for generic image recognition to weight appearance features before encoding them into a fixed-length video descriptor with a Gated Recurrent Unit. Our method achieves state-of-the-art recognition accuracy on the HMDB51 and UCF101 benchmarks.
Introduction
Despite recent advances in deep learning that have brought large performance improvements in computer vision tasks such as image recognition [13], object detection [23], and semantic segmentation [22], video action recognition remains a challenging task. This can be attributed to two main reasons. The first is the lack of large scale video datasets that would allow deep networks with millions of parameters to be tuned effectively to the given task; this can be partly solved by pre-training the network on large image datasets such as ImageNet [24]. The second, and most important, challenge lies in the nature of the data itself, i.e., the varying duration of action instances and the huge variability of action-specific spatio-temporal patterns in videos. The former can be addressed by sampling operations such as max pooling or average pooling, while the latter requires a careful design of a network structure capable of encoding the spatial information present in each frame in relation to how it evolves in subsequent frames with each action instance.
In this paper, we propose to use spatial attention during the feature encoding process to address this problem. Spatial attention has proven useful in several applications such as image captioning [38], object localization [33], saliency prediction [37], and action recognition [25]. The majority of these works use bottom-up attention, whereas we propose to use top-down attention. Both attention mechanisms are used by the human brain for processing visual information [18]. Bottom-up attention is based on the salient features of regions in the scene, such as how one region differs from another, while top-down attention uses internally guided information based on prior knowledge, such as the presence of objects and how they are spatially arranged. Since the majority of actions are correlated with the objects being handled, we propose to make use of the prior information embedded in a Convolutional Neural Network (CNN) trained for the image classification task to identify the location of the objects present in the scene. The location information is decoded from the raw video frames in the form of attention maps that weight the spatial regions providing discriminant information for differentiating one class from another. To generate the attention map, we build on class activation mapping [41], which was originally proposed for fine-grained image recognition. Once the pertinent objects in the frames are identified and located, we aggregate the features from these regions into a fixed-length descriptor of the video. This is illustrated in Fig. 1, in which the object that is representative of the action class, the bow, changes its position in subsequent frames. We use recurrent memory cells with learnable parameters to learn such spatio-temporal aggregation for video classification.

The figure shows the video frames taken from the action class ‘shoot bow’. The object that is representative of the action class, the bow, is changing its position in subsequent frames and determines the frame regions for feature extraction. In addition, the temporal order with which the image objects evolve into the action is relevant for aggregating frame features into a video descriptor.
This paper makes the following contributions. We present Top-down Attention Recurrent VLAD Encoder (TA-VLAD), an end-to-end trainable deep architecture that integrates top-down spatial attention with temporally aggregated VLAD encoding for action recognition in videos. TA-VLAD uses (i) class specific activation maps obtained from a deep CNN pre-trained for image recognition as the spatial attention mechanism, (ii) a latent cluster representation of the feature space, and (iii) a Gated Recurrent Unit (GRU) for temporal encoding in the cluster space. TA-VLAD can be trained end-to-end using video-level annotations, that is, the parameters of (i) and (iii) together with the compact representation of the feature space (ii) are learned from videos paired with action class labels. We perform an experimental validation of the proposed method on two of the most popular action recognition datasets and compare the results with the state-of-the-art. We release an implementation in PyTorch at https://github.com/swathikirans/ta-vlad
The paper is organized as follows. We discuss related works in Section 2. TA-VLAD is presented in Section 3, followed by our analysis of experimental results in Section 4. In Section 5 we present our conclusions.
Action recognition
Compared to the image recognition problem, which requires encoding of spatial information alone, the action recognition problem demands the encoding of spatio-temporal patterns present in multiple frames of a video. Several techniques have been proposed for extracting the spatio-temporal information present in videos. Simonyan and Zisserman [26] propose to use stacked optical flow along with video frames for encoding both appearance and motion information present in the video. A large improvement in recognition accuracy was observed after adding the optical flow stream to the image based CNN, which indicates that a simple CNN has limited capability in capturing spatio-temporal information. Wang et al. [35] propose to combine the improved dense trajectory method with the two stream network of [26] to perform an effective pooling of the convolutional feature maps. The original two stream network [26] is further improved in [9] by adding residual connections from the motion stream to the appearance stream. Wang et al. [36] propose to apply a two stream network to several segments of the video and predict the action class of each segment, followed by a segment consensus function for predicting the action class of the video. This enables the network to model long term temporal changes. A 3D convolutional network is proposed by Tran et al. [34] to encode spatio-temporal information from RGB images. Carreira and Zisserman [3] propose to combine this model with the two stream model to further improve the spatio-temporal information captured by the network. Several techniques that use Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) [7, 25] and Convolutional Long Short-Term Memory (ConvLSTM) [29, 30], have also been proposed for encoding long term temporal changes. Girdhar et al. [11] propose an end-to-end trainable CNN with a learnable spatio-temporal aggregation technique.
Attention for action recognition
A number of recent works on video based action recognition incorporate an attention mechanism [8, 40]. Girdhar et al. [10] propose an attentional pooling layer by extending the average pooling operation to weighted average pooling. Sudhakaran and Lanz [31] propose an object-centric attention model for egocentric activity recognition utilizing the class-specific activations from a CNN pre-trained for generic image recognition. The aforementioned approaches generate attention maps independently for each frame, thereby overlooking the temporal correlation between the frames of a video. This limitation is addressed in [8, 40]. Sharma et al. [25] and Zhang et al. [40] propose to generate the attention weights from the hidden state of an RNN such as an LSTM. Since the hidden state of the RNN encodes information about the previous frames, the attention maps are generated taking into consideration the information present in the previous frames of the video. Li et al. [21] propose to use a ConvLSTM to encode the spatio-temporal information, with the attention map generated from a single RGB frame and its corresponding optical flow. Sudhakaran et al. [28] extend the ConvLSTM module with built-in attention and coupled output gating to localize and track discriminative features across a sequence, obtaining state-of-the-art performance in egocentric activity recognition. Du et al. [8] propose to use spatio-temporal attention in order to weight the relevant spatial regions in relevant frames. This is based on the idea that not all frames are equally important in identifying the action taking place in the video.
Discussion
Sequential encoding of convolutional features using RNNs has been found to be successful for action recognition, the most notable work being LRCN [7]. In this work, the convolutional features (feature tensor) from a CNN are flattened and applied to an LSTM module for temporal encoding. This approach has two drawbacks: first, the flattening operation results in the loss of the spatial structure of the scene, and second, it treats all spatial regions as equally important, which is not the case. Sudhakaran and Lanz [31] address these two issues by replacing the LSTM with a ConvLSTM, so that the feature tensor can be applied directly for temporal encoding without spatial flattening, and by applying spatial attention on the feature tensor to weight relevant feature regions. Their approach has resulted in improved recognition performance in egocentric activity recognition. Even though attention is used to weight important regions, the ConvLSTM module considers the feature tensor as a whole and, as a result, the network encodes the motion changes at a global level. Vector of Locally Aggregated Descriptors (VLAD) encoding transforms the feature tensor to a latent cluster representation by aggregating similar local features. Tracking the transition of these clusters enables the encoding of local motion information across the frames. In ActionVLAD [11], the VLAD feature descriptors from each frame of a video are averaged together to obtain a video level descriptor for action recognition. This averaging operation fails to encode the sequential information present in the frame level descriptors. Based on these considerations, in this work, we address the above mentioned limitations of existing feature encoding techniques for action recognition. We achieve this by combining the advantage of the spatially selective encoding proposed in [31] with the cluster based local aggregation technique of [11].
Top-down attention recurrent VLAD encoding
This section details our proposed method named Top-down Attention Recurrent VLAD Encoder (TA-VLAD). We build our method on the recently proposed ActionVLAD [11] for action recognition, in which the authors develop an end-to-end trainable architecture that performs VLAD aggregation of convolutional features extracted from a sequence of frames using a CNN. In order for the paper to be self-contained, we briefly explain VLAD encoding for action recognition and then present the details of our model.
VLAD encoding
VLAD has been originally proposed for image retrieval application [17]. The method first generates a codebook of visual words $c_k$ from local feature vectors extracted from a set of training images using k-means clustering. Given a new image with local feature vectors $f_i$ ($i$ indexes location), the residual vectors $(f_i - c_k)$ are aggregated to form an image level descriptor $V$ using membership values $a_k(f_i)$:
$$V_k = \sum_i a_k(f_i)\,(f_i - c_k) \qquad (1)$$
In the original paper [17] the feature vectors were hand-crafted and aggregated with hard assignment, that is, $a_k(f_i)$ will be 1 if $f_i$ is closest to $c_k$ and 0 otherwise.
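The hard-assignment aggregation described above can be illustrated with a minimal, dependency-free sketch. The function name `hard_vlad` and the toy features and centers are illustrative assumptions, not taken from [17]; the sketch only shows how residuals are summed into the descriptor slot of the nearest visual word.

```python
def hard_vlad(features, centers):
    """Sum residuals f_i - c_k, assigning each feature to its nearest center."""
    dim = len(centers[0])
    descriptor = [[0.0] * dim for _ in centers]
    for f in features:
        # Squared Euclidean distance from f to every visual word.
        dists = [sum((fj - cj) ** 2 for fj, cj in zip(f, c)) for c in centers]
        k = dists.index(min(dists))  # membership is 1 only for the closest center
        for j in range(dim):
            descriptor[k][j] += f[j] - centers[k][j]
    return descriptor


# Toy example (hypothetical values): two visual words, three local features.
centers = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
print(hard_vlad(features, centers))  # [[1.0, 2.0], [-1.0, 0.0]]
```

The first two features fall on the first visual word and the third on the second, so each row of the result is the sum of residuals for one cluster.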
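The hard assignment was later replaced with a soft assignment in [1] to form an end-to-end trainable CNN-VLAD network for place recognition. Their soft-assignment membership can be conveniently described by
$$a_k(f_i) = \frac{e^{-\alpha \lVert f_i - c_k \rVert^2}}{\sum_{k'} e^{-\alpha \lVert f_i - c_{k'} \rVert^2}} = \frac{e^{W_k^\top f_i + B_k}}{\sum_{k'} e^{W_{k'}^\top f_i + B_{k'}}},$$
where $W_k = 2\alpha c_k$, $B_k = -\alpha \lVert c_k \rVert^2$, and $\alpha$ is a positive constant that controls how quickly the membership decays with the distance to the cluster center.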
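As a sanity check on the soft assignment, the sketch below assumes the NetVLAD-style form of [1], in which the membership is a softmax over negative scaled squared distances and can equivalently be computed by a linear layer with weights $2\alpha c_k$ and bias $-\alpha \lVert c_k \rVert^2$. The function names and test values are illustrative assumptions.

```python
import math


def _softmax(logits):
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_assign(f, centers, alpha):
    """Softmax over -alpha * ||f - c_k||^2 (distance form)."""
    return _softmax([-alpha * sum((fj - cj) ** 2 for fj, cj in zip(f, c))
                     for c in centers])


def soft_assign_linear(f, centers, alpha):
    """Same memberships from a linear layer: W_k = 2*alpha*c_k, B_k = -alpha*||c_k||^2."""
    logits = []
    for c in centers:
        weight = [2.0 * alpha * cj for cj in c]
        bias = -alpha * sum(cj * cj for cj in c)
        logits.append(sum(wj * fj for wj, fj in zip(weight, f)) + bias)
    return _softmax(logits)


centers = [[0.0, 0.0], [2.0, 2.0], [5.0, 1.0]]
print(soft_assign([1.0, 2.0], centers, 0.5))
print(soft_assign_linear([1.0, 2.0], centers, 0.5))
```

The two forms agree because the term $-\alpha \lVert f_i \rVert^2$ is shared by all clusters and cancels in the softmax. For a large $\alpha$ the soft assignment approaches the hard assignment.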
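Later, Girdhar et al. [11] extended this method to action recognition by summing the frame level descriptors, that is
$$V_k = \sum_t \sum_i a_k(f_{it})\,(f_{it} - c_k),$$
where $f_{it}$ is the local feature vector at location $i$ of frame $t$.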
Kim and Kim [19] propose an extension to the original VLAD approach that weights the descriptors depending on their importance. A saliency map of the image is used to determine the spatial image locations which contain discriminant information. Their method resulted in performance improvements in image retrieval. We build upon this idea of weighted aggregation of VLAD descriptors for encoding visual features for action recognition. Instead of using a saliency detection algorithm, we develop a deep neural network with a built-in weighting mechanism whose parameters are trained and which adds only minimal computation overhead to the training and inference stages of the whole pipeline. Our image descriptor is computed using equation (1) with membership values given by the soft-assignment weights multiplied by the attention weight at each spatial location.
Saliency detection involves identifying the regions of interest present in an image. In action recognition, these are the image regions where humans and the objects that they are interacting with are present. We propose to utilize the class activation map (CAM) generation technique proposed in [41] for generating a class specific saliency map of the video frames under consideration. The idea of CAMs is to project back the weights of the output layer to the convolutional feature map in the preceding layer. Let $f_{li}$ be the activation of a unit $l$ in the final convolutional layer at spatial location $i$ and let $w_{cl}$ be the weight of the output layer connecting unit $l$ to class $c$. The CAM of class $c$ at location $i$ is then $M_c(i) = \sum_l w_{cl} f_{li}$.
Fig. 5 shows the CAM obtained for some of the frames from the videos of the HMDB51 [20] dataset (second row). We used ResNet-34 pre-trained on ImageNet [24], and the class category with the highest probability is selected for the CAM computation. In the figure, the image regions where the action under consideration is taking place are activated, such as the hands in Figs. 5f and 5h. Thus, even though the CNN was pre-trained on a different dataset for a different application, the generated CAM is able to identify the salient regions present in the frames, i.e., regions that provide discriminative information for identifying the action taking place in the frame. By jointly training the CNN layers and the weights (W) associated with each of the classes using video level supervision, the network is made to develop its own representation classes that facilitate recognizing each action category.
In this section, we describe in detail the proposed action recognition technique. A block diagram of the proposed approach is shown in Fig. 2. In the proposed method, frames uniformly selected across time from the input video are sequentially applied to the network. A deep CNN, ResNet-34 pre-trained on the ImageNet dataset, extracts the features from each frame. The extracted feature tensor is then applied to the spatial attention map generation and VLAD encoding modules. The VLAD encoded feature vectors are then applied to a series of Gated Recurrent Units (GRUs) for temporal encoding. The memory of the GRU network obtained after encoding all the frames of the video is applied to a fully-connected layer for action classification. In the network, feature extraction from frames, top-down attention computation and VLAD encoding are carried out in a single forward pass.

Block diagram of TA-VLAD architecture is given in Fig. (a). We use ResNet-34 for frame-based feature extraction and attention map generation, and temporally aggregated residual encoding to predict the action class from a frame sequence. The proposed VLAD encoding and spatial attention map generation methods are shown in Figs. (b) and (c), respectively. All the layers of TA-VLAD are differentiable, enabling end-to-end training with video level supervision.
Spatial attention. The purpose of top-down attention is to assign a higher weight to the regions present in the image that are useful for discriminating one action class from another, while assigning a lower weight to those regions that possess less discriminant information. Fig. 2b illustrates the attention map generation technique. The frame level feature tensor $f$ is first used to obtain object-class scores using spatial average pooling and a fully-connected linear layer. The winning class $p$ (with the highest class score) is identified and the corresponding weights $W_p$ of the fully-connected linear layer are retained. The feature activation planes of the frame-level tensor $f$ are then weighted using $W_p$ and depth-wise average pooled to obtain a spatial map $M_p$. The top-down attention map is thus obtained from $M_p$.
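The attention map generation described above can be sketched in a few lines of plain Python. The steps up to the spatial map follow the text (spatial average pooling, linear class scores, winning class, weighting of the feature planes, depth-wise averaging). The final softmax over spatial locations is an assumption made here so that the weights are positive and sum to one; the function name and toy tensors are hypothetical.

```python
import math


def top_down_attention(feat, fc_weights, fc_bias):
    """feat: D feature planes, each a list of N activations (H*W flattened).
    fc_weights: C x D matrix, fc_bias: C values.
    Returns the winning class index and an attention weight per location."""
    depth, num_locations = len(feat), len(feat[0])
    # Spatial average pooling followed by the linear classifier.
    pooled = [sum(plane) / num_locations for plane in feat]
    scores = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(fc_weights, fc_bias)]
    winner = scores.index(max(scores))
    w_p = fc_weights[winner]
    # Weight every feature plane by W_p and average over depth.
    cam = [sum(w_p[l] * feat[l][i] for l in range(depth)) / depth
           for i in range(num_locations)]
    # ASSUMPTION: normalise the map with a softmax over spatial locations.
    top = max(cam)
    exps = [math.exp(v - top) for v in cam]
    total = sum(exps)
    return winner, [e / total for e in exps]


# Toy tensor: 2 planes over a 2x2 grid, 2 object classes.
feat = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
winner, attention = top_down_attention(feat, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(winner, attention)
```

In this toy case the second class wins, and the location where the second plane fires receives the largest attention weight.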
VLAD encoding. The proposed VLAD encoding method is illustrated in Fig. 2c. The frame level feature tensor $f$ obtained from the CNN is applied to a convolutional layer followed by a softmax operation to generate the soft-assignment weights ($a_k$) explained in Section 3.1. The weights and bias of the convolutional layer are initialized with $W_k$ and $B_k$, respectively, as given in Section 3.1. The soft-assignment weights are then multiplied with the attention map and with the difference between the feature tensor $f$ and the cluster centers $c_k$ to obtain $K$ tensors. We finally apply spatial sum pooling to obtain a frame descriptor composed of $K$ vectors, each associated with one of the cluster centers.
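A minimal sketch of the attention-weighted encoding of a single frame follows. It assumes the soft assignment is a softmax over negative scaled squared distances to the cluster centers and that the attention map provides one scalar weight per spatial location; `attended_vlad` and the toy inputs are illustrative names and values, not the released implementation.

```python
import math


def attended_vlad(features, attention, centers, alpha):
    """features: N local feature vectors, attention: N scalar weights,
    centers: K cluster centers. Returns K residual vectors (the frame descriptor)."""
    num_clusters, dim = len(centers), len(centers[0])
    descriptor = [[0.0] * dim for _ in range(num_clusters)]
    for f, m_i in zip(features, attention):
        # Soft-assignment weights of this location over the K clusters.
        logits = [-alpha * sum((fj - cj) ** 2 for fj, cj in zip(f, c))
                  for c in centers]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        for k in range(num_clusters):
            weight = m_i * exps[k] / total  # attention times soft assignment
            for j in range(dim):
                # Spatial sum pooling of the weighted residuals.
                descriptor[k][j] += weight * (f[j] - centers[k][j])
    return descriptor


centers = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [9.0, 10.0]]
# The second location has zero attention, so it does not contribute.
print(attended_vlad(features, [1.0, 0.0], centers, alpha=1000.0))
```

Setting the attention of a location to zero removes its residuals from every cluster, which is the spatially selective behaviour the paragraph describes.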
Temporal aggregation. Girdhar et al. [11] performed summation across the temporal dimension (summation of image descriptors obtained from all frames) for generating the video descriptor. This simple summation operation is not capable of encoding the temporal evolution of the frame level features as it does not take into account the temporal ordering of frames. Following previous works [7, 29], we choose to apply an RNN for this purpose. Chung et al. [6] have found that LSTM [14] and GRU [5] perform comparably when evaluated on the tasks of polyphonic music modeling and speech signal modeling. With GRU having the added advantage of fewer parameters, we decided to use GRU in order to effectively encode how the frame level features evolve as an action progresses. The commonly followed approach is to concatenate the K vectors composing the VLAD descriptor and apply the result to a GRU layer. Following this approach will result in losing the structure of the local feature patterns encoded within the clusters, since the gates in the GRU layer combine information from multiple clusters during the encoding stage. The alternative is to use K GRU layers, each encoding the residual information present in one of the K clusters. The disadvantage of this approach is the huge increase in the number of parameters of the resulting model, leading to increased training complexity and propensity to overfitting. Taking these problems into consideration, we choose to use K GRU modules with shared parameters, thereby having a separate memory state for each cluster with only a small increase in the number of effective parameters. This enables the network to encode how the features associated with each of the clusters change with time. Since the clusters contain information about local features, the network is thus capable of tracking the local feature transformations occurring across the video frames.
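The idea of K memory states driven by one shared set of GRU parameters can be sketched as follows. This is a dependency-free illustration under stated assumptions: the `GRUCell` class uses the formulation of Cho et al. with a single bias per gate and random weights, and the dimensions are toy values rather than the 512-dimensional features and 256 hidden units used in the paper.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Minimal GRU cell (update gate z, reset gate r, candidate n)."""

    def __init__(self, in_dim, hid_dim, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        self.W = {g: mat(hid_dim, in_dim) for g in "zrn"}
        self.U = {g: mat(hid_dim, hid_dim) for g in "zrn"}
        self.b = {g: [0.0] * hid_dim for g in "zrn"}

    def step(self, x, h):
        def pre(gate, state):
            return [a + c + d for a, c, d in
                    zip(matvec(self.W[gate], x), matvec(self.U[gate], state), self.b[gate])]
        z = [sigmoid(v) for v in pre("z", h)]
        r = [sigmoid(v) for v in pre("r", h)]
        n = [math.tanh(v) for v in pre("n", [ri * hi for ri, hi in zip(r, h)])]
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def encode_video(frames, cell, hid_dim):
    """frames: list over time, each holding K cluster vectors.
    One shared cell updates a separate memory state per cluster."""
    num_clusters = len(frames[0])
    states = [[0.0] * hid_dim for _ in range(num_clusters)]
    for frame in frames:
        states = [cell.step(frame[k], states[k]) for k in range(num_clusters)]
    return [v for state in states for v in state]  # concatenate final states


cell = GRUCell(in_dim=5, hid_dim=3, rng=random.Random(0))
frames = [[[0.1 * (t + 1)] * 5 for _ in range(4)] for t in range(3)]  # T=3, K=4
print(len(encode_video(frames, cell, hid_dim=3)))  # 12 = K * hid_dim
```

Because the cell is shared, the parameter count does not depend on K, while each cluster still keeps its own memory state across frames.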
Once all the video frames are processed, we concatenate the final memory states of all the GRU modules to generate the temporally encoded descriptor. Intra-normalization is then applied as proposed in [2], followed by a reshaping operation that flattens the descriptor into a vector and L2-normalization to get the final feature descriptor of the input video. This is followed by a single fully-connected layer for classification.
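The two normalization steps can be sketched as below, assuming that intra-normalization means L2-normalizing each cluster's vector independently before the flattened descriptor is L2-normalized as a whole. The function name, the `eps` guard against division by zero and the toy values are illustrative assumptions.

```python
import math


def normalize_descriptor(cluster_vectors, eps=1e-12):
    """Intra-normalize each of the K vectors, flatten, then L2-normalize."""
    normalized = []
    for vec in cluster_vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        normalized.append([x / max(norm, eps) for x in vec])  # per-cluster L2 norm
    flat = [x for vec in normalized for x in vec]             # reshape to one vector
    norm = math.sqrt(sum(x * x for x in flat))
    return [x / max(norm, eps) for x in flat]                 # global L2 norm


descriptor = normalize_descriptor([[3.0, 4.0], [0.0, 5.0]])
print(descriptor)
```

After intra-normalization every non-empty cluster contributes equally to the final vector, regardless of how large its accumulated state is.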
The proposed method is tested on two popular action recognition datasets, namely, HMDB51 [20] and UCF101 [27] and compared against state-of-the-art deep learning techniques. HMDB51 dataset consists of videos divided into 51 action classes collected from movies and YouTube. The dataset is composed of 6849 video clips. UCF101 dataset consists of 13320 videos collected from YouTube and has 101 action categories. For both the datasets, three train/test splits are provided by the dataset developers. Following standard practice, the recognition performance on the datasets is reported as the average of the recognition accuracy obtained on the three splits.
Implementation details
As mentioned in Section 3, we use ResNet-34 for feature extraction and attention map generation. We use the feature tensor obtained from the final convolutional layer of ResNet-34 as the input features. The proposed method is implemented using the PyTorch framework. The method consists of the following steps:
Pre-processing, in which we extract convolutional features from random frames from the videos present in the training set, and perform k-means clustering on the extracted features to obtain the cluster centers;
Stage 1 training, in which only the fully connected classifier layer and GRU network are trained while all the other parameters remain fixed;
Stage 2 training, in which all the convolutional layers in the final layer of ResNet-34 (layer 4) and fully-connected layer of ResNet-34, VLAD layers, GRU network and the classifier are trained.
Stage 1 training acts as an initialization of the classifier and the GRU network while stage 2 training optimizes the features extracted from the frames, the clusters, and the attention to specialize to the given action recognition task.
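The pre-processing step, which obtains the cluster centers by k-means, can be illustrated with a small sketch of Lloyd's algorithm. This is a toy version under stated assumptions: the initial centers are random samples of the data, the iteration count is fixed, and the two-dimensional points stand in for the convolutional features; none of these choices is taken from the paper.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Toy stand-in for sampled convolutional features: two separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
print(sorted(kmeans(points, k=2)))
```

The resulting centers would then serve as the initial $c_k$ of the VLAD layer, which are refined further during stage 2 training.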
The number of clusters and the memory size of the GRU network are chosen as 64 and 256 respectively, following a detailed hyper-parameter search study which is explained in Section 4.2. The network is trained for 30 epochs with a learning rate of $10^{-2}$ and $10^{-4}$ for stages 1 and 2, respectively. In stage 1, the learning rate is decayed by a factor of 0.5 after every 5 epochs while for stage 2 the learning rate is halved every 8 epochs. The network is trained using the Adam optimization algorithm with a batch size of 16. We also apply dropout at the final fully-connected layer at a rate of 0.5. The value of $\alpha$, explained in Section 3.1, is chosen as 1000 following the proposal by Girdhar et al. [11].
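The step decay schedule of the two training stages can be written compactly. The sketch assumes zero-based epoch indices and that the decay is applied at the start of every 5th (stage 1) or 8th (stage 2) epoch; the function name is hypothetical.

```python
def learning_rate(stage, epoch):
    """Learning rate at a zero-based epoch for training stage 1 or 2."""
    base, step = {1: (1e-2, 5), 2: (1e-4, 8)}[stage]
    return base * 0.5 ** (epoch // step)  # halve the rate once per step


for epoch in (0, 5, 10, 29):
    print(epoch, learning_rate(1, epoch), learning_rate(2, epoch))
```

Over 30 epochs, stage 1 halves the rate five times and stage 2 three times.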
We used 25 frames from each video clip, uniformly sampled across time, during training and evaluation. The corner cropping and scale jittering techniques proposed in [36] are used as data augmentation during training. The center crop of each frame is used for determining the action class during evaluation.
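One possible way to pick 25 uniformly spaced frames is sketched below. The exact sampling rule is not specified in the text, so taking the center of each of 25 equal temporal segments is an assumption, as is the function name.

```python
def uniform_frame_indices(num_frames, num_samples=25):
    """Indices of frames at the centers of num_samples equal temporal segments."""
    if num_frames <= 0:
        raise ValueError("the video must contain at least one frame")
    return [min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
            for i in range(num_samples)]


print(uniform_frame_indices(100))
```

For clips shorter than 25 frames this rule repeats some indices, so the network always receives a sequence of the same length.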
Results and discussion
We first performed experiments on the first split of the HMDB51 dataset to determine suitable values for the number of clusters and the memory size of the GRU network. The results of the analysis are shown in Fig. 3. From the graph, it can be seen that the best performance is obtained with a configuration having 64 clusters and 256 hidden units in the GRU. We thus fixed the number of clusters as 64 and the memory size of the GRU network as 256.

Results of the experiments carried out on the first split of HMDB51 dataset to determine suitable values for the number of clusters and memory size of GRU.
Bidirectional RNNs have been proposed for applications involving sequences such as gesture recognition [39], speech recognition [12], named entity recognition [4] and machine translation [32]. They add a second RNN layer to which the input sequence is applied in the reverse direction. Inspired by the performance improvement obtained in the above mentioned works, we also evaluated our model after adding a second GRU layer to which the input is applied in the reverse direction. The memory states of the two GRUs are concatenated and then applied to the classification layer. We observed an improvement of 1.2 percentage points over the model with a single GRU encoding the sequence in the forward direction. However, the improvement observed was not that significant on the other splits of HMDB51 and on the UCF101 dataset (Table 1). This can be explained by the nature of the data. In the case of applications involving language such as machine translation or speech recognition, the input at a particular time step may or may not depend on the input at a later time step. By encoding the sequence also in the reverse direction, the network is able to look into the future and hence improve its performance. Also, in the proposed method, the output class is predicted from the final memory state of the GRU after encoding all the input frames, as opposed to the above mentioned works, where an output is predicted at each time step. By using the final memory state, the network is able to encode all the information from the video sequence before making the prediction. On top of this, the additional GRU layer increases the number of parameters of the network, thereby increasing the memory footprint and the propensity to overfitting. We also evaluated the proposed model with a bidirectional GRU having a memory size of 128, in order to have the same number of parameters as TA-VLAD-GRU, and obtained a recognition accuracy of 56%.
Comparison of the proposed approach against state-of-the-art methods on HMDB51 and UCF101 datasets (recognition accuracy in %)
For validating that training the layers of the backbone network used for CAM computation is useful, we decoupled the ResNet-34 used for generating the attention map from the rest of the trainable layers. With the attention generation branch not fine-tuned, a recognition accuracy of 55.3% is obtained on split 1 of HMDB51 dataset, as opposed to 56.1% with the fully trained TA-VLAD. This shows that joint training, as explained in Section 4.1, enables the network to improve the prior knowledge encoded within it, i.e., the relevance of objects and their locations in relation to the action performed. This can be interpreted from the example images shown in Fig. 5. From the figure, we can see that the network learns to attend to regions that contain discernible information about the action class. In addition, the network has adapted its attention for both actions containing objects (5b, 5c and 5d) as well as actions that are performed in the absence of objects (5a, 5e and 5g).
We also evaluated the performance of the network with a single GRU taking in the flattened VLAD vector of dimension $1 \times (K \cdot 512)$ as input and obtained a recognition accuracy of 47.6%. With this configuration, the locally aggregated descriptors provided by VLAD encoding interact in the recurrent updating of a joint GRU memory state, thereby bypassing the learned latent structure (clustering) of the feature space. Indeed, a reduction of 8.5 percentage points in recognition accuracy is observed, validating our hypothesis that the flattening operation prevents the network from encoding the local spatial transformations effectively, as explained in Section 3.3. We also analyze the opposite model configuration, i.e., learning to track a memory state for each feature space cluster independently. When the shared GRU in the proposed technique is replaced with K separate GRUs, the recognition accuracy was found to be 47.8%, that is, a reduction of 8.3 percentage points. It is worth noting that the RNN layer parameter count is increased by a factor of K (K = 64 in our analysis), which may increase the network's propensity to overfitting. In TA-VLAD instead, all K aggregated descriptors contribute to the optimization of the single set of GRU parameters, and through parameter sharing the memory tracking observed during training for one cluster can be leveraged for memory tracking at any of the other clusters. This analysis shows that the proposed approach of using K GRU layers with shared parameters results in the most effective temporal aggregation of the frame level descriptors.
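To make the difference in model size concrete, the parameter counts of the three recurrent configurations can be estimated with a short calculation. The sketch assumes a single-layer GRU with one bias vector per gate, 512-dimensional cluster vectors, and 256 hidden units in all three configurations; the hidden size of the flattened-input variant is not stated in the text and is an assumption here. PyTorch's `nn.GRU` keeps two bias vectors per gate, so its counts are slightly higher.

```python
def gru_param_count(input_dim, hidden_dim):
    """Parameters of a single-layer GRU with one bias vector per gate (3 gates)."""
    return 3 * (hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim)


K, D, H = 64, 512, 256
shared = gru_param_count(D, H)         # one GRU shared by all K clusters
separate = K * gru_param_count(D, H)   # K independent GRUs
flattened = gru_param_count(K * D, H)  # single GRU on the flattened descriptor
print(shared, separate, flattened)     # 590592 37797888 25363200
```

Under these assumptions, sharing the parameters keeps the recurrent layer about 64 times smaller than K separate GRUs and about 43 times smaller than a GRU on the flattened descriptor.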
Next, we analyzed the improvement obtained by the proposed method over the baseline, ActionVLAD [11]; the results are shown in Fig. 4. For a fair comparison, we adapted the ActionVLAD implementation to use ResNet-34, since this is a deeper and more powerful CNN for frame level feature extraction than the VGG-16 used in [11]. We then evaluated the performance of the network after replacing temporal summation with the proposed temporal aggregation method consisting of a GRU network (VLAD-GRU). This resulted in an improvement of 1.6 percentage points (53.1% vs 54.7%), which supports our hypothesis that preserving the temporal ordering of frame level feature descriptors is important for action recognition. In order to verify that spatially selective encoding of features using spatial attention is effective, we applied the proposed attention mechanism on ActionVLAD (TA-ActionVLAD) and evaluated the performance. We achieved an improvement of 1.7 percentage points (54.8%) over the ActionVLAD baseline, indicating that not all spatial regions possess discriminant information for the action recognition task. In the next step, we evaluated the performance of the proposed approach, TA-VLAD, which combines VLAD encoding with the proposed top-down attention mechanism and GRU sequence learning, and obtained an improvement of 3 percentage points (56.1%) over ActionVLAD.

Ablation analysis carried out on the first split of HMDB51.

Class activation maps (CAM) for some frames in HMDB51 dataset with action classes containing objects (top) and without objects (bottom). Top row: original frames, second row: CAM obtained using ResNet-34 trained on ImageNet, bottom row: CAM obtained using the network trained for action recognition (after stage 2 training).
Table 1 compares the proposed approach, TA-VLAD, with state-of-the-art techniques. The results are the average of the recognition accuracy in % obtained over all three splits. For fair comparison, we report the performance of the compared methods using RGB frames only and consider those methods that are based on deep learning and use ImageNet for pre-training. On the three splits of HMDB51, we obtained recognition accuracies of 57.3%, 53.1% and 54.4%, and on the three splits of UCF101, 85.7%, 85.5% and 85.2%. The first block in the table shows methods which do not use attention and the second block shows those that use attention for selecting the relevant regions present in the input frames. From the table, it can be seen that the proposed approach performs better than the state-of-the-art deep learning methods for action recognition on RGB image sequences.
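The averaging over splits used for the reported figures is a plain arithmetic mean. The short calculation below applies it to the per-split accuracies quoted above; the variable names are illustrative.

```python
from statistics import mean

# Per-split recognition accuracies (%) quoted in the text.
hmdb51 = [57.3, 53.1, 54.4]
ucf101 = [85.7, 85.5, 85.2]

hmdb51_mean = mean(hmdb51)
ucf101_mean = mean(ucf101)
print(f"HMDB51: {hmdb51_mean:.1f}%  UCF101: {ucf101_mean:.1f}%")
```

The spread across splits is larger on HMDB51 (about 4 percentage points) than on UCF101 (0.5 percentage points), which is one reason the three-split average is the standard reporting protocol.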
Conclusions
We presented a novel end-to-end trainable deep neural network architecture for video action recognition which makes use of a top-down attention mechanism for weighting spatial regions that possess discriminant information regarding the action class. For this, we used class activation maps generated from a network pre-trained on ImageNet. Experiments show that prior information about the objects present in the scene, applied as top-down attention, improves the recognition performance of the network. We also developed a temporal aggregation scheme that encodes frame level features into a fixed length video descriptor using a GRU network that inherits the cluster structure of the feature space. The boost in performance obtained shows the importance of considering the temporal ordering of video frames during the feature encoding process. The proposed method is tested on two of the most popular action recognition datasets and achieves state-of-the-art performance in terms of recognition accuracy. In addition, we found that the network is able to improve its prior knowledge about the scene when the top-down attention generation network is trained jointly with the video descriptor generation network. Existing approaches improve performance by using optical flow as a second modality to explicitly encode motion changes. As future work, we will explore the possibility of adding an attention mechanism to the flow modality to further improve the performance of our method.
