Abstract
Many micro-video applications, such as personalized location recommendation and micro-video verification, can benefit greatly from venue information. Most existing works focus on integrating information from multiple modalities for accurate venue category recognition, and making full use of the different modalities is important. However, performance may be limited when the acoustic modality or the textual description is missing from an uploaded micro-video. Therefore, this paper explores the visual modality as the only modality, given its rich and indispensable semantic information. To this end, a hybrid-attention and frame difference enhanced network (HAFDN) is proposed to generate a comprehensive venue representation. The network contains two parallel branches: a content branch and a motion branch. In the content branch, a domain-adaptive CNN model combined with the temporal shift module (TSM) is employed to extract discriminative visual features. Then, a novel hybrid attention module (HAM) is introduced to enhance the extracted features via three attention mechanisms: channel attention, local spatial attention and global spatial attention capture salient visual information from different views. In addition, a convolutional Long Short-Term Memory (convLSTM) is applied after HAM to better encode long-range spatial-temporal dependencies. A difference-enhanced module, parallel to HAM, is devised to learn the content variations among adjacent frames, which are usually ignored in prior works. In the motion branch, 3D-CNNs and LSTM are used to capture movement variation as a supplement to the content branch in a different form. Finally, the features from the two branches are fused to generate robust video-level representations for predicting venue categories. Extensive experimental results on public datasets verify the effectiveness of the proposed micro-video venue recognition scheme.
The source code is available at https://github.com/hs8945/HAFDN.
Introduction
Recently, the rapid development of multimedia platforms such as Instagram, Twitter, TikTok, Vine, and WeChat has generated a large number of micro-videos. These popular platforms let users conveniently share their stories on the Internet from their phones. Meanwhile, various related applications, including information retrieval [1] and micro-video recommendation [2], have emerged. Micro-video venue recognition [3–5] is an important branch of micro-video understanding. The task can be summarized as predicting a specific venue category (e.g., Garden or Lake) for a given micro-video. Information related to venue categories can be further mined to infer the personal habits or preferences of a user, such as flower preference or food taste.
Although billions of micro-videos are generated on social platforms every day, few users label their uploaded videos with venue information. According to the statistics in [6], only 1.22% of two million micro-videos were labeled with venue information. To compensate for this shortage, Zhang et al. [6] first built a large-scale micro-video dataset with venue information from the Vine platform and proposed a tree-guided multi-task multi-modal method to predict venue categories. Then, Nie et al. [7] proposed a deep transfer model that enhances the acoustic modality with external sound knowledge to improve prediction performance. Wei et al. [3] explored the consistency and complementarity among different modalities to improve the expressiveness of each modality. However, most micro-videos from social platforms may lack textual descriptions or have noisy acoustic information, and the performance and universality of the related methods [3, 7] are limited under such conditions.
To tackle the above issue, this work re-evaluates and focuses on the visual information in micro-videos. Since the visual modality is indispensable in a micro-video, its information is vital for inferring venue categories, and the quality of the visual features extracted by the backbone network has a large impact on the final performance. Many works have verified the excellent performance of CNNs [8–15], which provide a convenient starting point for our task. Meanwhile, the temporal information of a micro-video is crucial in micro-video venue recognition [4, 17]. In existing methods, temporal information is captured by a sequence model in a separate stage rather than in the feature extraction stage, which increases the complexity of the framework. Inspired by the temporal shift module (TSM) [18], and considering the universality and scale of the model, a domain-adaptive network named ResNet-DT is proposed based on ResNet and TSM. The model extracts the temporal information of micro-videos in the feature extraction stage while improving feature quality in an efficient way. In particular, ResNet50-DT achieves approximately the same performance as ResNet152 with fewer than half of the parameters.
To better utilize the visual information in micro-videos, some works [2, 19] have learned visual features from global and local perspectives, such as scene-related or object-related information, to enrich the expressiveness of the visual modality. However, they simply obtain the features of each perspective from a different pretrained network without any enhancement. In this work, venue-related and object-related features are emphasized simultaneously by introducing different attention mechanisms. Hence, a novel hybrid-attention module (HAM) is designed to enhance the visual semantic information. Three different attention mechanisms [20, 21] are used to highlight the valuable parts of the visual features. Specifically, venue-related information is enhanced by a channel attention (CA) mechanism, and the objects or people closely related to the venue are emphasized by local spatial attention (LSA). Global spatial attention (GSA) serves as an auxiliary to these two mechanisms: it models the pair-wise relationship between any two positions in the spatial dimension to further enhance the features. HAM thus generates an effective description of the key features.
Furthermore, observation of micro-video data reveals differences between adjacent frames, as shown in Fig. 1. Intuitively, performance benefits directly from such dynamic variations: an effective temporal structure encourages the model to focus on moving objects or people, which are useful for our task. Even within the same venue, the brightness and the related objects may change noticeably. Based on these analyses, a frame difference enhanced module (DEM) is designed to capture the dynamic variations of consecutive frames. DEM uses the differences between adjacent frame-level features as a guide to focus automatically on the changed features. Such interactions are usually ignored in related works [1, 22]. In DEM, the feature maps are squeezed at the channel level to approximate local temporal variations, which reduces model complexity and ensures efficiency.

Fig. 1. Illustration of content variations between adjacent frames. The samples are Nos. 1434, 1435 and 1436 of the Micro_scene10 dataset, all with the Garden label. The key frames are taken from the 2nd to 4th seconds of each micro-video by extracting 2 key frames per second.
Based on the above, a Hybrid-Attention and Frame Difference enhanced Network (HAFDN) is proposed to predict micro-video venue categories. As illustrated in Fig. 2, HAFDN has two main branches. (1) Content branch. A convolutional network named ResNet-DT (shown in Fig. 3) extracts robust visual features with temporal information. The HAM and DEM modules are devised to enhance the extracted features from different perspectives. In HAM (shown in Fig. 4), three kinds of enhanced features are obtained via different attention mechanisms; convLSTM follows to learn long-term temporal information without destroying the spatial structure. In DEM (illustrated in Fig. 5), the differences among adjacent frames are captured after the feature extraction stage, in parallel with HAM. (2) Motion branch. A 3D-CNN (ResNeXt101 [23]) is employed to extract basic motion features, followed by an LSTM that learns the motion changes of frames in another form and enriches the expressiveness of the temporal information. Finally, two fully-connected layers with a ReLU non-linearity in between fuse the features from the two branches and output discriminative representations for prediction. Extensive experiments demonstrate that the proposed HAFDN model achieves state-of-the-art Micro-F1 and Macro-F1 scores on the Micro_scene10 dataset and competitive results on the YUP++ dataset.

Fig. 2. The framework of the proposed method. The HAFDN model contains a content branch and a motion branch. The content branch is enhanced by HAM and DEM in parallel after the feature extraction stage. ConvLSTM is applied after HAM, and LSTM is applied in the motion branch. Two fully-connected layers form the fusion stage, which fuses the features from the two branches for category prediction.

Fig. 3. The architecture of (a) ResNet-DT and (b) TSM. ResNet-DT adjusts the input stem with three convolution layers and improves the downsample block by adding the TSM module (red box) in path A and average pooling (yellow box) in path B. Panel (b) illustrates the temporal shift process with a channel shift ratio of 0.25. The kernel size, number of output channels and stride (default 1) of the convolution and pooling layers are given in the figure.

Fig. 4. The framework of HAM, which consists of CA, LSA, GSA and a convolution layer. The kernel size and stride of the convolution layer and the shapes of the feature maps are shown in the figure. Co denotes concatenation along the channel dimension.

Fig. 5. The framework of DEM is shown on the left. The details of Diff and the corresponding output are illustrated in the blue box. ⨀ denotes element-wise multiplication and ⊖ denotes element-wise subtraction. The kernel size (default 1 × 1), stride (set to 1) and number of output channels of convd1, convd2 and convd3 are the same, except that a sigmoid function is used in convd3. The last frame is simply copied to complete the output sequence.
The main contributions of this work are summarized as follows:
(1) A neural network named ResNet-DT is used as the backbone to extract robust visual features from micro-videos. The model is based on a domain-adaptive ResNet and the temporal shift module (TSM), and improves over other backbones with fewer model parameters.
(2) The HAM and DEM modules are designed to enhance the extracted visual features. Specifically, three different attention mechanisms are used in HAM to learn discriminative semantic features efficiently. The dynamic venue variations between adjacent frames, which are usually ignored in other works, are modeled by DEM to supplement HAM.
(3) A neural network named HAFDN is proposed to generate comprehensive representations for micro-video venue recognition. Extensive experiments verify the effectiveness of the proposed model: HAFDN achieves state-of-the-art performance on the Micro_scene10 dataset and competitive performance on the YUP++ dataset. Visualization results are also shown to increase the interpretability of the method.
In the following, related works are reviewed in Section 2. The proposed micro-video venue recognition method is described in detail in Section 3. Section 4 presents the experimental evaluations and analyses that verify the effectiveness of the proposed model, followed by the conclusion in Section 5.
Visual backbone of micro-video venue
With the development of deep learning, robust and general-purpose networks are constantly emerging, and many CNNs have been successfully applied to micro-video venue recognition as the visual feature extractor. In 2016, Zhang et al. [6] simply used AlexNet as the backbone to extract micro-video visual features. Instead of a shallow neural network, two differently pretrained VGG16 networks are used in [2] to extract venue-related and object-related features; the method combines the two kinds of features from different platforms within a hierarchical structure to yield discriminative visual features. More recently, Guo et al. [4] proposed a two-branch structure with a Places365-pretrained [24] VGG16 to retain the consistency of intra-class samples for better performance. Wei et al. [3] introduced a multi-modal cooperative learning model to explore the relationships among different modalities, using a standard ResNet as the visual feature extractor. Zhang et al. [5] proposed an attention-enhanced two-stream fusion network for video venue prediction, in which several CNN backbones are simply compared without any improvement. The backbone and the visual features have not received enough attention in the above methods, and temporal information learning is ignored in the visual feature extraction stage. In contrast, this work adds temporal information learning and several structural adjustments to the standard ResNet, yielding a domain-adaptive visual backbone named ResNet-DT that extracts robust features for micro-video venue classification.
Micro-video venue recognition
In early micro-video venue recognition studies, many works [25, 26] combined hand-crafted spatial-temporal features, such as the histogram of optical flow (HOF) [27] and GIST [28], with machine learning classifiers. In such methods, a micro-video is regarded as a set of still images, so the temporal information between frames and discriminative video-level features cannot be modeled well. The large number of micro-videos and robust deep learning methods have given new momentum to the venue recognition task. LSTM [29] is one of the most popular methods for learning long-term temporal information. In the EASTERN [16] method, the features of different modalities are extracted by different feature extractors and processed as sequences. Liu et al. [30] designed a fully-connected gate module followed by LSTM to enhance temporal features. Nie et al. [7] used externally trained sound knowledge to improve the extracted audio features and the performance of the model. Besides these methods, convLSTM [31] was proposed to learn long-term spatial-temporal relationships among feature maps while maintaining their spatial structure. NNeXtVLAD [17] is a normalized variant of NeXtVLAD [32] that generates robust video-level features of a micro-video by considering all key frames, rather than a single frame, in a sparse way. Guo et al. [4] used an attention mechanism to filter redundant frame information in the visual modality and a two-branch structure to retain the consistency of intra-class samples. The HA-TSFN method [5] enhances visual features with a simple hybrid attention module that consists of two different attention mechanisms, but it ignores the pair-wise relationships among feature maps in the spatial dimension.
In general, the methods mentioned above ignore one or more aspects of the visual features in micro-videos, such as salient spatial areas, enhanced temporal information, variations between adjacent frames, and pair-wise relationships among feature maps. These aspects are crucial for effectively inferring venue categories. To address these deficiencies, a novel hybrid attention module consisting of three different mechanisms is designed to enrich the expressiveness of visual features from different perspectives, and a frame difference enhanced module is devised to automatically learn, during training, the variation between adjacent frames that is usually ignored in prior work. Furthermore, a two-branch [33] framework, HAFDN, is proposed to generate the video-level representation for micro-video venue category recognition.
Proposed method
This section describes the proposed HAFDN model, which is shown in Fig. 2. The model combines two branches. (1) Content branch. This branch is designed to generate and enhance robust visual features via the ResNet50-DT, HAM and DEM modules; the interaction of temporal information is further captured via convLSTM. Each module is presented in detail in the corresponding part. (2) Motion branch. This branch supplements the content information with motion features and is simply realized by a 3D-CNN and an LSTM. Finally, the features from the two branches are concatenated and fed to two fully-connected layers, which fuse them for recognizing the venue category.
Content branch
The visual modality is a necessary part of every micro-video and is usually presented as a sequence of continuous frames. Each micro-video has about 30 frames per second, so considering all frames would be redundant. Hence, similar to ACSL [4], key frames are extracted at equal intervals to represent a micro-video. Specifically, one key frame is extracted per second to preserve relative continuity. The key frame is also called an intra picture (I frame), which approximately represents the corresponding information. For convenience, a micro-video can be described in the following form:
The backbone is very important for the performance of our task; in other words, a better backbone usually means better performance. ResNet50-DT, whose details are shown in Fig. 3, is used in the feature extraction stage. The backbone is an improved network based on ResNet and the temporal shift module (TSM), and the output of its last convolution layer is treated as the extracted feature maps.
Compared to the standard ResNet structure, the improvements are threefold: (1) Input stem. This improvement was first proposed in Inception-V2 [34] and is used in successful models such as SENet [35] and PSPNet [36]. A 7×7 convolution is 5.4 times more expensive than a 3×3 convolution [37], since its kernel has 49 weights instead of 9 (49/9 ≈ 5.4). So in the input stem, the 7×7 convolution is replaced with three stacked 3×3 convolution layers, as shown in Fig. 3 (a). The first and second 3×3 convolution layers have 32 output channels, with strides of 2 and 1, respectively, while the last 3×3 convolution layer has 64 output channels.
(2) Downsample block. In the original ResNet downsample block, Path A begins with a 1×1 convolution layer with a stride of 2, which ignores three quarters of the input feature map. The strides of the first and second convolution layers are exchanged to avoid this information loss (shown in the black box of the downsample block in Fig. 3). In Path B, an average pooling layer is added before the 1×1 convolution for the same purpose. Apart from these changes, the settings are the same as in the original downsample block.
(3) Temporal shift. Inspired by temporal learning in micro-videos [16, 38], the temporal shift module is inserted into ResNet to build ResNet-DT. The temporal information between sampled frames is captured in the feature extraction stage by shifting part of the channels along the temporal dimension and padding the vacated positions with zeros. The temporal shift process is shown in Fig. 3 (b), in which only the T and C dimensions of the feature map are drawn for easy understanding. Each row represents a frame and each column represents a channel of the same feature map. In the experiments, following [18], the shift ratio is set to 0.25 of all channels. In this way, the temporal information of three adjacent frames can be modeled. The TSM module is inserted into each 'Bottleneck' block of ResNet. Then, the output of ResNet-DT is represented as
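As a minimal sketch of the temporal shift in item (3), the following pure-Python example shifts a fraction of the channels of a T × C feature layout, matching the simplified view of Fig. 3 (b). The function name `temporal_shift`, the even split of the shifted channels between the two temporal directions, and the toy input are assumptions made for illustration; they are not taken from the released code.

```python
def temporal_shift(frames, shift_ratio=0.25):
    """Shift a fraction of channels along time; frames is a T x C list of lists."""
    num_frames = len(frames)
    num_channels = len(frames[0])
    # Assumption: the shifted channels are split evenly between both directions.
    fold = int(num_channels * shift_ratio) // 2
    shifted = [list(row) for row in frames]
    for t in range(num_frames):
        for c in range(fold):
            # First group: frame t receives channel c of frame t + 1 (zero at the end).
            shifted[t][c] = frames[t + 1][c] if t + 1 < num_frames else 0.0
        for c in range(fold, 2 * fold):
            # Second group: frame t receives channel c of frame t - 1 (zero at the start).
            shifted[t][c] = frames[t - 1][c] if t >= 1 else 0.0
    return shifted


# Toy input: 3 frames with 8 channels; each value encodes 10 * frame + channel.
frames = [[10.0 * t + c for c in range(8)] for t in range(3)]
out = temporal_shift(frames)
for row in out:
    print(row)
```

After the shift, every frame mixes one channel from the next frame, one channel from the previous frame, and its own remaining channels, which is how three adjacent frames come to interact without any extra parameters.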
Hybrid-attention module (HAM)
The hybrid-attention module consists of three attention mechanisms and a convolution layer, as shown in Fig. 4 (a). The three attention sub-modules are named channel attention (CA), local spatial attention (LSA), and global spatial attention (GSA).
(1) Channel attention, shown in Fig. 4 (b), is mainly inspired by SENet [35]; it makes full use of contextual information and suppresses useless scene-related features. Specifically, a global average pooling (GAP) operation is applied directly to the input feature maps $X_t$. The GAP operation compresses the spatial dimensions of the feature maps and generates a channel-wise tensor
(2) Local spatial attention, illustrated in Fig. 4 (c), focuses on the spatial dimension, in contrast to channel attention, to highlight the salient areas of the feature maps. The channel size of the input $X_t$ is transformed from C to 1 by two convolution layers. The operation is represented as follows:
(3) Global spatial attention, shown in Fig. 4 (d), captures the long-range dependencies between any two positions of the feature maps to supplement the other two attention modules, similar to the Non-local module [39]. The most important and relevant features are highlighted by this module. GSA is implemented by the following equation:
Finally, the three types of enhanced features are concatenated along the channel dimension, and a simple convolution layer with a 1×1 kernel is applied to the concatenated features to fuse the information of the different attention types. The output of the HAM module is described as follows:
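The three attention types and their fusion can be sketched in pure Python on a small C × N feature map, where N = H × W flattened positions. This is only an illustration under several assumptions: the weights are arbitrary constants, the ReLU and the intermediate width in LSA are guesses, and GSA uses plain dot-product similarity with a residual connection rather than the learned embeddings of a full Non-local block. All function and parameter names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv1x1(x, weight):
    """1x1 convolution on a C_in x N map; weight is C_out x C_in."""
    n = len(x[0])
    return [[sum(w * x[c][p] for c, w in enumerate(row)) for p in range(n)]
            for row in weight]


def channel_attention(x, w_down, w_up):
    # Squeeze: global average pooling gives one value per channel (C x 1).
    pooled = [[sum(ch) / len(ch)] for ch in x]
    # Excite: reduce to C / r channels, ReLU, expand back to C, sigmoid gate.
    hidden = [[max(0.0, v[0])] for v in conv1x1(pooled, w_down)]
    gate = [sigmoid(v[0]) for v in conv1x1(hidden, w_up)]
    return [[gate[c] * v for v in ch] for c, ch in enumerate(x)]


def local_spatial_attention(x, w_mid, w_out):
    # Two 1x1 convolutions reduce the channel size from C to 1 (assumed ReLU between).
    mid = [[max(0.0, v) for v in row] for row in conv1x1(x, w_mid)]
    mask = [sigmoid(v) for v in conv1x1(mid, w_out)[0]]
    return [[mask[p] * v for p, v in enumerate(ch)] for ch in x]


def global_spatial_attention(x):
    # Every position attends to every other position (simplified non-local form).
    n, num_ch = len(x[0]), len(x)
    cols = [[ch[p] for ch in x] for p in range(n)]  # N x C
    out_cols = []
    for q in cols:
        scores = [sum(a * b for a, b in zip(q, k)) for k in cols]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        agg = [sum(e / total * k[c] for e, k in zip(exps, cols)) for c in range(num_ch)]
        out_cols.append([qi + ai for qi, ai in zip(q, agg)])  # assumed residual
    return [[out_cols[p][c] for p in range(n)] for c in range(num_ch)]


def hybrid_attention(x, params):
    ca = channel_attention(x, params["ca_down"], params["ca_up"])
    lsa = local_spatial_attention(x, params["lsa_mid"], params["lsa_out"])
    gsa = global_spatial_attention(x)
    concat = ca + lsa + gsa                  # channel-wise concatenation: 3C x N
    return conv1x1(concat, params["fuse"])   # 1x1 convolution back to C x N


# Toy feature map with C = 4 channels and N = 4 positions; compression ratio r = 4.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 0.0, 1.0], [2.0, 0.0, 2.0, 0.0]]
params = {
    "ca_down": [[0.1] * 4],                   # C -> C / r
    "ca_up": [[0.1] for _ in range(4)],       # C / r -> C
    "lsa_mid": [[0.1] * 4 for _ in range(2)], # C -> 2 (hypothetical width)
    "lsa_out": [[0.1, 0.1]],                  # 2 -> 1
    "fuse": [[0.1] * 12 for _ in range(4)],   # 3C -> C
}
out = hybrid_attention(x, params)
print(len(out), len(out[0]))
```

The output keeps the C × N shape of the input, so the enhanced features can be passed on to the sequence model without reshaping.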
In micro-video venue recognition, only partial information is available when only spatial features are considered; motion-related features are also crucial for better understanding the venue in a micro-video. Consequently, a frame difference-enhanced module (DEM) is designed to focus on the salient differences between adjacent frames while suppressing irrelevant information. In particular, DEM is realized in a channel-wise manner to supplement the temporal contextual information, which differs from ResNet-DT. As depicted in Fig. 5, an input sequence
The venue of a micro-video stays the same most of the time, and the overall variations are slow, except for pixel-level value changes of people or objects in salient regions. 2D convolutions with a 1×1 kernel are inserted to capture the approximate differences between adjacent frames, which is formulated as:
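A hedged sketch of the frame-difference idea follows, written for a single spatial position because a 1×1 convolution acts on each position independently. The names convd1, convd2 and convd3 come from the Fig. 5 caption, but how they are wired here (which frame each one projects, and what the sigmoid gate multiplies) is an assumption, as is repeating the last result to restore the sequence length. Identity weights are used only to keep the numbers easy to follow.

```python
import math


def linear(vec, weight):
    """A 1x1 convolution at one spatial position; weight is C_out x C_in."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def difference_enhance(frames, w1, w2, w3):
    """frames is a list of T channel vectors; returns T enhanced vectors."""
    outputs = []
    for cur, nxt in zip(frames, frames[1:]):
        a = linear(cur, w1)                       # convd1 on frame t
        b = linear(nxt, w2)                       # convd2 on frame t + 1
        diff = [bj - aj for aj, bj in zip(a, b)]  # element-wise subtraction
        gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(diff, w3)]  # convd3 + sigmoid
        outputs.append([g * aj for g, aj in zip(gate, a)])  # element-wise multiplication
    outputs.append(list(outputs[-1]))  # assumption: repeat the last result to keep T steps
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 1.0], [1.0, 3.0], [1.0, 3.0]]  # channel 1 changes between frames 0 and 1
enhanced = difference_enhance(frames, identity, identity, identity)
for row in enhanced:
    print([round(v, 4) for v in row])
```

In the toy input, the channel that changes between the first two frames receives a gate above 0.5, while unchanged channels receive exactly 0.5, so the changed content is weighted more strongly.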
To further mine the temporal information of the content features, convLSTM is used to learn long-term spatial-temporal information while retaining spatial structures. Venue-related and object-related transformations are modeled in a different form in this sequence model. The output features of HAM are fed sequentially into convLSTM, time step by time step, as in Equation 12:
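With the 1×1 kernel reported in the experimental settings, a convLSTM applies the same LSTM update at every spatial position, which is why the spatial layout is preserved. The sketch below illustrates this in pure Python. It assumes the common gate formulation without peephole terms, and it returns the final hidden state; both choices, along with the weights and sizes, are illustrative assumptions rather than the paper's exact Equation 12.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def convlstm_1x1_step(x, h, c, w_x, w_h, bias):
    """One time step. x is N x C_in; h and c are N x D (N = H * W positions)."""
    d = len(h[0])
    new_h, new_c = [], []
    for x_p, h_p, c_p in zip(x, h, c):
        # Pre-activations of the input, forget, output and candidate gates (4 * D values).
        pre = [
            sum(w * v for w, v in zip(w_x[k], x_p))
            + sum(w * v for w, v in zip(w_h[k], h_p))
            + bias[k]
            for k in range(4 * d)
        ]
        i = [sigmoid(v) for v in pre[0:d]]
        f = [sigmoid(v) for v in pre[d:2 * d]]
        o = [sigmoid(v) for v in pre[2 * d:3 * d]]
        g = [math.tanh(v) for v in pre[3 * d:4 * d]]
        c_new = [f[j] * c_p[j] + i[j] * g[j] for j in range(d)]
        new_c.append(c_new)
        new_h.append([o[j] * math.tanh(c_new[j]) for j in range(d)])
    return new_h, new_c


def run_convlstm(sequence, hidden_size, w_x, w_h, bias):
    """Feed T feature maps in order and return the final hidden state (N x D)."""
    n = len(sequence[0])
    h = [[0.0] * hidden_size for _ in range(n)]
    c = [[0.0] * hidden_size for _ in range(n)]
    for x in sequence:
        h, c = convlstm_1x1_step(x, h, c, w_x, w_h, bias)
    return h


# Toy sequence: T = 3 steps, N = 3 positions, C_in = 2 channels, D = 2 hidden channels.
sequence = [[[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]] for _ in range(3)]
w_x = [[0.1 * (k % 3), -0.05 * k] for k in range(8)]
w_h = [[0.05, -0.05] for _ in range(8)]
bias = [0.0] * 8
final_h = run_convlstm(sequence, 2, w_x, w_h, bias)
print(final_h)
```

Because the weights are shared across positions, two positions with identical input histories end with identical hidden states, while the map itself is never flattened into a single vector.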
In general, a micro-video contains a series of coherent movements. In a Garden, for example, the behaviors of people, such as walking and running, are usually helpful for recognizing the venue category. Following [40], motion information can be processed in the form of dense frames by 3D convolution networks as an auxiliary branch. Specifically, a micro-video is divided into multiple clips, each consisting of 24 frames, so a micro-video can be defined as $M_V = \{V_1, V_2, \ldots, V_t\}$. The clips are processed by 3D CNNs to generate the corresponding motion features $Z_T = \{Z_1, Z_2, \ldots, Z_t\}$, where $V_t$ and $Z_t$ are the t-th clip and its extracted features. An efficient ResNeXt-101 [23] network pretrained on Kinetics [41] is used as the backbone to extract motion features. To learn the temporal dependencies and supplement the content branch, an LSTM is used after the 3D CNN. The last hidden representation of the LSTM is taken as the output of the motion branch, which is represented as
In the fusion stage shown in Fig. 2, two fully-connected layers are applied to fuse the concatenated features. This stage is formulated as follows:
Finally, the cross-entropy loss is minimized with the Adam [42] optimizer. The training process of HAFDN is presented in Algorithm 1.
1: Construct training set M C for content branch, M V for motion branch and labels L;
2: Divide M C and M V into N batches;
3:
4:
5: Content feature maps (X1 ; X2 ; . . . X t ) extracted by ResNet50-DT
6: Use HAM module to get the output H T = (H1 ; H2 ; . . . H t )
7: Use Equation 12 to get the output
8: Use DEM module to get the U T
9: Concatenate
10:
11: Motion feature maps (Z1 ; Z2 ; . . . Z t ) extracted by ResNeXt101
12: Use LSTM and GAP to get the output Mo
13: Use Equation 13 to generate the predicted result Pre
14: Compute the cross-entropy loss from Pre and y
15: Use the back-propagation algorithm to update the weights
16:
17:
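The fusion and loss steps of Algorithm 1 (steps 13 and 14) can be illustrated with a small pure-Python sketch: the two branch outputs are concatenated, passed through two fully-connected layers with a ReLU in between, and scored with cross-entropy. The softmax on the last layer, the toy dimensions, and all weights are assumptions made for this example; the parameter update by Adam is not shown.

```python
import math


def dense(vec, weight, bias):
    """Fully-connected layer; weight is out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def fuse_and_predict(content_feat, motion_feat, w1, b1, w2, b2):
    joint = content_feat + motion_feat                     # concatenate branch outputs
    hidden = [max(0.0, h) for h in dense(joint, w1, b1)]   # first FC layer + ReLU
    logits = dense(hidden, w2, b2)                         # second FC layer, one score per venue
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]                       # assumed softmax over venue classes


def cross_entropy(probs, label):
    """Negative log-likelihood of the true venue class."""
    return -math.log(probs[label])


# Toy features: a 2-D content vector, a 1-D motion vector, and 3 venue classes.
content = [0.5, -1.0]
motion = [2.0]
w1 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b2 = [0.0, 0.0, 0.0]

probs = fuse_and_predict(content, motion, w1, b1, w2, b2)
loss = cross_entropy(probs, label=0)
print([round(p, 4) for p in probs], round(loss, 4))
```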
Dataset and details
The experiments are conducted on the Micro_scene10 [4] and YUP++ [26] datasets. Micro_scene10 is composed of 7,236 samples from 10 different venues, and YUP++ contains 1,200 micro-videos in 20 classes. The duration of each micro-video is about 6 seconds. Following [4] and [26], the Micro_scene10 dataset is split into two halves, one for training and the other for testing, while the YUP++ dataset is split 1:9 for train:test. For brevity, Micro_scene10 is abbreviated as M10.
Only the visual modality of each micro-video is used, and 6 key frames are extracted with FFMPEG in the M10 dataset. For the content feature of each key frame, ResNet-DT generates a tensor with H = 7, W = 7 and C = 2048. For the motion features, a micro-video is divided into 6 clips of 24 frames each, and a 2,048-D feature is extracted for each clip by the Kinetics-pretrained ResNeXt-101. The baselines and the proposed method are all implemented in TensorFlow on a GTX 1080. In HAM, the compression ratio is r = 4. The kernel size of convLSTM is 1×1. The output channel size of DEM is 1,024. The hidden state sizes of convLSTM and LSTM in the two branches are 1,536 and 1,024, respectively. The number of hidden units between the two fully-connected layers is the same as that of the LSTM. The batch size is set to 27 on the M10 dataset and 20 on the YUP++ dataset; the initial learning rate is $10^{-4}$ and decreased to
To validate the proposed method, performance is evaluated by Micro-F1 [43] and Macro-F1 [44]. Micro-F1 weights each sample equally, while Macro-F1 gives each class equal weight. Both metrics take values in the closed interval from 0 to 1, and the best performance corresponds to both Macro-F1 and Micro-F1 being 1.
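The difference between the two metrics can be made concrete with a short pure-Python sketch for single-label, multi-class prediction. The function name `f1_scores` and the toy labels are illustrative only. In this single-label setting, Micro-F1 coincides with accuracy, whereas Macro-F1 is pulled down by any class that is predicted poorly, however rare it is.

```python
def f1_scores(y_true, y_pred, num_classes):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p wrongly
            fn[t] += 1  # missed true class t

    def f1(tp_k, fp_k, fn_k):
        denom = 2 * tp_k + fp_k + fn_k
        return 2 * tp_k / denom if denom else 0.0

    micro = f1(sum(tp), sum(fp), sum(fn))  # pool counts over all samples
    macro = sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / num_classes  # average per class
    return micro, macro


# Toy example: class 0 is frequent and well predicted, class 2 is rare and always missed.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 0]
micro, macro = f1_scores(y_true, y_pred, num_classes=3)
print(round(micro, 4), round(macro, 4))
```

Here five of seven samples are correct, so Micro-F1 is 5/7, while the per-class F1 values of 0.8, 2/3 and 0 average to a noticeably lower Macro-F1.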
Results and analysis
Visual backbone and sequence model
In this subsection, several classic neural networks are chosen as the backbone to extract basic content features from the M10 dataset. Generally speaking, performance is directly influenced by the quality of the features. To evaluate the backbones efficiently, all features are obtained by a pooling layer after the last convolution layer, and the results are obtained with the ACSL [4] method. The raw image resolution of each key frame is 480. Without loss of generality, ResNet50 is adopted as the base backbone in this part. The experimental results are summarized in Table 1.
Table 1. Performance comparison of different backbones
Several conclusions can be drawn: (1) VGG16 has the most parameters of all backbones, but its performance is not the best; there is an obvious upper limit on its performance. (2) The structural adjustments of ResNet50 described in [37] significantly improve performance and can be treated as a useful way to avoid losing information in the feature extraction stage. In particular, ResNet50-B gains 2.4% and 3.3% in Micro-F1 and Macro-F1 over ResNet50. The further improved ResNet50-C and ResNet50-D increase performance noticeably, and ResNet50-D is much better than ResNet101 with fewer parameters. (3) Inception-v3 does not perform as well as ResNet50 at high image resolution and achieves the second-worst result of all backbones. Inception-ResNet-v2, the combination of ResNet and Inception, achieves a better result than either single model. (4) DenseNet121 and DenseNet201 extract features efficiently, since they have far fewer parameters than the other models. (5) At a similar model size, ResNet50-D is much better than Inception-v3, with margins of 4.1% and 2.1% on the two metrics, respectively, and ResNet101 outperforms Inception-ResNet-v2 by 2.1% and 0.9%. Increasing the width and depth of the model effectively raises performance, as with DenseNet, Inception-v3, ResNet, Inception-ResNet-v2 and EfficientNets; all of these models are deeper and wider than VGG16 but have fewer parameters. (6) EfficientNet-B5 and B7 differ in input image resolution. B7 performs worse than B5 because its input resolution is larger than the raw image, and simple up-sampling may introduce noise and degrade the model. (7) The best performance is achieved by ResNet152, with 85.7% in Micro-F1 and 82.7% in Macro-F1. The performance of ResNets improves noticeably as the model scale grows: ResNet152 is much better than ResNet101 and ResNet50. ResNet50-DT achieves the second-best results of all models, trailing ResNet152 by only 0.1% and 0.2% on the two metrics. As a trade-off between performance and efficiency, ResNet50-DT is chosen as the backbone to extract visual features.
Then, experiments are conducted to choose the sequence model in the content branch, which follows the HAM module to capture long-term temporal information. Table 2 shows the performance of different sequence models when only the HAM module is considered.
Table 2. Performance comparison of different sequence models in the content branch
Among all sequence models, convLSTM achieves the best performance, with 85.2% in Micro-F1 and 81.9% in Macro-F1. It improves Macro-F1 by 0.4% over the second-best method, GFCBs, and it equals EASTERN in Micro-F1 while being 1% better in Macro-F1. The worst performance is produced by TRUMMAN, which ignores temporal information. Compared with LSTMs, convLSTM brings an obvious gain of 0.7% and 1.2% on the two metrics. This is because convLSTM uses convolution layers instead of fully-connected layers, which retains spatial structures while modeling long-term temporal information. In contrast, the other methods take 1-D tensors as input, so the spatial structure is destroyed in the temporal modeling stage. Therefore, convLSTM is chosen as the sequence model in the content branch.
The performance comparison of the proposed model and other state-of-the-art methods on M10 dataset are displayed in Table 3. TRUMMAN [6] and NMCL [3] methods use three modalities such as visual, textual and acoustic modalities to generate the final venue representation. NNeXtVLAD [17], ACSL [4], HA-TSFN [5] and HAFDN methods all only consider visual features without additional modalities. From such table, the following observations are got: (1) The worst performance is achieved by TRUMMAN, which ignores the temporal information and simply uses spatial information. (2) The same issue also exists in NMCL. The performance of TRUMMAN is exceeded by NMCL method, which indicates that the performance is benefited from the attention mechanism and relationships among different modalities. Without making full use of temporal and spatial information, there is a clear gap between such two methods and other methods. (3) When considering the inconsistency within the same class and temporal information, the performance of ACSL method is much better than such two methods. The effectiveness of temporal information is verified in venue prediction. (4) The video-level features are well captured by NNeXtVLAD+ which considers all frames in equal weight to cluster final venue representations. However, the input features are processed in 2D tensor forms, the spatial information is not captured well when learning temporal dependencies. (5) The second-best performance is achieved by HA-TSFN method which integrates the attention module and more spatial information to improve the performance. (6) The differences among adjacent frames and the pair-wise relationships of any two positions in the spatial dimension of features are neglected in HA-TSFN. From Table 3, the best performance is achieved by our proposed model HAFDN with 88.1% and 85.2% on Micro-F1 and Macro-F1, respectively. 1.3% and 1.8% margins are increased than HA-TSFN. 
With ResNet152 as the backbone, HAFDN gains a further 0.1% on each of the two metrics.
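For reference, Micro-F1 and Macro-F1 for single-label venue prediction can be computed as in the following sketch, which follows the standard definitions of the two metrics. The function name, venue labels and toy predictions are hypothetical and are not taken from the M10 experiments.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools the counts of all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages the per-class F1 scores with equal weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy example with two venue categories.
y_true = ["park", "park", "park", "park", "gym", "gym"]
y_pred = ["park", "park", "park", "park", "park", "gym"]
micro_f1, macro_f1 = f1_scores(y_true, y_pred)
```

Micro-F1 pools all decisions and is therefore dominated by frequent venue categories, while Macro-F1 gives every category the same weight, which is why the two scores can differ noticeably on imbalanced data.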
Performance of state-of-the-art methods on M10 dataset
The number of parameters is analyzed for the proposed HAFDN and for baselines such as NMCL [3] and HA-TSFN [5]. HAM and DEM both increase the number of parameters: HAM operates on multiple feature maps to highlight different valuable areas, and DEM captures the differences among frames in a frame-by-frame manner. In return, both the additional attention mechanisms and the frame difference enhanced module effectively enhance the feature maps. With a similar number of parameters, HAFDN without DEM outperforms HA-TSFN, which has the second-best performance of all methods, by 0.6% and 0.7% on Micro-F1 and Macro-F1, respectively. In these results, models with fewer parameters perform worse: TRUMMAN has both the fewest parameters and the worst performance, while the proposed HAFDN has both the most parameters and the best performance. Its margins of 1.3% on Micro-F1 and 1.8% on Macro-F1 come with three times the number of parameters.
The ablation studies of HAFDN are presented in Table 4. Each module, and each combination of modules, has a positive influence on recognition performance. Discriminative feature representations are generated via HAM, DEM, the sequential module, and the motion branch of the proposed framework. In particular, adding the motion branch alone yields improvements of 2.4% and 1.5% on Micro-F1 and Macro-F1. DEM effectively models the differences between adjacent frames, gaining 1.8% on Micro-F1 and 1% on Macro-F1 compared with the content branch alone. HAM brings a larger improvement than DEM on both the content branch and the whole framework, which indicates the effectiveness of HAM. When only the content branch is considered, the gain from HAM exceeds that from DEM by 0.7% and 1.3% on the two metrics.
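The idea of learning from content variations among adjacent frames can be sketched as follows. This is a simplified illustration under stated assumptions, not the DEM implementation: the difference is a plain subtraction of consecutive frame features, the last time step is padded with zeros, and the difference is simply added back to the frame feature. The learned layers of the actual module are omitted, and all function names are hypothetical.

```python
def adjacent_differences(frames):
    """frames: list of T feature vectors; returns T-1 difference vectors."""
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])]


def enhance_with_differences(frames):
    """Add each frame's difference to its successor back onto the frame.

    Assumption: the last frame has no successor, so its difference is zero.
    """
    diffs = adjacent_differences(frames)
    diffs.append([0.0] * len(frames[0]))
    return [[f + d for f, d in zip(frame, diff)]
            for frame, diff in zip(frames, diffs)]


# Three toy frame features with two dimensions each (hypothetical values).
frames = [[1.0, 2.0], [1.0, 4.0], [3.0, 4.0]]
diffs = adjacent_differences(frames)
enhanced = enhance_with_differences(frames)
```

In this toy case the first difference is non-zero only in the second dimension and the second difference only in the first, so each difference vector isolates what changed between two consecutive frames.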
Performance comparison of different modules
When both content and motion information are considered, HAM scores a further 0.6% and 0.8% higher than DEM on Micro-F1 and Macro-F1. The best performance is achieved by the combination of all sub-modules, with 88.2% on Micro-F1 and 85.2% on Macro-F1, which is 0.8% and 1.1% above the second-best result. Compared with the combination of content and motion information without the other sub-modules, this is an improvement of 2% and 2.6% on the two metrics. The improvement from the combination of convLSTM and DEM is less pronounced than that from HAM, which demonstrates that these two parallel modules together do not augment the visual representations as well as HAM does. Whether only the content branch or the whole framework is considered, the improvement from HAM on the two metrics is much larger than that from the combination of convLSTM and DEM or from either module alone. The three kinds of attention-enhanced features are thus integrated well by HAM.
The loss curves are shown in Fig. 6. The train and test curves give the loss of HAFDN at different epochs, and the remaining curves give the loss of the same method without the DEM or HAM module. The influence of HAM mentioned above can also be seen intuitively from these curves: on both the train and test sets, the variant without DEM performs better than the variant without HAM.

The loss curves of HAFDN and of its variants without DEM or HAM on the train and test sets.
To determine which fusion method to use in the HAM module, several experiments are conducted. Addition, max, concatenation and convolution are selected as fusion methods to combine the three types of feature maps, and the corresponding results are displayed in Table 5. The convolution method achieves the best performance of the four on both Micro-F1 and Macro-F1, which indicates that it encodes the three types of attention feature maps effectively.
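The four fusion options can be sketched on toy data as follows. This is a minimal illustration, assuming that each attention-enhanced feature map is stored as a list of channels over flattened positions and that the convolution method is a 1 x 1 convolution over the concatenated channels; the kernel size, weights and function names are assumptions made for this example and are not taken from the HAFDN implementation.

```python
def fuse_add(maps):
    """Element-wise sum of feature maps shaped [C][N]."""
    return [[sum(vals) for vals in zip(*chs)] for chs in zip(*maps)]


def fuse_max(maps):
    """Element-wise maximum of feature maps shaped [C][N]."""
    return [[max(vals) for vals in zip(*chs)] for chs in zip(*maps)]


def fuse_concat(maps):
    """Stack the maps along the channel dimension."""
    return [ch for m in maps for ch in m]


def fuse_conv1x1(maps, weights, bias):
    """Assumed 1 x 1 convolution: a learned linear mix of stacked channels."""
    stacked = fuse_concat(maps)
    n = len(stacked[0])
    return [[sum(w * ch[i] for w, ch in zip(w_row, stacked)) + b
             for i in range(n)]
            for w_row, b in zip(weights, bias)]


# Toy maps with one channel and two positions each (hypothetical values).
ca = [[1.0, 2.0]]
lsa = [[3.0, 0.0]]
gsa = [[2.0, 5.0]]
maps = [ca, lsa, gsa]

added = fuse_add(maps)
maxed = fuse_max(maps)
stacked = fuse_concat(maps)
mixed = fuse_conv1x1(maps, weights=[[0.5, 0.25, 0.25]], bias=[0.0])
```

Addition and max use fixed rules, and concatenation defers the mixing to later layers, whereas the convolution learns how much each attention type contributes to every output channel.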
Performance comparisons of fusion methods in HAM
The evolution of the HAM module is then shown in Table 6. To better verify the influence of HAM, the DEM module is not considered in these experiments; the other experimental settings are consistent with the experiments above. In HAM, visual representations are enhanced by three attention mechanisms. Channel attention improves performance by 0.6% and 0.4% compared with the raw visual feature maps, because it weights the channels with different scores to highlight the important information. LSA highlights the salient areas of the features by reweighting the distribution of important features, and yields improvements of 0.7% and 0.3% on Micro-F1 and Macro-F1, respectively. GSA considers the pair-wise relationships between any two positions of the feature maps in a global form. GSA, which differs from LSA in how it treats the spatial dimension, yields improvements of 0.1% and 0.2%. For combinations of attention mechanisms, a sequential order is denoted by “-”. For example, CA-LSA means that CA is first used to enhance the features, and the enhanced features are then passed to LSA to further highlight important features in the spatial dimension. CA-LSA-GSA is defined in the same way as CA-LSA and differs only in the number of attention mechanisms. CA+LSA means that the features enhanced by CA and by LSA are concatenated along the channel dimension for further processing. The results show that combining attention mechanisms by concatenation performs much better than applying them in sequential order. The sequential combinations even perform worse than a single attention mechanism. The salient areas and the contextual information do not both benefit under such a sequential scheme. By contrast, the best performance is obtained by CA+LSA+GSA, with 86.6% on Micro-F1 and 83.7% on Macro-F1.
This result shows the necessity of attention in the framework. As the number of attention types increases, the parallel structure enhances the visual feature maps better: CA+LSA+GSA improves further on the combination of any two attention mechanisms.
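The difference between the sequential ("-") and parallel ("+") arrangements can be made concrete with a small sketch. The gates below are deliberately simplified assumptions: channel attention scales each channel by the sigmoid of its mean, and local spatial attention scales each position by the sigmoid of its channel average. The learned layers of the actual attention mechanisms are omitted, and the function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x):
    """x: [C][N]. Scale each channel by a gate from its mean (assumed form)."""
    return [[v * sigmoid(sum(ch) / len(ch)) for v in ch] for ch in x]


def local_spatial_attention(x):
    """Scale each position by a gate from its channel average (assumed form)."""
    n = len(x[0])
    gates = [sigmoid(sum(ch[i] for ch in x) / len(x)) for i in range(n)]
    return [[v * g for v, g in zip(ch, gates)] for ch in x]


def ca_then_lsa(x):
    """CA-LSA: LSA only ever sees features already reweighted by CA."""
    return local_spatial_attention(channel_attention(x))


def ca_plus_lsa(x):
    """CA+LSA: both see the raw features; outputs are stacked on channels."""
    return channel_attention(x) + local_spatial_attention(x)


# Toy feature map with two channels and two positions (hypothetical values).
x = [[1.0, -1.0], [2.0, 0.0]]
seq_out = ca_then_lsa(x)  # 2 channels
par_out = ca_plus_lsa(x)  # 4 channels
```

In the sequential form the second attention never sees the raw features, while in the parallel form every attention type works on the same input and a later fusion step decides how to combine them.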
Performance comparisons of different attention types and the sequence in HAM
The Gradient-weighted Class Activation Mapping (Grad-CAM) [45] method is used to visualize the feature maps produced by the different attention mechanisms. Because the visualization may be negatively influenced by temporal dependencies and by differences between adjacent frames, all submodules except HAM are excluded from the framework to better visualize the results.
In Fig. 7, red (blue) indicates a higher (lower) attention weight, so the influence of the different attention mechanisms can be seen directly. As shown in Fig. 7(b), channel attention focuses on the global contextual information across different key frames of the same micro-video. LSA and GSA tend to focus on objects related to the venue: as illustrated in Fig. 7(c) and (d), they attend to persons or certain objects. The visualization results thus show that venue information is captured well by CA and that object-related information is also enhanced. The response of GSA is much sparser and more precise than that of LSA, owing to the pair-wise relationships and the softmax function applied in global form. The semantic information of a micro-video can therefore be well represented and improved by the HAM module in the proposed model.
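The sparse response of a softmax over pair-wise relationships can be illustrated with the following sketch. It assumes a plain dot-product similarity between positions without learned projections, which is a common simplification of this kind of global attention and not necessarily the exact form used in GSA; the function names and toy features are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def global_spatial_attention(positions):
    """positions: [N][C]. Returns (weights [N][N], attended features [N][C])."""
    # Pair-wise similarity between any two positions (assumed dot product).
    scores = [[sum(a * b for a, b in zip(p, q)) for q in positions]
              for p in positions]
    # Each position distributes a total weight of 1 over all positions.
    weights = [softmax(row) for row in scores]
    dim = len(positions[0])
    attended = [[sum(w * q[d] for w, q in zip(w_row, positions))
                 for d in range(dim)]
                for w_row in weights]
    return weights, attended


# Positions 0 and 2 share a strong feature; position 1 is unrelated.
positions = [[4.0, 0.0], [0.0, 1.0], [4.0, 0.0]]
weights, attended = global_spatial_attention(positions)
```

Because the softmax is taken over all positions at once, the two similar positions receive almost all of each other's weight and the unrelated position receives nearly none, which is the kind of sparse, selective response described above.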

Attention visualization results. The key frames are extracted from micro-video No. 3128 of the M10 dataset.
To further verify the effectiveness of the proposed method, additional experiments are conducted on the YUP++ dataset for comparison with several SOTA methods. The experimental results are shown in Table 7. YUP++ is split into two parts with the same number of samples: one captured by a static camera and the other by a moving camera. Four key frames are extracted per second, and each motion clip consists of 32 consecutive frames. The remaining settings are the same as for the M10 dataset.
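The sampling settings above can be sketched as index computations. This is only an illustration under assumptions not stated in the text: key frames are taken at uniform intervals, motion clips are non-overlapping, and the frame rate of 30 frames per second in the example is hypothetical, as are the function names.

```python
import math


def key_frame_indices(num_frames, fps, per_second=4):
    """Indices of uniformly spaced key frames, `per_second` per second."""
    step = fps / per_second
    count = math.ceil(num_frames / step)
    return [int(k * step) for k in range(count)]


def motion_clips(num_frames, clip_len=32):
    """Half-open (start, end) ranges of non-overlapping consecutive clips."""
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, clip_len)]


# A hypothetical 2-second video at 30 frames per second.
keys = key_frame_indices(num_frames=60, fps=30)
clips = motion_clips(num_frames=64)
```

With these assumptions a 2-second video yields eight key frames for the content branch, and a 64-frame video yields two 32-frame clips for the motion branch.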
Performance comparisons of different methods on YUP++
According to the results, the proposed method achieves the second-best performance on both the stationary and the moving parts. The best performance on the stationary part is obtained by MOSE (C3D) with 97.0%, and the best on the moving part by HAF+Bow/FV (I3D) with 89.6%. The proposed method uses only visual information, whereas these two models use external knowledge, such as optical flow and another dataset, to enhance their robustness. NNeXtVLAD+ suffers a sharp performance degradation because of its large number of parameters and serious over-fitting. The performance of ACSL is slightly lower than that of SVMP and GRP. HAFDN improves on HA-TSFN by 0.8% and 2.6%, especially on the moving data. The results also demonstrate that the proposed method is effective on both the M10 and YUP++ datasets. The main reason for the high performance on YUP++ is the quality of the dataset: M10 is collected from a social platform, has more samples than YUP++, and is shot with mobile phones rather than dedicated cameras.
In this paper, a frame difference enhanced network with hybrid attention is proposed for micro-video venue recognition. The model focuses on the visual modality: it learns temporal information in the visual feature extraction stage through ResNet-DT and enhances the extracted features within a two-branch structure. Two modules, HAM and DEM, are devised to enhance the extracted features from different perspectives. First, a novel hybrid-attention module, HAM, is introduced in the content branch to highlight the important information in the extracted visual features via different attention mechanisms. Then, to account for the differences between adjacent frames, DEM is devised in parallel to enrich the extracted features. From feature extraction to feature enhancement, these modules serve to boost the performance of the proposed model. The motion information from the motion branch is then integrated to supplement the content branch in the final comprehensive venue representation. Extensive experiments are conducted on two different datasets to verify the effectiveness of the proposed model in terms of Micro-F1 and Macro-F1 scores. The visualization results of HAM also increase the interpretability of the model. The frame variations are effectively captured by the proposed model, especially in the moving part of the YUP++ dataset.
Our work also has limitations, which can guide our future work: (1) A large number of parameters are introduced by the HAM and DEM modules. To mine more useful information from the visual modality of micro-videos, the input feature maps are extracted as 4D tensors, which requires more parameters and increases the computation cost of training. Lightweight modules and more effective visual feature extraction are the focus of future research. (2) The temporal information in micro-videos is processed in different forms, such as convLSTM, LSTM, and DEM, to enrich the expressiveness of temporal features. The relationships among these forms of information are not yet carefully considered so as to distinguish their common and unique components.
Acknowledgements
The authors would like to thank the reviewers for their valuable comments. This research is supported by the National Key Research and Development Program of China (No. 2019YFB1406201, 2020YFB1406800), the National Natural Science Foundation of China under Grant (No. 62071434), and the Fundamental Research Funds for the Central Universities (Grant No. CUC21GZ010, CUC210B017).
