Abstract
Unconstrained video face recognition is an extension of face recognition technology, and it is an indispensable part of intelligent security and criminal investigation systems. However, general face recognition technology cannot be directly applied to unconstrained video face recognition, because the video contains fewer frontal face image frames and a single image contains less face feature information. To address the above problems, this work proposes a Feature Map Aggregation Network (FMAN) to achieve unconstrained video face recognition by aggregating multiple face image frames. Specifically, an image group is used as the input of the feature extraction network to replace a single image to obtain a multi-channel feature map group. Then a quality perception module is proposed to obtain quality scores for feature maps and adaptively aggregate image features from image groups at the feature map level. Finally, extensive experiments are conducted on the challenging face recognition benchmarks YTF, IJB-A and COX to evaluate the proposed method, showing a significant increase in accuracy compared to the state-of-the-art.
Introduction
With the wide application of deep learning in computer vision, many applications have been developed, such as face recognition [1, 2], face anti-spoofing [3], self-driving cars [4], pedestrian re-identification [5], and facial expression recognition [6]. Among them, face recognition, as a basic application, has received much attention from researchers. Early face recognition methods mainly targeted static images, and a series of challenging datasets emerged, such as [7–12]. However, face recognition methods based on static images perform poorly on video faces. Compared with static face images, unconstrained moving faces in videos exhibit large pose changes and motion blur, and most video frames contain incomplete face information. Therefore, using the multi-view and temporal information of the video face sequence is an effective way to characterize the video face [13, 14].

Fig. 1. Similarity measurement. The left part shows a frontal static face image and a set of large-pose face images of the same person; the aggregated feature has a smaller distance from the static image feature. The right part shows two sets of images of different people, with a larger distance between the aggregated features.
Compared with static face images, unconstrained moving faces suffer from problems such as pose changes, motion blur, and incomplete face information. Using video face sequences to represent video faces is a feasible solution because they contain multiple views and temporal information. Video face recognition scenarios include video to static (V2S), static to video (S2V), and video to video (V2V). Specifically, in V2S, the face gallery is composed of static images, and the target face is a video sequence. The image-to-image matching problem has thus evolved into the more challenging image-to-video and video-to-video matching problems. An intuitive solution is to convert the video face into the same representation as the image face, for example by using key frames of the video sequence as the representation of the video face. However, incomplete video face information limits the performance of this approach. Therefore, aggregating multiple face images into a single video face representation is better suited to video face recognition.
There has been some outstanding work on video face aggregation [15–19], which aggregates multiple video frames to represent a video face as an image or as a highly compact feature vector. These methods have been successful to a certain extent. However, we argue that aggregation at the image level or the feature vector level has certain limitations. Aggregation at the image level introduces too much noise, which degrades the aggregation result [20, 21]. Aggregation at the feature vector level ignores the differences between images in each feature channel [15, 23]. Therefore, aggregating feature maps separately in each feature channel at a high feature level may be a better solution.
Based on the above analysis, this paper proposes a feature map aggregation network (FMAN) for video face recognition. FMAN aggregates the face images of multiple video frames at the feature map channel level. In detail, a quality perception module is used to obtain the quality score of each feature map channel, and the quality scores are then used as weights to adaptively aggregate the feature maps of multiple images into a more effective feature map in the same channel. In this way, the aggregated features of multiple images are consistent with the feature types of a single image. We measured the similarity between an image and an image set, and between two image sets (see Fig. 1). The results show that the aggregated vector is more beneficial for face representation. More convincingly, the evaluation results on three differently distributed face datasets show that FMAN is more effective than other aggregation methods.
In summary, the contributions of the paper are as follows: (1) A Feature Map Aggregation Network (FMAN) is proposed to aggregate a set of images into the same feature space as image-based face recognition. The aggregated features have richer face information and can be applied to video face recognition. (2) A quality-aware module is proposed to adaptively learn quality scores for different feature maps. This makes more parameters available for learning the aggregation and keeps the aggregation in each feature channel more independent. (3) We evaluate the proposed method on the challenging datasets YTF, IJB-A, and COX, and the results show competitive performance.
The remainder of this paper is organized as follows. Section 2 describes related work, including general face recognition and video face recognition. Section 3 details our motivation and the proposed method. Section 4 reports the details of the experiments, including the datasets, implementation details, evaluation results, and ablation studies. Finally, Section 5 presents the conclusions of this paper.
Deep face recognition
In the past, face recognition developed slowly and was mainly based on traditional methods [24]. Driven by the deep learning technology that AlexNet used in the ImageNet competition in 2012 [25], DeepFace [2] achieved the best performance on the well-known LFW benchmark in 2014 and approached human-level performance for the first time (DeepFace: 97.35% vs. human: 97.53%). Since then, the focus of face recognition has turned to deep learning methods, which have made great progress in recent years.
Representative networks include AlexNet [25], VGGNet [12], GoogLeNet [26], and ResNet [27], and many mainstream face recognition algorithms are built on them. Among them, ResNet addresses the problem that some parameters cannot be learned effectively when the network is too deep, and achieves better performance. As for loss functions, Euclidean-distance-based losses were the main direction of early research, such as triplet loss [28], center loss [7], and COCO loss [8]. Triplet loss guides the model to learn a feature mapping in which the distance from a triplet anchor to a negative sample is larger than the distance to a positive sample. Center loss maintains a center for each category and minimizes the distance between each sample in the mini-batch and its category center, thereby reducing the intra-class distance. COCO loss introduces cosine distance, which increases computational complexity but improves performance. After 2016, cosine-based losses gained more attention. For example, A-Softmax [9] expresses the softmax loss in terms of cosines and introduces a margin to further enlarge the separation in angular space. CosFace [10] applies L2 normalization to the features and the weight vectors to remove the influence of radial variation, and introduces a cosine margin to minimize intra-class variance and maximize inter-class variance. ArcFace [11] applies the margin in angular space instead of the cosine space used by CosFace to maximize the classification boundary. In addition to these mainstream methods, other specialized deep networks also perform well under specific constraints, such as the lightweight networks SqueezeNet [29] and MobileNet [30], which are more suitable for mobile devices because of their fast inference speed. The performance of the above methods is shown in Table 1.
Table 1. Performance of deep face recognition methods
Existing video face recognition methods fall mainly into two categories: video-based and image-set-based [20]. In video-based methods, each video is regarded as an ordered image sequence, and both the image information and the inter-frame information are used for identification [19, 32]. These methods address the general video face recognition problem by modeling the relationship between frames. In image-set-based methods, each video is regarded as a group of images, and the image-to-video and video-to-video relationships are treated as point-to-set and set-to-set relationships, respectively [16, 34].
Motivation
Recently, owing to the excellent performance of CNNs in feature extraction, many studies have tried to aggregate the CNN-extracted facial features of video frames to obtain a single representation of the video or image set. Most of these methods aggregate compact face representations extracted from deep networks, and the most direct approach is max or average pooling [31, 36]. However, the performance of pooling is limited by the number of samples and by noise. Therefore, weighted aggregation of feature vectors by predicting quality scores has become a more effective method [38, 39]. For example, Ranjan [15] introduced face detection scores into quality discrimination, while Yang [37] and Liu [18] predicted quality scores from the input images and then weighted each feature vector. More recently, Gong [23, 25] achieved aggregation through multi-mode recurrent networks that pay more attention to high-quality feature vectors, with better performance. The aggregation network proposed in this paper is similar to these works, but instead of aggregating at the feature vector level, we aggregate at the feature map level, where the features are relatively independent and more parameters can be learned.

Fig. 2. Training and testing process based on DCNN. The upper part is the training process, and the lower part is the inference process.
We summarize the general flow of a face verification system from model training to face verification (see Fig. 2). During training, the face image is fed into the deep network to obtain a vector representation of the specified dimension, and the loss is then computed through the fully connected layer and the loss function. Gradient descent with backpropagation is used to update the network parameters, yielding the face representation model. During inference, the model maps the face image to a vector, and the face is verified by comparing a distance measurement against a threshold.
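The inference step can be illustrated with a minimal Python sketch. It assumes cosine similarity as the distance measurement and a placeholder threshold of 0.5; both are illustrative assumptions rather than the settings used in this paper, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(a, b, threshold=0.5):
    """Accept the pair as the same identity if the similarity reaches the threshold.

    The default threshold is a placeholder (assumption); in practice it is
    tuned on a validation set.
    """
    return cosine_similarity(a, b) >= threshold
```

For example, `verify([1.0, 0.0], [1.0, 0.0])` accepts the pair, while two orthogonal vectors are rejected.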
However, in more unconstrained real scenes, a single face image contains only part of the face information, for example under partial occlusion or in a profile view. In this case, even the best face representation model cannot obtain sufficient face features. Therefore, aggregating multiple face images has become a popular solution. An intuitive method is to average the feature vectors of a group of face images, which is called average pooling [31, 35]. However, average pooling can offset the noise only when the sample size is large enough. Therefore, [37] and [18] reduce the impact of noise in aggregation by learning a weight for each feature vector, and the aggregation method of learning quality scores can be expressed by the following formula:
However, a low-quality face may still be strongly represented in some feature components, and weighting a feature vector as a whole weakens these useful features of low-quality face images. Therefore, [23] performs feature aggregation by learning a weight for each component of each feature vector, thereby enhancing the facial feature representation and weakening the influence of noise. The aggregation method is expressed as follows:
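As an illustration of quality-weighted aggregation of feature vectors, the following Python sketch contrasts average pooling with a weighted sum. It assumes that raw quality scores are normalized with a softmax; the exact normalization used in [37] and [18] may differ, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_pool(vectors):
    """Average N feature vectors component by component."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def quality_weighted_pool(vectors, quality_scores):
    """Weighted sum of N feature vectors, one weight per vector.

    The weights are the softmax of the raw quality scores (assumption).
    """
    weights = softmax(quality_scores)
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*vectors)]
```

With equal quality scores the weighted sum reduces to average pooling; a much higher score for one vector makes the result approach that vector.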
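Component-wise weighting can be sketched in the same way. The Python example below assumes one raw score per component of each feature vector and a softmax taken across the N vectors independently for each component; this normalization is an assumption made for illustration, not necessarily the formulation of [23], and the function name is hypothetical.

```python
import math


def componentwise_pool(vectors, component_scores):
    """Aggregate N feature vectors of length K with one weight per component.

    vectors[i][k] is component k of vector i, and component_scores[i][k] is
    its raw score. For each k, the scores are softmax-normalized across the
    N vectors (assumption) before the weighted sum.
    """
    pooled = []
    for values, scores in zip(zip(*vectors), zip(*component_scores)):
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        pooled.append(sum(e / total * v for e, v in zip(exps, values)))
    return pooled
```

Unlike vector-level weighting, a vector that is unreliable overall can still dominate the components in which it scores highly.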

Fig. 3. Overall framework of the aggregation network.
The above methods have brought improvements, but we believe that directly aggregating compact feature vectors, although it yields a compact face representation, is not conducive to the aggregation of face features. Feature maps carry lower-level feature information than feature vectors, and their channels are more independent. Therefore, this paper proposes a feature-map-level aggregation method. Considering the independence of each channel, we introduce quality perception to aggregate the input image set separately in each channel. Because the quality scores are converted into a probability representation, the aggregated feature map has the same structure and meaning as the feature map of a single image. Following image-based processing methods, we then stack appropriate convolutional layers and use global convolution to obtain a face representation that incorporates the features of multiple images.
The network structure of FMAN is shown in Fig. 3. First, the deep convolutional neural network extracts the multi-channel features of each input face image, and the output features are used as the input of the aggregation network. In the aggregation module, the multi-channel features corresponding to multiple faces are reconstructed and aggregated. The aggregated multi-channel feature map has the same form and meaning as that of a single image. Finally, as in single-image processing, the aggregated multi-channel feature map passes through two convolutional layers and a global convolution to obtain a more compact vector representation.
Feature map extraction module
FMAN's feature map extraction module is a popular deep convolutional neural network; this paper chooses ResNet101 for its outstanding performance [27]. The difference is that the feature output of FMAN is a multi-channel feature map. Specifically, we keep the conv1-conv5 part of ResNet101 and discard the average pooling, 1000-d fc, and softmax layers after conv5. The multi-channel feature map output by conv5 is used as the input of the aggregation module.
Aggregation module
Suppose T = {I_1, I_2, …, I_N} is a group of face images to be aggregated, and let H(·) be the basic DCNN module of FMAN. For each face image I_i, the feature map is f_i = H(I_i), so the feature maps of the group are F = {f_1, f_2, …, f_N}. Here, f_i is three-dimensional data with D channels, each of size w × h. Each channel of f_i represents part of the features extracted by the basic DCNN module to distinguish faces. Therefore, we aggregate the features of each channel separately.
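Aggregating each channel separately relies on regrouping the N feature maps by channel. A minimal Python sketch of this regrouping follows, assuming each feature map is stored as a nested list indexed as [channel][row][column]; the storage layout and the function name `regroup_by_channel` are assumptions made for illustration.

```python
def regroup_by_channel(feature_maps):
    """Regroup N multi-channel feature maps into D per-channel groups.

    feature_maps[i][d] is the d-th channel (an h x w grid) of image i.
    The result groups[d][i] refers to the same grid, so that all N maps
    of one channel can be processed together.
    """
    num_channels = len(feature_maps[0])
    return [[fmap[d] for fmap in feature_maps] for d in range(num_channels)]
```

For N images with D channels each, the result contains D groups of N grids.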
To aggregate the features of each channel, the feature map set is first reconstructed according to the number of channels. Suppose F′ is the set of feature maps after reconstruction; then F′ contains D groups, where the d-th group collects the d-th channel (of size w × h) of all N feature maps in F.
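We design a feature map quality perception module to estimate the quality score of each feature map. Since the quality score of a feature map depends on all the feature maps of the channel in which it is located, the quality perception module takes the full set of feature maps of one channel as input and produces a quality score for each of them. These scores are used as weights to aggregate the channel into a single feature map, and the aggregated channels together form F″.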

Fig. 4. Quality perception module of feature maps.
Since the aggregated F″ has the same meaning and size as the feature map f_i of a single face image, methods used for image features can further process F″ to obtain a more compact feature representation. We add two convolutional layers after the aggregation layer to enhance the representation of the aggregated features, and use the ReLU function at appropriate locations to increase the nonlinearity. In addition, a global convolutional layer and a fully connected layer replace the single fully connected layer at the end of the network, to obtain a compact feature vector representation and an output of the specified dimension, respectively.
The quality perception module reads the set of feature maps in a reconstructed single channel as the perception object and generates a quality score for each feature map. We believe that the richness of local face information in a feature map should not be judged from the feature map alone; all the feature maps in the channel need to be balanced so that feature maps with less information are suppressed while those with more information receive more attention. To this end, we first design a shallow neural network to map each feature map into the quality space, and then convert all the mapped values into probability scores through a softmax operation.
The quality score learning module we designed is shown in Fig. 4. It consists of three convolutional layers, three fully connected layers, and a softmax layer. The input feature maps first pass through two convolutional layers in parallel, and a global convolutional layer then reduces each channel to a set of feature values. Finally, three fully connected layers reduce this set of feature values to a single feature value. As a result, the input feature map set is converted into a set of feature values containing quality information. These output feature values are then converted into quality scores through softmax, which facilitates weighted aggregation within the channel. The specific formula is as follows:
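The weighted aggregation within each channel can be sketched in Python as follows. The learned quality network is replaced here by a hypothetical placeholder (`placeholder_quality`, the mean activation of a feature map), since the trained layers are not reproduced; only the softmax normalization across the N feature maps of a channel and the weighted sum follow the description above.

```python
import math


def placeholder_quality(channel_map):
    """Hypothetical stand-in for the learned quality network: mean activation."""
    values = [v for row in channel_map for v in row]
    return sum(values) / len(values)


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_channel(maps, score_fn=placeholder_quality):
    """Aggregate the N grids (h x w) of one channel into a single h x w grid."""
    weights = softmax([score_fn(m) for m in maps])
    h, w = len(maps[0]), len(maps[0][0])
    return [
        [sum(wt * m[r][c] for wt, m in zip(weights, maps)) for c in range(w)]
        for r in range(h)
    ]


def aggregate_feature_maps(feature_maps, score_fn=placeholder_quality):
    """feature_maps[i][d] is channel d of image i; returns D aggregated grids."""
    num_channels = len(feature_maps[0])
    return [
        aggregate_channel([fmap[d] for fmap in feature_maps], score_fn)
        for d in range(num_channels)
    ]
```

Because the weights of each channel sum to one, the aggregated output keeps the shape and value range of a single-image feature map, which is what allows the subsequent layers to treat it like an ordinary feature map.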
As an end-to-end network, FMAN offers two parameter learning modes: (1) train the whole FMAN, so that the parameters of every part are updated synchronously; (2) train the basic DCNN on a static face image dataset, use the pre-trained model to extract the feature maps, and then train the aggregation network separately on these feature maps.
This paper chooses the second mode for the experiments, for two main reasons: (1) the focus of this paper is not extracting facial features but aggregating different facial images to obtain more accurate facial representations; (2) training the DCNN module together with the aggregation module makes it harder to analyze the performance of the aggregation module.
Based on the above analysis, we first train the basic DCNN module. To obtain a more accurate feature representation, ResNet101 is selected as the main network architecture. ArcFace directly maximizes the classification boundary in angular space [11], which makes it more suitable than other losses for deep network structures and multi-class output. Therefore, ArcFace is used as the loss function of the network, and it can be expressed as:
For the training of the aggregation module, we choose the basic softmax loss function because the aggregation network is shallow. The softmax loss can be formulated as:
In this section, we first introduce the datasets and protocols. Then, the baseline methods and implementation details are described. Finally, the evaluation results of the FMAN network on three video face recognition datasets (YTF [40], IJB-A [41], and COX [42]) are reported.
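For reference, the ArcFace loss of a single sample can be sketched in Python. In its standard form [11], the target-class logit is s·cos(θ_y + m) and every other logit is s·cos(θ_j), followed by softmax cross-entropy. The sketch takes the cosines between the normalized feature and the normalized class weights as input; the defaults s = 64 and m = 0.5 are the values commonly used with ArcFace and are an assumption here, not necessarily the settings of this paper. The handling of θ_y + m > π used in practical implementations is omitted.

```python
import math


def arcface_loss(cosines, target, scale=64.0, margin=0.5):
    """ArcFace loss for one sample.

    cosines[j] is cos(theta_j) between the L2-normalized feature and the
    L2-normalized weight of class j; target is the ground-truth class index.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding errors
        if j == target:
            c = math.cos(math.acos(c) + margin)  # additive angular margin
        logits.append(scale * c)
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]
```

The margin lowers the target logit, so a sample must be closer to its class center in angle to reach the same loss as under the plain softmax loss.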
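The softmax loss in its standard form is the negative log of the softmax probability assigned to the ground-truth class, averaged over the batch. A minimal Python sketch, with hypothetical function names, is:

```python
import math


def softmax_loss(logits, target):
    """Cross-entropy of the softmax distribution for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def batch_softmax_loss(batch_logits, targets):
    """Mean softmax loss over a batch of logit vectors and class indices."""
    losses = [softmax_loss(z, t) for z, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)
```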
Datasets and protocols
We first train the basic DCNN module on a large static face dataset (MS-Celeb-1M) to obtain the face representation of a single face image. On this basis, the aggregation network is pre-trained on the CASIA-WebFace dataset to learn the aggregation. Finally, for the evaluation on three widely used public datasets (YTF, IJB-A, and COX), the aggregation network is fine-tuned on the training set specified in each protocol and evaluated on the test set.
For the MS-Celeb-1M dataset, the basic DCNN module is trained using conventional methods. For the CASIA-WebFace dataset, we reorganize the training data: since the purpose of the aggregation network is to aggregate multiple face features into a stronger representation, the training data of this module should be groups of face images instead of single images.
In summary, the details of all datasets are shown in Table 2 below.
Table 2. Brief information on the video face datasets

Fig. 5. Loss during training.
To show the advantages of aggregation at the feature map level more intuitively, we design four baseline experiments to illustrate FMAN's aggregation performance: the basic model, average pooling, quality score, and component aggregation. To ensure fairness, all baseline experiments extract features with the basic DCNN model trained for FMAN. In the following paragraphs, we specify how the distance between two groups of face images is obtained in each baseline experiment. Let T1 and T2 be two sets of face images containing N and M images, respectively, and let F be the corresponding feature vector set extracted from T through the basic DCNN.
We evaluate our method on the YTF dataset and compare it with popular methods. The experimental results in Table 3 show that our method achieves an accuracy of 96.64%. FMAN is more accurate than the four baseline methods, which shows that our aggregation method is more effective. Compared with other recent methods, FMAN is second only to InsightFace, which has the highest accuracy. However, InsightFace is an image-based face recognition algorithm and could be embedded in FMAN as the basic DCNN module to obtain higher accuracy. Among aggregation-based recognition algorithms, FMAN improves on the best-performing C-FAN by 0.15%. Fig. 6 shows the ROC curves of FMAN and the four baseline methods.
Table 3. Performance comparison of FMAN and other methods on the YTF dataset (*: face recognition method based on aggregated video or image set)

Fig. 6. ROC curves of different baseline methods and our FMAN.
Table 4. 1:1 face verification performance evaluation on the IJB-A dataset
Table 5. 1:N face recognition performance evaluation on the IJB-A dataset
Table 6. Rank-1 recognition rate under different V2S/S2V settings on the COX dataset
We evaluate our method on the IJB-A dataset with the 1:1 verification and 1:N search protocols. For the 1:1 face verification task, Table 4 shows the relationship between the true acceptance rate (TAR) and the false acceptance rate (FAR). For the 1:N face recognition task, Table 5 shows the relationship between the true positive identification rate (TPIR) and the false positive identification rate (FPIR), as well as the Rank-N accuracy. Additionally, we report the numerical results of the proposed FMAN method and the baseline methods, and compare them with other methods.
According to the experimental data, for 1:1 face verification, FMAN achieves the best performance at FAR = 0.1 and FAR = 0.01. For 1:N face recognition, FMAN performs better than or comparably to other methods on most metrics, and is only slightly lower than DAC at Rank-1. In addition, compared with the results on the YTF dataset, FMAN shows a more pronounced improvement on IJB-A, which has greater face variation (YTF: 1.36% vs. IJB-A: 3.75% at FAR = 0.001). Therefore, FMAN is more suitable for video face tasks with richer variation, such as in face pose.
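To make the verification metric concrete, the following Python sketch computes TAR at a fixed FAR from similarity scores of genuine (same-identity) and impostor (different-identity) pairs. The function name and the tie-handling rule (a pair is accepted only if its score is strictly above the threshold) are assumptions made for illustration and are not taken from the IJB-A protocol.

```python
def tar_at_far(genuine_scores, impostor_scores, far):
    """Return (TAR, threshold) at the operating point given by `far`.

    The threshold is chosen so that at most floor(far * num_impostors)
    impostor pairs score strictly above it; TAR is the fraction of genuine
    pairs scoring strictly above the same threshold.
    """
    impostors = sorted(impostor_scores, reverse=True)
    allowed = int(far * len(impostors))  # impostor pairs that may be accepted
    if allowed < len(impostors):
        threshold = impostors[allowed]
    else:
        threshold = float("-inf")
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores), threshold
```

Lower FAR values force a stricter threshold, which is why TAR at FAR = 0.001 is the most demanding of the reported operating points.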
Table 7. Rank-1 recognition rate under different V2V settings on the COX face dataset
We follow the COX dataset protocol to evaluate the performance of FMAN. Table 6 and Table 7 list the Rank-1 recognition rate of FMAN and other advanced algorithms. In Table 6, Vi-S(S-Vi) indicates that the i-th video set is used as a probe (gallery) and the static face set S is used as a gallery (probe). Similarly, Vi-Vj in Table 7 represents independent experiments conducted with the i-th video set as a probe and the j-th video set as a gallery.
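The Rank-1 recognition rate can be sketched in Python as follows: each probe is matched to its most similar gallery entry, and the match counts as correct when the identities agree. The dot-product similarity and the function names are assumptions for illustration; any similarity between (aggregated) features can be passed in.

```python
def dot_similarity(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank1_rate(probe_feats, probe_labels, gallery_feats, gallery_labels,
               similarity=dot_similarity):
    """Fraction of probes whose most similar gallery entry has the same label."""
    correct = 0
    for feat, label in zip(probe_feats, probe_labels):
        best = max(range(len(gallery_feats)),
                   key=lambda g: similarity(feat, gallery_feats[g]))
        if gallery_labels[best] == label:
            correct += 1
    return correct / len(probe_labels)
```

In a Vi-S setting, the probe features would be the aggregated representations of the videos in Vi and the gallery features those of the static images in S.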
Tables 6 and 7 show that, across the experimental settings, the V2S and S2V recognition tasks are more difficult than V2V. This indicates that the difference between images and videos interferes with recognition and affects the performance of FMAN on the V-S and S-V tasks. As for the performance of FMAN, it outperforms the baseline methods in all experimental settings. Compared with other methods, FMAN performs better in four of the V2S/S2V settings (all except V2-S and S-V2) and in all six V2V settings, where probe and gallery share the same distribution.
Conclusions
To better realize unconstrained video face recognition, a feature map aggregation network (FMAN) is proposed. FMAN adaptively aggregates the feature maps of each feature channel in the image group. The aggregated feature map has the same form and meaning as the multi-channel feature map of a single image, but contains more comprehensive information. By reducing the dimension of each channel after aggregation, multiple images can be represented as a compact feature vector. The proposed method achieves competitive performance on the YTF, IJB-A, and COX datasets, confirming its effectiveness. The ablation experiments also confirm that, because features at the feature map level are more independent and more parameters can be learned, aggregation at this level is more conducive to aggregating face features.
Furthermore, since aggregated feature vectors and image-based feature vectors share the same feature space, the proposed FMAN can be uniformly applied to the tasks of image, image set, and video face representation. Our future work will try to use a more precise attention mechanism at the feature map level to improve the aggregation performance.
Acknowledgements
This work was supported by Nondestructive Detection and Monitoring Technology for High Speed Transportation Facilities, Key Laboratory of Ministry of Industry and Information Technology, the Fundamental Research Funds for the Central Universities (No. NJ2022016), and the Natural Science Research of Jiangsu Higher Education Institutions of China (Grant No. 22KJD140004).
