Abstract
In this paper, an enhanced discriminative feature learning (EDFL) method is proposed to address single sample per person (SSPP) face recognition. With a separate auxiliary dataset, EDFL integrates Fisher discriminative learning and domain adaptation into a unified framework. The separate auxiliary dataset and the gallery/probe dataset come from two different domains (named the source and target domains, respectively) and have different data distributions. EDFL is modeled to transfer the discriminative knowledge learned from the source domain to the target domain for classification. Since the gallery set with SSPP contains very few samples, it can hardly represent the data distribution of the target domain accurately, which hinders the adaptation effect. To overcome this problem, the generalized domain adaptation (GDA) method is proposed to realize good overall domain adaptation when one domain contains limited samples. GDA considers the global and local domain adaptation effects at the same time. Further, to guarantee that the learned domain adaptation components are optimal for discriminative learning, domain adaptation and Fisher discriminant model learning are unified into a single framework, and an efficient algorithm is designed to optimize them. The effectiveness of the proposed approach is demonstrated by extensive evaluation and comparison with several state-of-the-art methods.
Introduction
Single sample per person (SSPP) face recognition is a great challenge in artificial intelligence and pattern recognition: only a single sample per person is available in the training set, so approaches developed for multiple samples per person (MSPP) are not effective. In real-world applications, SSPP face recognition is both realistic and important. In such scenarios, the training face samples are limited, and it is difficult to collect more samples per person. For example, in applications such as passport identification, law enforcement, and gate ID identification, only one profile face image per individual is archived in the database. This leads to the SSPP problem. Numerous techniques have been designed specifically for SSPP face recognition. These approaches can be categorized into three main classes: image partition [2, 7], virtual sample generation [25, 26] and generic learning [17, 28].
To address SSPP face recognition, in this paper we employ an auxiliary set, which may differ from the gallery set, and propose the enhanced discriminative feature learning (EDFL) method. The proposed EDFL incorporates generic learning and domain adaptation. In EDFL, the auxiliary set is treated as the source domain, and the gallery and probe sets are treated as the target domain. EDFL aims to learn discriminative knowledge from the source domain that is optimal for the classification task in the target domain.
Related work
As the proposed method is also a generic learning method, we provide a detailed survey of the third class. Interested readers are encouraged to consult [11, 16] for detailed reviews of image partition and virtual sample generation for SSPP. Essentially, the image partition-based and virtual sample generation-based methods make the best use of the one face sample available for each person. Generally, these methods cannot handle the complex variations common in faces, such as extreme expression and pose, varied illumination and occlusion. In addition to the gallery set, generic learning-based methods utilize a new training data set, i.e., a generic set, which includes multiple labeled samples for each subject. With the generic set, the intra-personal as well as the inter-personal variations of each subject can be explored in the training process. The generic set used in generic learning-based methods can be either dependent on or independent of the gallery set. Depending on the relationship between the generic set and the gallery set, we further divide the generic learning-based methods into two sub-classes, i.e., approaches utilizing dependent generic sets and approaches utilizing separate generic sets.
Deng et al. [5] proposed extended sparse representation based classification (ESRC) to deal with SSPP face recognition. The main idea of ESRC is to build an intra-class variant dictionary from the generic training set; the dictionary can then be used to encode the possible variations between the gallery and test images. Li et al. [14] designed a sparse representation model which ensures not only the sparsity of the representation coefficients but also robustness to the variational information. Aiming at encoding the occlusion information in SSPP face recognition, Yu et al. [31] developed a discriminative multi-scale sparse coding technique via a dictionary learned from the generic set. Wei et al. [22] proposed the robust auxiliary dictionary learning (RADL) framework, which simultaneously achieves auxiliary dictionary learning and robust sparse coding. Gao et al. [7] presented a regularized patch-based representation for SSPP to encode possible variations in the test images. These methods assume that the generic set and the gallery set share similar intra-class and inter-class variations, which does not necessarily hold in all cases. For instance, the generic face dataset may be a new database containing face images captured from different ethnicities or under different imaging conditions compared to the gallery set. In such situations, the above-mentioned methods may not perform well.
To advance the generalization of generic learning, methods employing a separate generic dataset (i.e., a new auxiliary set independent of the gallery) have been proposed. As the auxiliary set is distinct from the gallery set, the relationship between them must be considered. For example, both Su et al. [21] and Kan et al. [12] proposed to infer the within-class and between-class scatter matrices of the gallery set by using the intra-class and inter-class scatter matrices computed from the auxiliary set. While the former work adopts a coupled linear representation, the latter leverages k-NN regression or Lasso regression to characterize the relationship. Besides, dictionary learning has also been explored. Yang et al. [27] proposed sparse variation dictionary learning (SVDL) to learn a dictionary adapted to the gallery set.
In order to capture the non-linear variations caused by poses, Mokhayeri and Granger [17] proposed a paired sparse representation model that jointly uses a variational dictionary and a gallery dictionary augmented with a set of synthetic images. Recently, Cuculo et al. [3] adopted large datasets of images acquired in the wild and proposed a sparsity-driven sub-dictionary learning framework to handle the various variations in the probe faces. Deep learning frameworks based on convolutional neural networks (CNNs) have also been developed to address SSPP face recognition. These methods naturally require a larger, independent generic dataset to train the model. Yang et al. [28] proposed to extract local adaptive convolution features of images by embedding joint and collaborative representation into a CNN framework. Further, Yang et al. [29] developed a novel class-level joint representation framework. As data augmentation techniques, generative adversarial networks (GANs) are also used for addressing SSPP face recognition. Zhang et al. [32] proposed to learn illumination variations with GANs, and developed a model that specifically deals with SSPP face recognition under different illuminations. Similarly, Wu et al. [24] proposed multi-domain dictionary learning (MDDL), which employs GANs to learn different data styles, such as smiling faces and occluded faces, from the source domain data, and augments the target data with these learned styles. Our work is somewhat similar to MDDL, since both use domain adaptation to capture the probe face variations. Nevertheless, the proposed EDFL differs from MDDL in how domain adaptation is achieved. MDDL adds the styles learned from the source domain to the gallery set, so that the target domain has styles similar to those of the source domain, whereas EDFL uses learned projections to map both the source domain and the target domain into a common subspace, in which the discrepancy between the data distributions of the two domains is minimized.
Motivation and contribution
Motivation
Most of the above generic learning methods rely on dictionary learning to characterize the possible variations of the probe faces, which requires that the gallery dataset and the auxiliary dataset share the same or a similar data distribution. In practice, however, a separate dataset that is independent of the gallery set is more readily available. When such generic sets have distributions quite different from that of the gallery set, the performance drops dramatically. Kan et al. [13] demonstrated that the recognition accuracy drops from 96% to 59% when a face recognition model trained on Mongolian face images is used to recognize Caucasian faces. This is because the two datasets have different distributions, i.e., they come from distinct domains (named the source and target domains, respectively). Consequently, the knowledge learned from the original subspace of the generic set (source domain) might not be the best for matching the gallery and probe images (target domain).
To make full use of the information embedded in the source domain, reducing the discrepancy between the source and target domains is essential. Domain adaptation, or transfer learning, is a popular technique for handling the distribution mismatch between two domains. Moreover, to achieve the goal of face recognition, discriminative learning is indispensable. As a result, there are two subspaces to be learned: one for domain adaptation and the other for discriminative learning. Motivated by [9], learning the two subspaces simultaneously may make each of them best suited to the other and thus produce good face recognition performance. Therefore, in this paper, we integrate domain adaptation and Fisher discriminative learning in a unified framework to address SSPP face recognition.
Contribution
The main contribution of this work is as follows. Firstly, we design a generalized domain adaptation (GDA) model to realize good overall domain adaptation between the gallery set containing limited samples (target domain) and the auxiliary set (source domain). Since the gallery set of SSPP face recognition contains only one sample per person, it is difficult to accurately represent the real distribution of face images in the probe set (target domain). The proposed GDA seeks both the local and global structure information across domains to minimize the distribution mismatch between the auxiliary set and gallery set. GDA guarantees that the discriminative information exploited from the generic set can be compatible with the gallery and probe set. GDA is helpful to achieve better domain adaptation for situations where one domain has too limited samples to characterize the latent real distribution of the domain. Secondly, we consider the discriminative learning of the generic set, and integrate it with GDA into a unified learning framework to deal with SSPP face recognition, i.e., enhanced discriminative feature learning (EDFL). In EDFL, domain adaptation and discriminative learning can be accomplished simultaneously. Thirdly, we conduct extensive analysis and evaluation of EDFL to show its effectiveness. We also demonstrate that EDFL works well on deep features.
The remainder of this paper is organized as follows. Section 2 presents the details of the proposed EDFL method, including its formulation and optimization. In Section 3, we evaluate EDFL on three widely used databases under two scenarios, i.e., (i) enhanced discriminative feature learning across ethnicity and (ii) enhanced discriminative feature learning across imaging condition. Finally, we conclude the work in Section 4.
Enhanced discriminative feature learning
Notations
Let
Generalized domain adaptation
In general, face images in the source domain and the target domain are captured from different ethnicities, under different periods or different imaging conditions. This causes the two domains to have different feature spaces or different marginal probability distributions. To reduce the domain difference, Pan et al. [19] proposed transfer component analysis (TCA) to project source and target data onto a learned transferring subspace for domain adaptation. TCA diminishes the distance between different distributions across domains by minimizing the maximum mean discrepancy of transformed features in the latent subspace, meanwhile preserving the property of source and target data by maximizing the data variance. The formulation of TCA can be expressed by the following:
In Equation (1),
The gallery set of SSPP face recognition comprises only one sample per person, so it lacks images covering the various variations of faces. Therefore, the mean of the target training set cannot characterize the distribution of real face images well. In such a situation, TCA fails to reduce the distribution mismatch between the two domains, and domain adaptation is far from being realized. To address this issue, we propose the generalized domain adaptation (GDA) method, which utilizes partial information exploited from the source domain data to re-estimate the real mean feature of the target domain, and reduces the distribution mismatch between the two domains both globally and locally. Figure 1 illustrates the pipeline of GDA. It consists of two parts, which are designed to realize global and local domain adaptation, respectively.

The pipeline of GDA (Take κ = 3 as an example).
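To make the role of the domain means concrete, the following sketch computes the linear-kernel maximum mean discrepancy, i.e., the squared distance between the (optionally projected) means of two sample sets. It only illustrates the kind of quantity that mean-matching methods such as TCA reduce and is not a reproduction of Equation (1); the function name `linear_mmd`, the column-wise sample layout and the toy data are assumptions made for illustration.

```python
import numpy as np

def linear_mmd(Xs, Xt, R=None):
    """Squared distance between the domain means (linear-kernel MMD).

    Xs, Xt: arrays of shape (d, n_s) and (d, n_t), one sample per column
            (layout is an assumption of this sketch).
    R:      optional projection of shape (d, p); no projection if omitted.
    """
    if R is not None:
        Xs, Xt = R.T @ Xs, R.T @ Xt
    diff = Xs.mean(axis=1) - Xt.mean(axis=1)
    return float(diff @ diff)

# Toy data: two Gaussian domains whose means differ by 0.5 per dimension.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(5, 200))
Xt = rng.normal(0.5, 1.0, size=(5, 200))
print(linear_mmd(Xs, Xt))   # discrepancy between the two domains
print(linear_mmd(Xs, Xs))   # identical sets have zero discrepancy
```

With only one image per subject in the target set, the target mean in this quantity is estimated from very few samples, which is the weakness that motivates GDA.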
In Fig. 1, Steps ① to ④ generate virtual face images from the gallery set in the target domain and the auxiliary dataset in the source domain, thereby augmenting the sample size of the target domain. Figure 2 shows four gallery faces (the upper two and the bottom two are from the FERET and LFW datasets, respectively) and their corresponding virtual face images; each person has four virtual images and each row represents one subject. The virtual face images are produced when the CAS-PEAL dataset is used as the source data. The reason for generating virtual face images is to infer the possible face variations in the target domain. Specifically, as the gallery set of SSPP can hardly represent the distribution of the target domain, we utilize the face variations in the source domain to roughly infer the variations in the target domain, so that we can also roughly estimate the data distribution of the target domain. The distribution estimated from the virtual images is obviously not the same as the underlying true data distribution of the target domain. However, as the virtual images do contain some face variations, the estimated distribution should be more reasonable than the one reflected by the original gallery set, which contains a single sample per person. Therefore, it is helpful for global domain adaptation.

The gallery face images (1st column) and corresponding virtual face images (2nd –5th column).
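The augmentation idea, adding source-domain face variations to each gallery image, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact procedure of the paper: a variation is taken to be a source sample minus its class mean, and the κ variations for a gallery image are assumed to come from the κ source subjects whose class means are nearest to it, one variation per subject. The function `make_virtual_samples` and its arguments are hypothetical names.

```python
import numpy as np

def make_virtual_samples(Xs, ys, Xt, kappa):
    """Augment a single-sample gallery with source-domain variations.

    Xs: (d, n_s) source samples, ys: (n_s,) integer subject labels,
    Xt: (d, c2) gallery, one column per subject.
    Returns virtual samples of shape (d, c2 * kappa) and, for each
    virtual sample, the index of the gallery subject it belongs to.
    """
    classes = np.unique(ys)
    # Class mean of every source subject, shape (d, c1).
    means = np.stack([Xs[:, ys == c].mean(axis=1) for c in classes], axis=1)
    virtual, owner = [], []
    for j in range(Xt.shape[1]):
        x = Xt[:, j]
        # Assumption: choose the kappa source subjects with the nearest class means.
        dist = np.linalg.norm(means - x[:, None], axis=0)
        for idx in np.argsort(dist)[:kappa]:
            members = Xs[:, ys == classes[idx]]
            # Assumption: use the deviation of the first sample from its class mean.
            variation = members[:, 0] - means[:, idx]
            virtual.append(x + variation)
            owner.append(j)
    return np.stack(virtual, axis=1), np.array(owner)

# Toy usage: 4 source subjects with 3 samples each, 2 gallery subjects.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(8, 12))
ys = np.repeat(np.arange(4), 3)
Xt = rng.normal(size=(8, 2))
XT, owner = make_virtual_samples(Xs, ys, Xt, kappa=3)
print(XT.shape)  # (8, 6)
```

Each gallery subject ends up with κ virtual samples that inherit its identity, so the augmented target set contains multiple samples per subject.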
Specifically, the proposed GDA works as follows.
Compute the class mean
Obtain the sample variation
For each single sample
Use $X_t$ and 'Variations' of source domain to create an augmented dataset X(T) in the target domain, each sample in X(T) is obtained by
Global domain adaptation: Use the new dataset in the target domain X(T) and the original dataset in source domain $X_s$ to seek global domain adaptation. Note that there are multiple samples per subject in each domain.
Local domain adaptation: Use the new dataset $X(S) = [m_1, m_2, \cdots, m_{c_1}]$ in the source domain and the original dataset $X_t$ in target domain to realize local domain adaptation. For the $i$-th sample in $X_t$, let
Generalized domain adaptation: By combining global and local domain adaptation in the above, the GDA model can be formulated as the following:
where
The first and second terms of Equation (3) reduce the distance between the global means of the two domains and the distances between all coupled local means across the two domains, respectively. The left two terms of Equation (4) preserve the maximum covariance of the two domains globally and locally, respectively. Consequently, GDA minimizes the distribution mismatch while preserving the maximum covariance of the two domains from both the global and the local perspective.
Note X(T) is an augmented dataset of $X_t$, each sample in X(T) is obtained by adding a variation
Hence, in the global domain adaptation, GDA appends the variations to the target domain, i.e.,
After obtaining the transformation matrix R for domain adaptation, we can get the new representations for source and target data in the latent transferring subspace by $Y_s = R^T X_s$, $Y_t = R^T X_t$. In the latent subspace, $Y_s$ and $Y_t$ have the same marginal distribution or own the same feature subspace. To further conduct classification task, we can learn a discriminative subspace
In Equation (6–8),
Note that the discriminative subspace W is learned after projecting the original data into the learned transferring subspace R, i.e., R and W are learned sequentially. This scheme has one problem: because W is learned only after R has been fixed, there is no guarantee that the new representations obtained by the transferring matrix R are optimal for generating the discriminative subspace. To solve this problem, we propose to combine the learning of R and W into a unified framework, which learns R and W simultaneously so that the learned R is best for W, and vice versa. Also, to make the discriminative subspace W robust to outliers and noise, we adopt the ℓ2,1 norm of W as a sparsity regularizer for feature selection. By considering joint generalized domain adaptation and discriminative subspace learning, we formulate the objective function of EDFL as
In Equation (9), γ is a regularization coefficient controlling the extent of feature selection. A larger value of γ means that more rows of W shrink to zero, so that fewer features are selected.
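For illustration, the ℓ2,1 norm of a matrix is the sum of the Euclidean norms of its rows, which is why penalizing it drives entire rows of W toward zero. The short sketch below computes the norm and lists the rows that survive; the helper names `l21_norm` and `selected_rows` and the tolerance are assumptions made for this example and are not part of the EDFL algorithm.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W."""
    return float(np.linalg.norm(W, axis=1).sum())

def selected_rows(W, tol=1e-8):
    """Indices of rows (features) whose norm exceeds a small tolerance."""
    return np.flatnonzero(np.linalg.norm(W, axis=1) > tol)

W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 2.0]])
print(l21_norm(W))       # 7.0  (row norms 5 + 0 + 2)
print(selected_rows(W))  # [0 2]
```

A row of W that is entirely zero means the corresponding input feature is ignored by every projection direction, which is the sense in which the regularizer performs feature selection.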
To make the objective function solvable, we relax the above optimization problem to be the following one:
As the optimization problem of Equation (13) is not jointly convex in all variables, we adopt an alternating optimization strategy that iteratively solves for the projection matrices (R, W, Q) until the objective value of Equation (13) converges. The problem can be solved by the augmented Lagrangian multipliers (ALM) method. The detailed solution process can be found in the supplementary material.
For clarity, the overall algorithm for solving the EDFL optimization problem is summarized in Algorithm 1. The objective value of Equation (13) monotonically decreases as the number of iterations increases, and usually converges within 20 iterations in our experiments. By unifying R and W into the same optimization problem and optimizing them alternately, we can ensure that the transferring subspace R is optimal for transfer learning, while W is the best discriminative projection for the new representations in the transferring subspace. For samples in $X_t$, the features are extracted by $fea(t) = W^T R^T X_t$, and
Datasets
FERET [20] dataset. It consists of 13,539 facial images corresponding to 1,565 subjects, who are diverse in ethnicity, gender, and age. Most of the face images in the FERET dataset are of Caucasian individuals. In the following experiments, we select five subsets from the FERET dataset, i.e., the training set, fa, fb, dup1 and dup2. The training set contains 540 images from 180 individuals, with 3 images per individual; fa includes 994 images of 994 persons, one image per person; fb contains 812 images captured under another facial expression; dup1 includes 736 face images captured in a different period; dup2 contains 228 images, is a subset of dup1, and its images were captured over the period of one year. The FERET data subset can be used as either source data or target data in the experiments. If it is used as source data, only the training set is selected; if it is used as target data, fa is selected as the gallery set, while fb, dup1 and dup2 are used as three probe sets. Sample images in this database are shown in Fig. 3(a).

Sample images from (a) FERET database, (b) CAS-PEAL-R1 database, (c) LFW database.
CAS-PEAL-R1 [8] (CAS-PEAL in the following) database. It is the largest Chinese face database for training and evaluating face recognition methods. The face images in this data set were taken with various variations such as expression, accessory, lighting, etc. The training set consists of 1200 images of 300 subjects, 4 images for each individual; the gallery set contains 1040 images from 1040 subjects with a normal condition; for the probe set, five subsets are selected in our experiments: accessory (2285 images), expression (1570 images), lighting (2243 images), background (553 images) and distance (275 images) subset. When the CAS-PEAL dataset is selected as source data, it means the training set is used; whereas when it is selected as target data, it indicates that the gallery set and five probe sets are used for evaluating face recognition methods. Figure 3(b) shows some sample images in this database.
LFW [10] dataset. It contains images of 5,749 individuals taken in an unconstrained setting. Face images in the LFW database involve diverse variations such as pose, scale, lighting, hairstyle, expression, and partial occlusion. The complex image capturing environments and the inaccurate alignment of faces make the LFW data quite challenging for face recognition in the SSPP setting. LFW-a is a subset of the LFW dataset, and the images in LFW-a have been aligned with a commercial software tool. From this subset, we gather the subjects with no fewer than ten samples and obtain a dataset with 158 subjects. The LFW subset is used as either the source training set or the target probe set. Since there are no frontal face images in this dataset, the mean face of each person is used to construct the gallery set. We illustrate some sample images in Fig. 3(c).
In this work, we focus on the unsupervised domain adaptation, hence the source domain training data is labeled, while the target domain training data is unlabeled. In the evaluation of face recognition methods, the nearest neighbor classifier with cosine distance is adopted for classification, and rank-1 recognition rate is reported for each experiment.
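As an illustration of the evaluation protocol, the sketch below computes the rank-1 recognition rate with a nearest neighbor classifier under cosine distance. It assumes features are stored one sample per column and that no feature vector has zero norm; the function name `rank1_accuracy` and the toy data are hypothetical.

```python
import numpy as np

def rank1_accuracy(gallery, gallery_ids, probe, probe_ids):
    """Rank-1 rate of a cosine-distance nearest neighbor classifier.

    gallery: (p, c) features, one column per gallery subject.
    probe:   (p, n) features, one column per probe image.
    Assumes every column has a nonzero norm.
    """
    g = gallery / np.linalg.norm(gallery, axis=0, keepdims=True)
    q = probe / np.linalg.norm(probe, axis=0, keepdims=True)
    sim = g.T @ q                                # cosine similarity, shape (c, n)
    pred = gallery_ids[np.argmax(sim, axis=0)]   # smallest cosine distance
    return float(np.mean(pred == probe_ids))

# Toy example: two gallery subjects, three probe images.
gallery = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
gallery_ids = np.array([10, 20])
probe = np.array([[2.0, 0.1, 0.9],
                  [0.1, 3.0, 1.0]])
probe_ids = np.array([10, 20, 10])
acc = rank1_accuracy(gallery, gallery_ids, probe, probe_ids)
print(acc)  # two of the three probes are matched correctly
```

Maximizing cosine similarity is equivalent to minimizing cosine distance, so the argmax over similarities gives the nearest gallery subject.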
In the experiments, each database is treated as either the source data or the target data. The target data contains gallery set and probe set, the source data is seen as the auxiliary dataset or generic set. For SSPP face recognition, the gallery set together with the auxiliary set is used for training the EDFL model, then the model is tested on the probe set.
In all experiments, the face images are aligned according to the labeled eye locations and then resized to 64×64 pixels for all datasets. PCA is applied to reduce the dimensionality of the raw input features, with 99.99% of the energy retained. As the first 3 principal components of PCA usually contain more information about variations, we eliminate them in the following experiments to improve the recognition performance. Please note that PCA is learned from the source data and then directly applied to the target data. The parameters of all involved methods are tuned to report the best results unless otherwise specified. In GDA, two parameters, i.e., κ and k, are used to find nearest neighbors. We set $\kappa = \lceil c_1 l / c_2 \rceil$, where $\lceil \cdot \rceil$ denotes rounding up to an integer. This value of κ makes the sizes of the datasets in the source and target domains nearly equal. We set k = 30 in all experiments. The parameters in the proposed algorithm, i.e., μ, λ and γ, are empirically set to 1, 0.01 and 1000, respectively.
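The preprocessing described above can be sketched as follows. This is a minimal illustration under assumptions: PCA is computed by an SVD of the mean-centered source data, the target data are centered with the source mean, and in `kappa` the symbols are read as c1 source subjects with l samples each and c2 gallery subjects. The function names `fit_pca`, `apply_pca` and `kappa` are hypothetical.

```python
import math
import numpy as np

def fit_pca(Xs, energy=0.9999, drop_first=3):
    """Learn PCA on source data Xs of shape (d, n), one sample per column.

    Keeps the smallest number of leading components whose cumulative
    energy reaches `energy`, then discards the first `drop_first` of them.
    Returns the source mean (d, 1) and the projection P (d, p).
    """
    mean = Xs.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(Xs - mean, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    keep = min(int(np.searchsorted(ratio, energy) + 1), len(s))
    return mean, U[:, drop_first:keep]

def apply_pca(mean, P, X):
    """Project data with the source-learned PCA (source mean is an assumption)."""
    return P.T @ (X - mean)

def kappa(c1, l, c2):
    """Number of virtual samples per gallery subject, rounded up."""
    return math.ceil(c1 * l / c2)

rng = np.random.default_rng(0)
Xs = rng.normal(size=(20, 50))   # toy source data
Xt = rng.normal(size=(20, 5))    # toy target data
mean, P = fit_pca(Xs)
Zt = apply_pca(mean, P, Xt)
print(P.shape, Zt.shape)
print(kappa(300, 4, 994))  # 2, e.g. CAS-PEAL training set vs. FERET fa gallery
```

Under this reading, κ times the number of gallery subjects roughly matches the number of source samples, which is the balance the text asks for.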
Analysis of EDFL
To measure the effectiveness of EDFL for the SSPP face recognition task, we have conducted three groups of experiments: F⟶C, C⟶L and L⟶F. F⟶C denotes that the FERET and CAS-PEAL datasets are used as the source and target domains, respectively. Similarly, C⟶L denotes that the CAS-PEAL and LFW datasets are used as the source and target domains, respectively, and in the L⟶F group of experiments, the LFW and FERET datasets are used as the source and target domains, respectively.
The effect of generalized domain adaptation (GDA)
We validate the necessity of GDA for SSPP face recognition by investigating two designs, i.e., EDFL using GDA and EDFL without using GDA (denoted by EDFL_wot). Figure 4 displays the respective recognition rates. As seen in the figure, EDFL outperforms EDFL_wot significantly. For EDFL_wot, the global information of the gallery set in the target domain alone cannot capture the distribution of the target domain accurately, which leads to poor adaptation across domains. In contrast, EDFL with GDA achieves good domain adaptation, which boosts its performance on SSPP face recognition.

The recognition rate of EDFL with and without using generalized domain adaptation.
In EDFL, domain adaptation is achieved by global and local domain adaptation. The local domain adaptation is one novelty of this work, and it plays an important role in the overall domain adaptation. The parameter k involved in EDFL determines the size of the local domains, which affects the effectiveness of domain adaptation. To test the influence of k on the recognition performance, we let k vary from 20 to 300 in increments of 20 and investigate the changes in the recognition accuracy of EDFL on three databases. The results are shown in Fig. 5. It can be observed that, as k increases, the performance roughly increases first and then decreases or remains stable. For all three databases, EDFL achieves promising results when k = 120. This implies that local domains of this size are sufficient to cover the local structure information across domains.

The performance of EDFL w.r.t. different value of k.
Since the optimization problem is solved in an alternating fashion, and the sub-optimizations of W and Q are also iterative, we have explored the convergence of the algorithms involved in the optimization. While it is hard to prove convergence theoretically, we provide an empirical evaluation. Here, we show the objective values corresponding to the optimization of W, of Q and of the overall EDFL. We also display how the recognition accuracy of EDFL varies with the iterations of Algorithm 1. The results are shown in Fig. 6. As seen, all algorithms converge steadily after a number of iterations. Specifically, the sub-optimizations of W and Q converge within 10 iterations, the overall optimization of EDFL converges in about 5 iterations, and the recognition performance of EDFL becomes stable in fewer than 10 iterations.

The convergence curve of sub-optimization of W, Q and over-all optimization in EDFL on three databases.
The main computational cost of Algorithm 1 is spent on the eigen-decomposition (line 4) and the gradient computation (line 14). We measure the training runtime on a PC with an Intel Core(TM) i7 2.3 GHz CPU, 16 GB RAM and Matlab 2015a. The maximum number of iterations in the algorithm is set to 30. The CAS-PEAL database is treated as the target domain, whose gallery set contains 1040 samples. When the FERET (540 training samples) and LFW (1580 training samples) databases are employed as the source data, the training time of EDFL is 65.11 s and 109.83 s, respectively, whereas the highest training time among the other compared generic learning methods is 52.25 s and 90.26 s, respectively (the DTL method in both cases). Thus, the training cost of the proposed EDFL is relatively high, and EDFL has no runtime advantage over the other methods.
Given this training cost, when EDFL is deployed widely, we recommend using it for closed-set face recognition, where the training process can be done offline.
To further validate our method, we conduct more experiments and compare it with several existing approaches designed for the SSPP face recognition problem. All of the compared approaches are generic learning based, i.e., they adopt an auxiliary dataset containing multiple samples per person to address the SSPP problem. They are Fisher’s Linear Discriminant (FLD) [1], Adaptive Generic Learning (AGL) [21], Adaptive Discriminative Learning (ADL) [12], Extended Sparse Representation Classification (ESRC) [5], Sparse Variation Dictionary Learning (SVDL) [27], Discriminative Transfer Learning (DTL) [9], Low-rank Regularized Generic Representation with Block-Sparse Structure (LRGR-BSS) [15] and k-Nearest Neighbor based Multi-manifold Discriminant Learning (kNNMDL) [6].
Additionally, we compare the above methods with two state-of-the-art deep learning techniques, i.e., RegularFace [33] and ArcFace [4]. These two techniques are not specifically designed for SSPP face recognition but can be used for it. We do not modify the network structures of RegularFace and ArcFace, and pre-train them on the CASIA-Webface [30] dataset. We use the two deep learning methods in the following two ways. (a) We feed the gallery set and probe set to the pre-trained RegularFace and ArcFace networks to extract deep facial features, then conduct recognition with a 1-NN classifier. (b) We feed the generic set, gallery set and probe set to the two pre-trained deep networks to extract deep features, then pass these features directly to DTL and EDFL for SSPP face recognition, which yields four methods, denoted by RF-DTL, AF-DTL, RF-EDFL and AF-EDFL, respectively.
Performance comparison on FERET database
We first treat the FERET database as the target domain, and use the CAS-PEAL training set and the LFW subset as source data to simulate enhanced discriminative feature learning across ethnicity and across imaging condition, respectively. Note that most of the face images in the FERET database are of Caucasian individuals, though it includes diverse ethnicities. We use the FERET database to roughly represent an ethnicity that is different from Chinese people. Tables 1 and 2 list the recognition rates of various methods on the FERET database when using a generic dataset across ethnicity and across imaging condition, respectively.
Performance comparison on FERET database when CAS-PEAL database is used as source data
Performance comparison on FERET database when LFW database is used as source data
As observed from the results in the two tables, FLD performs worst since it directly applies the source domain model to the target domain without any adaptation. Among the methods based on non-deep features, those specially designed for SSPP with generic data perform much better than FLD. Though SVDL obtains the best results in some situations, EDFL achieves the highest overall recognition accuracy. Besides, by observing the results in Tables 1 and 2, we find that AGL and ADL obtain lower recognition rates when LFW is used as the source data, though there are 10 samples per person in the LFW database. This reveals that the complex imaging conditions negatively impact AGL and ADL and prevent them from exploiting the within-class variation explored from the 10 samples. By comparison, the other methods can exploit the larger sample size to boost the performance. No matter whether the domain adaptation is performed across ethnicity or across imaging condition, our proposed EDFL achieves competitive results on the FERET database.
Compared with non-deep features, deep features perform clearly better. RegularFace and ArcFace are not specifically developed for SSPP face recognition, but the pre-trained models produce higher recognition rates than the specifically modeled methods based on non-deep features, which reflects the good feature representation of deep networks. With the deep features extracted from RegularFace and ArcFace, both DTL and EDFL improve their recognition performance, and EDFL benefits more from deep features than DTL. Among all the compared methods, AF-EDFL, i.e., EDFL with ArcFace deep features, achieves the highest recognition rates.
As in the evaluation on the FERET dataset, we test each method under two scenarios, i.e., domain adaptation across ethnicity and across imaging condition. The target domain is the CAS-PEAL dataset, which consists of face images of Chinese people. The evaluations on the CAS-PEAL database are shown in Table 3 and Table 4, from which conclusions similar to those above can be drawn. FLD obtains the lowest recognition accuracy, the other methods perform better, and the proposed EDFL performs best among the non-deep methods under both scenarios. Most of the deep-feature-based methods outperform the non-deep methods, but when CAS-PEAL database is used as the generic set, EDFL obtains recognition rates competitive with those of RegularFace and ArcFace. AF-EDFL is still the best method.
Performance comparison on CAS-PEAL database when FERET database is used as source data
Performance comparison on CAS-PEAL database when LFW database is used as source data
Comparing Tables 3 and 4, it can be observed that the choice of generic set has a noticeable impact on recognition performance. In particular, when LFW is used as the generic set, EDFL shows significant superiority over the other methods based on non-deep features. Nearly all approaches obtain much higher accuracy after the generic data is switched from FERET to LFW. The reason might be that the sample size of the FERET generic set is too small (only 3 images per person) to capture sufficient variation.
In real-world applications, the labeled face images are usually collected in a laboratory under controlled conditions (laboratory images), whereas the unknown faces are captured in unconstrained environments (real-environment images). Thus, the ability to use laboratory images to boost recognition performance on real-environment images is quite important for a face recognition algorithm. To assess the performance of different methods in this setting, we conducted two groups of experiments in which the LFW dataset is used as the target domain and the CAS-PEAL and FERET databases are used as source domains, respectively. Table 5 and Table 6 report the corresponding performance comparisons. Among all non-deep techniques, EDFL performs best and outperforms the second-best one by a large margin of about 10 and 7 percentage points in the two experiments, respectively. Moreover, using the CAS-PEAL database as the generic set yields higher recognition rates for all methods.
Performance comparison on LFW database when CAS-PEAL database is used as source data
Performance comparison on LFW database when FERET database is used as source data
However, when the CAS-PEAL database is used as the generic set, the methods specifically designed for SSPP face recognition obtain higher recognition rates than the two deep learning based frameworks, i.e., RegularFace and ArcFace. This indicates again that the choice of generic set affects recognition performance, and that an appropriate generic set may be more important than the model.
As seen from the above evaluation, our proposed EDFL outperforms the other approaches designed for SSPP face recognition. According to Tables 1–6, whether domain adaptation is conducted across ethnicity or across imaging condition, EDFL achieves promising recognition accuracy. EDFL can further boost its accuracy by using features extracted from deep learning frameworks, and it then obtains higher recognition rates than the deep learning methods themselves. Moreover, DTL is the method most similar to the proposed EDFL: both attempt to transfer knowledge from the source to the target domain and to learn a discriminative subspace. However, EDFL obtains better recognition results than DTL in every case. This is because DTL learns only one mapping to accomplish knowledge transfer and discriminative learning simultaneously, and a single mapping may struggle to realize both goals well. The proposed EDFL learns one mapping for domain adaptation and another for discriminative learning, and it unifies the learning of the two mappings into one framework; thus EDFL outperforms DTL on both goals.
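To make the "two mappings" idea concrete, the following is a minimal sketch and not the EDFL algorithm itself. It assumes a CORAL-style covariance alignment as a stand-in for the domain adaptation mapping (it is not the proposed GDA) and a regularized Fisher discriminant for the second mapping. It also runs the two stages one after the other, whereas EDFL optimizes them jointly. All function names and parameters (`coral_adapt`, `fisher_directions`, `recognition_rate`, `eps`, `reg`) are hypothetical.

```python
import numpy as np

def sym_power(C, p):
    """Raise a symmetric positive-definite matrix C to the power p."""
    vals, vecs = np.linalg.eigh(C)
    vals = np.clip(vals, 1e-12, None)  # guard against round-off
    return (vecs * vals**p) @ vecs.T

def coral_adapt(Xs, Xt, eps=1e-6):
    """Mapping 1 (assumed stand-in): re-colour source features so that
    their mean and covariance match those of the target samples."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)
    A = sym_power(Cs, -0.5) @ sym_power(Ct, 0.5)
    return (Xs - Xs.mean(axis=0)) @ A + Xt.mean(axis=0)

def fisher_directions(X, y, dim, reg=1e-3):
    """Mapping 2: regularized Fisher discriminant directions (d x dim)."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + reg * np.eye(d), Sb))
    top = np.argsort(-vals.real)[:dim]
    return vecs[:, top].real

def recognition_rate(W, gallery, gallery_ids, probe, probe_ids):
    """Nearest-neighbour matching of probes to the one-per-person gallery."""
    G, P = gallery @ W, probe @ W
    dists = np.linalg.norm(P[:, None, :] - G[None, :, :], axis=2)
    predicted = np.asarray(gallery_ids)[dists.argmin(axis=1)]
    return float(np.mean(predicted == np.asarray(probe_ids)))
```

In this sketch the discriminant directions are learned on the adapted, labeled source features and then applied to the gallery and probe features of the target domain. A purely mean-based alignment would not serve here, because the Fisher scatter matrices are invariant to translating the source data.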
Conclusions
To cope with face recognition with a single sample per person, in this paper we have proposed a novel enhanced discriminative feature learning (EDFL) method from the perspective of domain adaptation. EDFL employs an auxiliary face dataset in which the face images are labeled and involve diverse variations. To make better use of the knowledge in the auxiliary set, we first propose generalized domain adaptation to minimize the distribution mismatch between the auxiliary set and the gallery set, and then integrate domain adaptation and Fisher discriminative learning into one scheme. Further, we have developed an efficient iterative optimization algorithm to obtain the solution of EDFL. To evaluate the EDFL approach, extensive experiments have been conducted on three databases. In comparison with several methods specially designed for SSPP and with deep learning based methods, the promising recognition performance of EDFL demonstrates its effectiveness for single sample per person face recognition.
In the EDFL framework, we consider only one source domain, i.e., one auxiliary dataset. Although generalized domain adaptation can be applied directly to multiple source domains, it is not guaranteed to yield better performance, as the different source domains may also require adaptation. Moreover, although a good generic set helps improve recognition results, which kind of generic set is best for a specific probe set remains an interesting open problem. In future work, we will address these issues.
Acknowledgments
This work was partially supported by the Humanities and Social Science Research on Youth Fund Project of Ministry of Education of China (Grant No. 19YJC870003), the Natural Science Foundation of Jiangsu Province (Grant No. BK20210576), and sponsored by NUPTSF (Grant No. NY219038, NY220045).
