Abstract
In this paper, we developed a new method to extract semantic facial descriptions by using an Axiomatic Fuzzy Set (AFS)-based clustering approach. Landmark-based geometry features are first used to represent facial components, and then we developed a new feature selection algorithm to select salient features based on feature similarities defined in AFS. Finally, the AFS-based clustering technique was used to extract the high-level semantic concepts. Extensive experiments showed that the proposed method can achieve much better results than the conventional clustering approaches like K-means and Fuzzy c-means clustering (FCM).
Introduction
Content-based image retrieval (CBIR) aims to find images from an image database that closely matches a query given by users. Initially, in CBIR systems a user submits a query image, and the CBIR systems would extract local features, such as color, texture, etc. [1]. And then they try to find possible matches of these local features in the given database. In fact, a more desirable CBIR system from user’s perspective can be achieved by improving its capability to process semantic query, since in many cases the accurate query image may be not available or reliable. The semantic-based image retrieval system allows the user to input a query in terms of natural language expression [2].
Face image retrieval (FIR) is a special type of CBIR, where the goal is to find similar human faces to a query either in a form of image or semantic description. Currently only a very limited amount of work have been done on semantic-based FIR systems due to its difficulty. One main challenge is to find a robust and reliable method which can automatically determine the semantic description of a face image only based on the low-level features and this is known as the “semantic gap” problem [3]. Among existing FIR systems, one existing approach is to use a probabilistic method based on local low-level features, such as color, texture, and high-level semantic labels to re-rank the database [4]. Another approach begins with 24 manually marked key points and associated keywords that characterize each face. Singular value decomposition is applied to create a Latent Semantic Space allowing face images to be retrieved by a semantic query [5]. However such method is not appropriate for large databases as all face images in the database need to be annotated manually. Another approach is to utilize fuzzy-based method for the semantic retrieval of face images. More specifically, it uses fuzzy sets to bridge the semantic-gap between low-level visual features and high-level descriptions and the fuzzy c-means method (FCM) is used to calculate the similarities between different subjects [6]. It is fascinating to use fuzzy sets to narrow the semantic-gap since it is easier to use fuzzy descriptions to represent high level semantic descriptions, such as “big eyes”, “long nose”, etc. In all these works, the high level semantic concept descriptions are required and they are given partially or fully as ground truth information. Such requirement hinders the applicability of these systems since it is hard to obtain the ground truth information for semantic descriptions objectively.
Traditional clustering methods, such as K-means, FCM, Gaussian Mixture Model (GMM), etc., are used for semantic-based clustering [6, 7]. However these approaches are more likely to produce poor results due to the lack of relation with semantics or simple adoption of Euclidean distance [8]. In order to overcome the weakness of those clustering methods, Ren and Liu proposed a novel approach called AFSC [9]. It used a landmark detector to extract facial components with landmark points, such as eye and nose, and then selected salient features based on similarity principle and finally an AFS-based fuzzy clustering method [10, 11] was used to extract semantic descriptions by clustering those facial components. As a by-product of AFS theory, the fuzzy rules were extracted to represent semantic description for defined facial components. However, we found two shortcomings in that work. First, the features used are mainly based on overall size or area and such features contain little shape information. Second, the feature selection algorithm they used is trying to find similar features and discard dissimilar ones. In this case, they may loss part of information and lead to large biases in the clustering process.
In this paper, we are following Ren and Liu’s framework, using facial landmark to detect facial components and clustering them with AFS-based fuzzy clustering method. However, we utilize more features related to local shape information and also propose a new feature selection algorithm based on similarities of features. Also the relationship between the number of features and clustering results is investigated. Finally the clustering results based on the proposed approach are compared with FCM, K-means, and AFSC, and it confirms the superiority of the proposed approach to these approaches. In addition, the achieved stable semantic descriptors possess better semantic descriptions in terms of an area metric and other criteria.
The remainder of this paper is organized as follows: Section 2 briefly reviews related work on facial landmarks detection and AFS theory. Section 3 details the proposed semantic facial descriptor extraction method. Experimental results and the effect of parameter k are discussed in Section 4. And the paper is concluded in Section 5.
Related works
Facial landmarks detection
In order to extract semantic descriptions for each facial component, landmarks on a face are required. Automatic facial landmark detection is thus necessary for our work. The representative classic approaches are Active Appearance Models (AAMs) [12] and elastic graph matching [13, 14]. Recent works focused on local part detectors, known as Constrained Local Models (CLMs) [15]. There are two major tasks in detecting the face landmarks. The first one is face detection, which aims to find face(s) in an image and extract its location. One popular approach is the Viola and Jones face detector [16]. The second task is facial landmarks detection, which aims to extract facial components, such as eyes, nose, mouth and face contour from detected face by putting landmarks on each component. Some recent facial landmarks detectors are reported in [17–19]. Among them, Zhu and Ramanan’s approach [17] is considered as one of the best approaches for facial landmarks detection.
Even through the performance of this model is sufficient for some applications, there are only 68 facial landmark points for frontal face which are not sufficient to describe semantic details of each component. Therefore, Liang and Liu [20] extended the frontal face models from 68 landmark points to 130 landmark points, which can cover the facial components in regions instead of only lines in order to provide better shape information. The results show their model is significantly more accurate on detecting facial landmarks for frontal face images.
AFS theory
It is common for human to describe a face with semantic concepts, such as “a face with large eyes, small nose and round face contour”, while features dealt with by computer always are low-level visual features (e.g. color, texture, and curvature, etc.). This well-known semantic-gap between these two level features is the major challenge for semantic-based image analysis, especially in face image retrieval. On the other hand, fuzzy sets have been proven to be effective in semantic image analysis [21, 22]. It motivates us to utilize fuzzy model to bridge the semantic gap between feature space and high-level concepts.
Fuzzy set theory, which was originally proposed by Zadeh [23], provides a general way to acquire linguistic IF-THEN rules. In conventional fuzzy theory, the membership functions are often given by personal intuition manually and the logic operations are equipped by a kind of triangular norms, or shortly t-norm which is chosen in advance and independent of the distribution of raw data. However, the large-scale intelligence systems in real-world applications are usually very complex, containing such a large number of concepts that it is difficult to define the membership functions manually.
AFS (axiomatic fuzzy set) algebra was proposed by Liu [10, 24–30], and it has been experimentally proved to be a powerful approach for semantic concept extraction and interpretations for fuzzy attribute [9, 32]. The aim of AFS is to explore the possibility of fuzzy set theory and probability theory working in concert, so that uncertainty of randomness and fuzziness for a concept can be treated in a unified and coherent manner. The main idea behind AFS is to transit the extracted information from the observed data into membership functions and implement their logic operations. Researches in [30, 33] defined the membership functions based on the fuzzy logic operations by taking both fuzziness and randomness in account. One advantage for AFS-based clustering algorithm is that it can calculate the membership degree between the feature vectors extracted from the images and the linguistic terms that characterize the semantics. An example to demonstrate the approach of computing the membership functions was presented in [27]. The research monograph [30] offers a comprehensive introduction of AFS theory and its applications.
AFS-based semantic facial description
Our approach starts with an automatical landmark detection and then we extract geometric features from the landmark points, including both global and local features. For the sake of removing redundant features and improving efficiency, we propose an unsupervised feature selection approach. Then a fuzzy clustering method is applied to processed data. With the information of clusters, we extract semantic concepts via fuzzy membership. Figure 1 framework shows the whole framework of the approach.
Feature extraction from landmark points
By noting that all the features (Fig. 2) used in [9] are global features in terms of size, we argue that global features are not enough for the consistent of shape in each cluster. Besides the global features used in their approach, we need some local features that can depict the shape of facial components. Then we define the “star” features for all facial components, in which each feature is the Euclidean distance between the eye’s center and one of the boundary points as shown in Fig. 3 featureExtraction. Note that the “star” feature is similar to one of the global features, centroid distance. But the later one is the sum of Euclidean distance of all features in “star” for eye. And one can also notice from Fig. 3 featureExtraction that the “star” feature we defined for nose is totally different with the centroid distance of nose. Instead of using all boundary points, we only picked up some key points related to the shape of nose. With those local features, we can guarantee the shape is consistent in each cluster, then we also need all the global features to guide the clustering process with expectation to cluster all eyes into three classes eventually with different size. So we combine the five features in Ren’s approach and the proposed “star” feature, in total, 17 features are used in this paper.
Unsupervised feature selection based on fuzzy similarity
After we combined the global and local features, it is obvious that there are some redundant features, especially for the 12 “star” features. One may easily realize we can simply use f1, f4, f7 and f10 to represent the “star” feature of eye. But we found it is hard to find similar useful subset for nose or other components. Then we propose an unsupervised feature selection algorithm. The proposed feature selection algorithm is based on fuzzy similarity in AFS theory. For feature α and β, the similarity between them is defined as follows:
The similarity is motivated by Jaccard similarity coefficient. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. But here instead of using size of sample sets, we are using the fuzzy membership function. Fig. 4 jaccard gives a intuitively graphic explantation of the similarity.
The proposed feature selection mainly involves two steps: (1) Vote each feature by all others based on pairwise similarity, and find the one which is the most similar to others; (2) Depending on the selected features, iteratively add one new feature which is the most dissimilar one to the existing selected features in. Intuitively, the first feature we picked contains the most information, just like principal component. Then, the next step we search the most dissimilar one with existing features, and that is like the orthogonal principal component.
Suppose we have n features, for each feature f
i
(1 ≤ i ≤ n), we can compute the similarity SI
ij
with all features and rank them as f
ij
(1 ≤ j ≤ n) in a descending order. In fact f
ij
is a re-ordering of {f1, f2, ⋯ , f
n
}. Let S
ij
be the score of f
j
based on the index j in the list f
ij
, i.e.,
Fig. 5 illustrates an example of feature voting. Note that the score is the “dissimilarity score”. In this case the “base” feature is defined as the the most similar one with all others, which has the lowest score. Now, let F represent the selected feature set,
After we find the “base” feature, then the rest is done by a greedy process. Technically, we can have all features re-scored based the selected feature set F as stated below. Let N
F
be the size of F, the new score of f
j
is,
Then our aim is to select another feature which is the most dissimilar to what we have already selected, so the new feature f
j
for our selection is,
One can see that if we keep going the above algorithm we will select all features one by one iteratively. Eventually a stopping criteria is required here. One intuitive choice is the score of new added feature should be larger than the “average score” of selected feature set F, i.e.,
Feature similarity matrix Mn×n.
Selected feature set F
1: Let S ij represent the score of feature f j given by feature f i , S j be the total score of f j (refer Eq. (5) and Eq. (4));
2: Sorting M on row by descending order, Scoring each feature based on ranking;
3: j : =1 to n
4: ;
5:
6: i : =1;
7: ;
8: i ← i + 1;
9:
10:
11:
12: S j = 0;
13:
14: ;
15:
16:
17:
18:
19:
20: ;
21: i ← i + 1;
22:
Fuzzy characteristic description for facial components
To utilize AFS clustering, the first step is to transfer the data in feature space to AFS fuzzy space. Define F, the facial component set, F = {lefteye, righteye, nose, mouth . . .}, f ∈ F. Suppose we choose top k features from the feature ranking in the previous step. Then we divide each feature into 3 fuzzy parts with semantic interpretation, “large”, “medium”, “small”, represented by fuzzy terms , where is the jth fuzzy term associating with the ith feature of facial component f.
Let represent the facial component f ∈ F on the kth face image I k . For example, is the right eye on the kth face image I k . Let be the membership degree of belonging to fuzzy term m. Then we select salient fuzzy attributes/terms to represent each facial component significantly, which we call fuzzy characteristic description. For such purpose, we first define a set of fuzzy terms for which can be given as follows:
where , i.e., the membership degrees of belonging to fuzzy terms in should be the largest value among the membership degrees of belonging to fuzzy terms in M
f
. This set is in fact all the best fuzzy terms to characterize this facial component f and then the fuzzy facial component characterization can be given as
where “∨” and “∧” are the fuzzy logic operations in AFS algebra defined in [30], and the semantic logic expressions of “∨” and “∧” are “or” and “and”. The computation details for , and can be found in [27]. The fuzzy terms in can be regarded as the most salient characteristics of (the facial component f of the face image I k ). Thus, all the most salient characteristics are combined to describe as Equation (10). Keep in mind that the whole process is only for each component.
After the data has been transferred to the AFS fuzzy space, similarity of facial component and is defined as follows:
The clusters , which have more than one face images, can be obtained (i.e., the cluster with one single face image is discarded). In this case, we can obtain the initial clusters for Q = (q ij ).
Remember is a characterization of , and also we have obtained an initial clusters . Then we select the best for constructing the semantic description of each cluster based on the initial clusters as follows.
The elements in Γ
i
are some fuzzy characteristic descriptions in the ith cluster (). The motivation of this selection is that only some representative descriptions can be used to represent the semantic descriptions of , and others are not typical enough to represent its cluster or are noise. Therefore, the fuzzy descriptions of some representative face images which can represent their facial component cluster are collected in Γ
i
as Equation (14). All the most salient characteristics in Γ
i
are combined to describe the initial facial component cluster . Consequently, the sematic description of each facial component cluster can be defined as follows:
As explained previously, this set represents the salient description for each initial cluster. One can see that the universality and particularity of the semantic description of can be controlled by ω and λ. The effects of the variation of parameter ω and λ are analyzed in [27], and plenty of experiments illustrate that the algorithm is not sensitive to the setting of the parameters if the parameters are selected in reasonable intervals.
Then all the semantic description of each facial component cluster can be regarded as a classifier, in order to classify all the instances again. This process is called as re-clustering process. Its aim is to revise the initial clusters and cluster the lost instances whose similarities r
ij
with other instances are lower than α. Finally, is re-clustered by measuring the membership degrees of belonging to as follows:
i.e., if the membership degree of belonging to C q is the largest one among the membership degrees of belonging to , then .
Empirical tests, observations, analysis and evaluations are presented in this section. 249 frontal faces from 249 subjects in Session one of Multi-PIE and 134 frontal faces from 134 subjects in session one of AR database are used. Since the face sizes vary in scale, the images were scaled so that the Euclidean distance between the pupil centers were fixed with the average distance of the database. The results are illustrated in two aspects. The first one is semantic extraction, we show the relationship between semantic concepts and low-level features. Each facial component is clustered into three groups, “large”, “medium”, “small”. Obviously “medium” is not a useful semantic concept in terms of applications. It is not so useful to describe a person with “medium eye” or “medium nose”. Therefore we only consider about two salient clusters, “large” and “small”. As for the second two features, we show that the clustering result with the proposed approach is better than those obtained by K-means and FCM in terms of area of facial components and some defined clustering indexes.
Eye clustering
Comparison of semantic concepts
First after the features selection with 17 features, the global feature centroid distance and some “local” features are selected for these two databases as shown in Fig. 6 eyeFeature.
From Fig. 6 eyeFeature we can see that, except for centroid distance, one of the features in Ren’s approach, is selected as the “global” feature, three “local” features are also selected to keep the shape information. This is very important, because the global feature can guide the clustering process while the other local features can help to hold the shape consistent. In addition, those features are selected by an automatic feature selection algorithm, for both Multi-PIE and AR data set, one global feature and three local feature are selected. To illustrate the advantage of the feature selection and eye clustering, the relationships between semantic concepts and those low-level features are shown as below after clustering.
From the above cluster’s description, it is easy to see that the characteristic combinations contain not only the global feature (centroid distance) but also local features related to the height and width of the eye. That can make the clustering better and more consistent. Figure 7 eyeExample shows the improvement of our method in terms of shape. It is obvious that in Ren’s approach, some long and narrow eyes are clustered into large eye group, which is not appropriate.
In order to show the quality of clusters visually, Fig. 8 eyeComparison shows four examples of facial image with large eye in Large Eyes cluster and four in Small Eyes cluster. And the comparison result is very encouraging in that the eyes of persons in Large Eyes cluster are distinctly larger than that in Small Eyes cluster, and also the shape is kept more consistently.
Comparison of clustering results with different clustering approaches
Considering a fact that there is no semantic ground truth data in available data sets to test the accuracy of our results. As we are actually clustering the “size” of the facial components,the area is chosen to be the validity criteria, which is the closest to human perception (i.e., a larger eye is the one with larger area). The area criteria is also adopted in [9]. In this paper, a new class separability is defined for evaluating the clustering performance. Below items are three indexes. The average area of small eyes’ class . is defined as the mean of all subjects in the small cluster, a lower value means the eyes in small cluster are smaller than common people; The average area of large eyes’ class . is the mean of large cluster, a higher value means eyes in large cluster are larger than common people; The class separability CS. CS is defined as the quotient of the minimum inter-class distance and the maximum intra-class distance.
Our proposed method is compared with the conventional clustering algorithms, K-means and FCM, also the novel fuzzy approach AFSC on the face images data set Multi-PIE and AR. Due to the sensitivity to initial conditions, K-means and FCM are performed for 100 times to obtain the average result.
From the Table 1 algorithmComparison_eye we can see, compared to K-means and FCM, our method is much superior for all three indices in both Multi-PIE and AR. That is to say, our method can gain more clear and reliable clusters in terms of the size and shape of the eye. In addition, K-means and FCM can only gain the partition clusters, and no semantic descriptions which can be further applied to semantic FIR system. On the other hand, compared to Ren’s approach AFSC, our method has significantly better performance in Multi-PIE data set and it is slightly superior in AR data set. It is worth to mention that in AFSC, the area of facial components itself is one of the low-level features while same time acting as clustering criteria. Such selection can make the result seems better in bias. While in our algorithm, the feature we used is called “star” features, which are not associated with area directly. In this sense, our approach is more objective.
Nose clustering
We further apply the algorithm to nose clustering, and analyze both semantic results and clustering results as eye clustering. For the clustering performance, we use the evaluation criteria defined in Section 4.1.2.
Comparison of semantic concepts
Figure 9 noseFeature shows us the features selected by our feature selection algorithm for nose clustering. As we can see, two global features centroid distance and height in Ren’s approach and two local features from “star” are selected in AR database, while in Multi-PIE database, centroid distance and four local features are selected. The combination of global and local features are better than the features used in Ren’s approach as global feature can guide the clustering process while the other local features can guarantee the shape is consistent in each cluster.
To prove the superiority of there feature combinations, we present the clustering description, which is the conjunction of samples in clusters, below. Note that the clustering description is the representation of low-level features, and meanwhile in terms of high-level semantics, we can easily know which one is “large nose” while the other is “small nose”. Then semantic concepts are expressed by low-level features and the semantic gap is naturallybridged.
From the clustering description we can see that, not only global features in Ren’s approach [9] are used, but also some local features. In this case, different noses are distinctively grouped into different clusters depends on both size and shape. In order to show the quality of clusters visually, Fig. 10 noseComparison shows four examples of facial image in large nose cluster and four in small noses cluster. And the comparison result is very encouraging in that the noses in large eyes cluster are distinctly larger than those in small eyes cluster, and also the shape is kept more consistently.
To show the improvement of our method from Ren’s approach [9], we show some examples that they failed to cluster into appropriate groups in Fig. 11 noseExamples. It can be seen that, those noses are indeed large in terms of area of some other global features used in their approach, but they are clustered into medium noses as they are not large enough in terms of the local features in our method.
Comparison of clustering results with different clustering approaches
Table 2 presents the performance of different algorithmComparison_nose presents the performance of different algorithms for nose clustering. As we can see, K-means and FCM perform similar results while AFSC performs much better than them on Multi-PIE but similar result on AR. Compared to traditional approaches K-means and FCM, or the novel fuzzy approach AFSC, our method is superior on both Multi-PIE and AR dataset. It can bee seen from all three evaluation criteria. The mean area of small nose group is the smallest one while the mean area in large nose group is much larger than others. And also the class separability tells us the proposed method gives a more meaningful clustering result.
Effect of parameter k
In the proposed algorithm, the size of reduced feature subset and hence, the scale of details of data representation is controlled by the parameter k (refer Equation (8)). Figure 12 parameterK illustrates such an effect on two data sets, Multi-PIE and AR. For certain range of k, it is observed that there is no change in the reduced subset, i.e., no reduction in dimension occurs. However, as expected, the size of the reduced subset decreases overall with increase in k. In this way, the representation of data at different degrees of details is controlled by its choice. This characteristics is useful in many areas where multi-scale representation of the data is often necessary. Note that the said property may not always be possessed by other algorithms where the input is usually the desired size of the reduced feature set. The reason is that changing the size of the reduced set may not necessarily result in any change in the level of details. In contrast, for the proposed algorithm, k acts as a scale parameter which controls the degree of details in a more direct manner.
Here we conduct some experiments to show the change of clustering performance along with the number of selected features. We use two validation indices, V1 and V2. Both of them are related to the indices used in Section 4.1.2. V1 is defined as the difference of and . V2 is the class separability CS. They are normalized to [1] for clear observation.
Figure 13 featurePerformance has shown that the clustering performance is sensitive with the change of number of features. In both Multi-PIE and AR data set, four features gives us the best clustering performance. With more than for features, the performance would decrease. It also proved the significance of our feature selection algorithm that help us to reduce computation cost as well as find the best scale of details of data representation.
Conclusion
In this paper, we developed a new feature selection algorithm to extract semantic facial component descriptions based on AFS fuzzy clustering algorithm. It has shown that the semantic-gap between high-level semantic concepts and low-level visual features has been bridged automatically by the fuzzy descriptions of clusters. Compared to the K-means and FCM, which have no semantic interpretation for their clustering result, the proposed method has a significant advantage. Also the semantic descriptions produced by our method are more consistent and meaningful compared to the novel fuzzy approach AFSSC. Since additional local features are detected, our method could reveal more latent shape cue. On the other hand, the clustering performance also outperforms other methods in both Multi-PIE and AR data set in terms of validation of area and class separability. The proposed feature selection method is proved to be useful, and it gives us an effective tool for multi-scale of data representation and dimension reduction.
