Leaf classification using multiple feature analysis based on semi-supervised clustering

Abstract

Multiple features such as the margin, the shape and the texture of plant leaves are of great importance for classification of plant species, as they are often regarded as the unique features to identify plants. In this paper, we study the performance of a recently proposed semi-supervised fuzzy clustering algorithm with feature discrimination for leaf classification, based on features generated by principal component analysis of color images. The method outlines a basic framework for judging the weights of different features: multiple feature matrices obtained from the initial images serve as input data, and the clustering results of the algorithm serve as output data to distinguish dissimilarities between various leaves. Real leaf images are employed to evaluate its performance. The experimental results suggest that the margin feature, the shape feature and the combination feature, and especially the margin and combination features, may be the best choice for leaf classification.

Keywords

Semi-supervised clustering, leaf classification, multiple features, performance analysis, pairwise constraints

1 Introduction

Multiple features, including the margin, shape and texture of plant leaves, play an important role in plant identification, as they are often representative of the unique biological characteristics of the leaves between various species. The traditional approach to classifying plant species is to assign semantic labels to them manually [1]. However, this method is time consuming and demands lots of specialist knowledge. So a novel convenient method would be of great benefit to plant biologists and botanists, and many recent studies have explored image analysis based plant leaf recognition technologies.

The plant leaf is commonly a critical component of plant identification, and leaves can be recognized by their color, margin, shape and texture (among other features). However, many plant species share the same green color, which may also vary with season and temperature, so the other features are more reliable as a means to identify species. Furthermore, most of the features mentioned above can be easily extracted from images through digital image processing techniques.

The main idea of semi-supervised clustering algorithms is to use different kinds of prior knowledge (prior cluster membership or constraints) to improve the clustering performance. Semi-supervised clustering algorithms integrate the advantages of both supervised and unsupervised algorithms with less human effort, and feature appropriate interaction and adaptable accuracy by taking class labels and pairwise constraints or prior membership degrees into account [2–11]. Semi-supervised approaches have proven effective, and have been widely used in different areas for classification [5, 11]. Although various studies have explored the use of such techniques in numerous areas such as web documents, images [8], biological information [6, 11] and text classification, none of them has been employed for leaf classification.

In this paper, we describe a novel semi-supervised fuzzy clustering algorithm with feature discrimination (SFFD). Given that SFFD is a novel algorithm intended to deal with semi-supervised clustering problems based on both feature discrimination and objective function optimizations with adaptive distance norm, we explore its real-world potential by calculating multiple features from leaf images as the dataset and evaluating the performance of SFFD on leaf classification. We assess the discriminative power of each feature through its effect on clustering accuracy and its performance in terms of normalized mutual information (NMI).

Our approach builds on that of Beghin et al. for shape and texture analysis [12], which has proven its effectiveness for leaf identification. We calculate the contour signatures of the leaf for shape analysis and make use of orientations of edge gradients to analyze the margin and macro-texture of the leaf using a Sobel operator [12]. However, the multiple feature dataset is usually of high dimensionality, and principal component analysis (PCA) can be an effective approach to handle this.

The paper is organized as follows. Section 2 outlines existing research on leaf classification. Section 3 introduces a framework for leaf classification, covering the feature extraction methods applied to the raw data and a description of the semi-supervised fuzzy clustering algorithm with feature discrimination. Our experimental setting is described in Section 4 and the results of the performance analysis among multiple features are shown in Section 5. Finally, Section 6 draws a conclusion.

2 Related work

Since the Swedish botanist Carolus Linnaeus proposed a systematic method for classifying plants in the 18th century, plant classification has been studied by scholars from various areas, adopting many different approaches, and much research has focused on leaf classification. With the rapid development of digital image processing techniques, various methods for leaf classification based on computer vision and pattern recognition have been investigated.

Existing methods for leaf classification based on the phenotypical characteristics can be divided into four groups: margin based analysis [13, 14], shape based techniques [15, 16], color based analysis [17] and techniques combining several of these features [12, 18]. The margin based algorithms analyze leaf margins from herbarium images, especially the margin teeth. After identifying teeth, it is possible to automatically extract characters from them such as tooth size and shape [41, 42]. The shape based techniques usually take the geometric features [19, 20] and shape factors into account. They often use the contour signatures of the leaves and calculate the differences between them [21–25]. The most common texture based methods are usually based on statistical analysis of the pixels (co-occurrence matrices, etc.) [26, 27] and on their spectral analysis (Fourier Transform, Wavelet Transform, Gabor filters, etc.) [28–32]. Some approaches combining more than one of the features above have also been studied and demonstrated to offer better performance [12, 18].

Research into semi-supervised clustering has largely focused on studying various formulations for constraints, converting diverse classical clustering algorithms into partially supervised ones and discussing different applications. The resulting methods can be divided into two groups: hard constraints based methods and fuzzy based methods. Such algorithms have commonly been applied to areas including hand-written recognition, document classification, face recognition, object detection and image tracking [33–35]. As different attributes or features have varying influence on the ability to identify certain plants, a weighted feature could be more practical. In many real world tasks, such as image retrieval applications [36], semi-supervised feature selection [37] based on pairwise constraints [36, 38] is more useful than obtaining the true class labels, because it is easier for users to decide whether some pairs of instances belong to the same class or not. Pairwise constraints are therefore the better choice for guiding our clustering process.

In this paper, we propose a framework for leaf classification based on our recently proposed semi-supervised clustering method. The contour signature is employed to detect a shape feature from the initial image and a Sobel operator is utilized to analyze the margin and the texture feature (see Fig. 1). For the clustering part, SFFD takes feature discrimination and pairwise constraints into account. The feature discrimination process attempts to reduce the complexity of the clustering task by weighting relevant features, while the pairwise constraints are a useful guide to achieve better classification with fewer steps.

3 Framework for leaf classification

The proposed leaf classification process includes five sub processes (see Fig. 2). During the whole process, different morphological features are employed as multiple features to partition the similar leaf images into the same group.

The first step for leaf classification is image acquisition. A set of leaf sample pictures (termed the input images) is captured with a digital camera to construct an initial image database for classification by the users.

The next step is image preprocessing. In this step, several standard image processing operations are carried out to enhance the relevant features of the input image for further processing. The corresponding operations include gray-scale conversion, image segmentation, binary conversion and image smoothing. Because the color of the leaf changes with the variation of atmosphere or season, the color feature has low reliability. A gray-scale image is therefore the better choice, and the image is segmented from its background after it is converted to gray-scale. Then it is converted to a binary image and finally image smoothing is performed.
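The preprocessing chain can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the experiments: the luminance weights, the fixed threshold of 128 (which here stands in for both segmentation and binary conversion, assuming a dark leaf on a light background), the 3 × 3 majority filter used for smoothing, and the function names `to_gray`, `to_binary` and `smooth` are all assumptions made for illustration.

```python
def to_gray(rgb):
    """Convert rows of (r, g, b) pixels to gray levels (assumed luminance weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in rgb]

def to_binary(gray, threshold=128):
    """Mark pixels darker than the (assumed) threshold as leaf (1), others as background (0)."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]

def smooth(binary):
    """3 x 3 majority filter on interior pixels; border pixels are copied unchanged."""
    rows, cols = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            total = sum(binary[y + a][x + b] for a in (-1, 0, 1) for b in (-1, 0, 1))
            out[y][x] = 1 if total >= 5 else 0
    return out

# Hypothetical 5 x 5 white image with a single dark noise pixel in the centre.
image = [[(255, 255, 255)] * 5 for _ in range(5)]
image[2][2] = (0, 0, 0)
mask = smooth(to_binary(to_gray(image)))
```

In this toy input the isolated dark pixel is treated as noise and removed by the smoothing step, so the resulting mask contains no leaf pixels.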

In the third step, different morphological features are extracted to construct the multiple feature vector matrices. The morphological features comprise the margin feature, the shape feature, the texture feature and the combination feature.

In the fourth step, multiple feature vector matrices are calculated from the processed image to generate multiple feature datasets. Then, a clustering algorithm is applied to partition the datasets into groups. With the help of the clustering algorithm, the input images are categorized into species, with the leaf images of each species grouped into the same cluster. Depending on the feature used, the input images can be clustered (grouped) in different ways.

In the last step, the system shows the classification results by partitioning similar images into the same clusters and dissimilar images into different clusters based on a certain distance, density measure and so on.

Among the five steps above, the third step and the fourth step are the most important in classifying plants. Several classification techniques such as k-Nearest Neighbor (k-NN), Learning Vector Quantization (LVQ), Probabilistic Neural Networks (PNN), Radial Basis Function (RBF) and Support Vector Machine (SVM) have been used as classifiers to measure the similarity between different images [12, 39, 40]. The preprocessing step that segments the leaves from their background is also a key factor in improving the performance of the classifier. Certain features such as leaf margin teeth can be automatically located and measured from images of herbarium specimens [41, 42]. In this paper, however, our recently proposed semi-supervised fuzzy clustering algorithm is used as a new approach to deal with this problem.

3.1 Construction of multiple feature vector matrices

Our system uses multiple feature vector matrices as input data to partition the leaves. There are four such matrices: the margin feature vector matrix, the shape feature vector matrix, the texture feature vector matrix and the combination vector matrix. We use a Sobel operator, contour signatures and gradient orientations to extract all vectors except the combination vector. We then take all their columns together to generate the combination vector.

3.1.1 Margin feature extraction methods

We use a Sobel operator to detect the edge information and then calculate the distance between each edge point and the centroid of the image using Equation (1). We then extract 64 ratios of each distance to the maximum distance as the margin feature using Equation (2). To reduce the dimensionality of the feature vector, we apply principal component analysis (PCA) and retain a feature vector that is 8 times smaller than the original one (see Fig. 3). The distance, d(i), between a contour point and the centre of the image and the margin feature, M(i), are obtained by:

$d (i) = \sqrt{(b_{x} (i) - c_{x})^{2} + (b_{y} (i) - c_{y})^{2}}$ (1)

$M (i) = d (i) / \max_{1 \le k \le n} d (k)$ (2)

where n is the number of points to be sampled, b_x(i), b_y(i) are the x and y co-ordinates of the ith contour pixel, and c_x, c_y are the co-ordinates of the centroid of the leaf.
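Equations (1) and (2) can be illustrated with a short sketch. It assumes the contour points have already been extracted and are ordered along the outline; approximating the centroid by the mean of the contour points, sampling at evenly spaced indices, and the function name `margin_feature` are assumptions made for illustration rather than details taken from the experiments.

```python
import math

def margin_feature(contour, n=64):
    """Return n ratios d(i) / max d(i) for evenly sampled contour points.

    contour: list of (x, y) edge pixels, assumed ordered along the outline.
    The centroid is approximated by the mean of the contour points (assumption).
    """
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    # Sample n points at evenly spaced indices along the contour.
    sampled = [contour[(i * len(contour)) // n] for i in range(n)]
    # Equation (1): Euclidean distance from each sampled point to the centroid.
    d = [math.hypot(bx - cx, by - cy) for bx, by in sampled]
    d_max = max(d)
    # Equation (2): normalize each distance by the maximum distance.
    return [di / d_max for di in d]

# Toy outline: an ellipse with semi-axes 4 and 2 (hypothetical data).
ellipse = [(4 * math.cos(2 * math.pi * t / 256), 2 * math.sin(2 * math.pi * t / 256))
           for t in range(256)]
M = margin_feature(ellipse)
```

For this ellipse the ratios range from about 0.5 (points on the short axis) to 1 (points on the long axis), which shows how the normalized distances encode the outline independently of the image scale.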

3.1.2 Shape feature extraction methods

A contour signature is employed for analyzing shapes. First, we obtain the outline of each image by checking whether the neighbors of each pixel include background pixels. The points at the four sides (top, bottom, right and left) are used to calculate the centroid of the image. Then, we compute two values of the shape feature as follows:

$f (i) = \sqrt{(b_{x} (j) - c_{x})^{2} + (b_{y} (j) - c_{y})^{2}}, \quad j = \frac{i \times l}{n}$ (3)

$g (i) = \left| \tan \left( \frac{b_{x} (j) - c_{x}}{b_{y} (j) - c_{y}} \right) - \frac{2 i \pi}{n} \right|, \quad j = \frac{i \times l}{n}$ (4)

where l is the length of the outline, n is the number of points to be sampled, b_x(j), b_y(j) are the x and y co-ordinates of the jth contour pixel, and c_x, c_y are the co-ordinates of the centroid of the leaf. In essence, f(i) is the distance between a contour point and the centre of the image, and g(i) is the angle between the start point (top point) and the contour point. Combining both of these, we obtain 64 shape features (half of which are distances and half angles), which PCA then reduces to a shape vector 8 times smaller (see Fig. 4).
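The contour signature can be sketched as follows, with 32 sampled points giving 32 distances and 32 angles. Several choices here are assumptions: the centroid is taken as the mean of the contour points, the contour is assumed to be ordered counter-clockwise from the start point, and, because g(i) is described as an angle, the sketch measures it with `atan2` relative to the start point instead of reproducing the tangent expression of Equation (4) literally. The name `shape_signature` is hypothetical.

```python
import math

def shape_signature(contour, n=32):
    """Return n centroid distances f(i) followed by n angle deviations g(i)."""
    l = len(contour)
    cx = sum(x for x, _ in contour) / l
    cy = sum(y for _, y in contour) / l
    x0, y0 = contour[0]
    start_angle = math.atan2(y0 - cy, x0 - cx)
    f, g = [], []
    for i in range(n):
        j = (i * l) // n  # index of the i-th sampled contour pixel
        bx, by = contour[j]
        # Equation (3): distance from the sampled point to the centroid.
        f.append(math.hypot(bx - cx, by - cy))
        # Angle of the sampled point measured from the start point (assumed form).
        angle = (math.atan2(by - cy, bx - cx) - start_angle) % (2 * math.pi)
        # Deviation from the angle expected for uniform angular spacing.
        g.append(abs(angle - 2 * math.pi * i / n))
    return f + g

# Toy outline: a circle of radius 3 (hypothetical data).
circle = [(3 * math.cos(2 * math.pi * t / 320), 3 * math.sin(2 * math.pi * t / 320))
          for t in range(320)]
signature = shape_signature(circle)
```

For a circle every distance equals the radius and every angle deviation is close to zero, so any departure from these values in a real leaf reflects its shape.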

3.1.3 Texture feature extraction methods

We compute the gradient orientations, which are regarded as the features of the relative directions of the main veins, using a Sobel operator as the texture feature for each image:

$h (\theta) = \begin{cases} \sum_{x} \sum_{y} M (x, y), & \text{if } \theta (x, y) = \theta \\ 0, & \text{otherwise} \end{cases}$ (5)

In Equation (5), M(x, y) is the gradient magnitude of pixel (x, y) and θ(x, y) is the gradient direction. Thus, we get the texture vector h of 64 features and finally obtain a typical vector that is 8 times smaller (see Fig. 5).
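A magnitude-weighted orientation histogram in the spirit of Equation (5) can be sketched in pure Python. Quantizing the gradient direction into 64 equal bins over the full circle and skipping border pixels are assumptions, as is the function name `orientation_histogram`; the paper does not state how directions are discretized.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def orientation_histogram(img, bins=64):
    """Accumulate Sobel gradient magnitudes into direction bins (interior pixels only)."""
    rows, cols = len(img), len(img[0])
    h = [0.0] * bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = sum(SOBEL_X[a][b] * img[y + a - 1][x + b - 1]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[y + a - 1][x + b - 1]
                     for a in range(3) for b in range(3))
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue  # flat region: no direction to record
            theta = math.atan2(gy, gx) % (2 * math.pi)
            k = int(theta / (2 * math.pi) * bins) % bins  # assumed quantization
            h[k] += mag
    return h

# Toy image: a vertical step edge, dark on the left and bright on the right.
step = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
h = orientation_histogram(step)
```

For the step edge all gradient energy falls into the bin for direction 0, since the intensity changes only along the x axis.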

3.2 Species identification using SFFD clustering

Our system aims to perform a classification that separates dissimilar input images into different groups and clusters similar images into the same group. Identification is performed by running a recently proposed semi-supervised method called SFFD, which clusters the multiple features extracted from the input images. The algorithm was designed to search for the optimal prototype parameters and the optimal set of feature weights under pairwise constraints. The key components of the algorithm are described below.

(1) Distance between clusters: an inner-product norm A _i is utilized to detect clusters of different geometrical shapes in a data set.

$\begin{matrix} d_{ijk}^{2} & = & {(x_{k} - c_{i})}^{T} A_{i} (x_{k} - c_{i}), \\ c_{i} & = & \frac{\sum_{j = 1}^{N} {(u_{ij})}^{m} x_{j}}{\sum_{j = 1}^{N} {(u_{ij})}^{m}} \end{matrix}$ (6)

$\begin{matrix} A_{i} & = & (ρ_{i} \det (F_{i}))^{\frac{1}{n}} F_{i}^{- 1}, \\ F_{i} & = & \frac{\sum_{j = 1}^{N} {(u_{ij})}^{m} (x_{j} - c_{i}) (x_{j} - c_{i})^{T}}{\sum_{j = 1}^{N} {(u_{ij})}^{m}} \end{matrix}$ (7)

where c_i is the mean of cluster i and u_ij is the membership degree of instance j to cluster i.

(2) Feature weights v_ik can be expressed as:

$v_{ik} = \frac{1}{n} + \frac{1}{2 \delta_{i}} \sum_{j = 1}^{N} {(u_{ij})}^{2} \left[ \frac{{\| x_{j} - c_{i} \|}^{2}}{n} - d_{ijk}^{2} \right]$ (8)

$\delta_{i}^{(t)} = K \frac{\sum_{j = 1}^{N} {(u_{ij}^{(t - 1)})}^{2} \sum_{k = 1}^{n} v_{ik}^{(t - 1)} {(d_{ijk}^{(t - 1)})}^{2}}{\sum_{k = 1}^{n} {(v_{ik}^{(t - 1)})}^{2}}$ (9)

where n is the number of features, K is a constant, and u_ij, v_ik and d_ijk with the superscript (t-1) are their values in iteration (t-1).

(3) Taking pairwise constraints into account and applying the Lagrange multiplier method, the objective function obtained can be written as:

$\begin{matrix} J = J_{1} + \alpha \left( \sum_{(x_{i}, x_{j}) \in M} \sum_{p = 1}^{C} \sum_{l = 1, l \neq p}^{C} u_{ip} u_{jl} \right. \\ \left. + \sum_{(x_{i}, x_{j}) \in \zeta} \sum_{p = 1}^{C} u_{ip} u_{jp} \right) - \varepsilon_{I} \left( \sum_{k = 1}^{C} u_{ik} - 1 \right) \end{matrix}$ (10)

$\begin{matrix} J_{1} = \sum_{i = 1}^{C} \sum_{j = 1}^{N} u_{ij}^{m} \left( \sum_{k = 1}^{n} v_{ik} d_{ijk}^{2} \right) \\ + \sum_{i = 1}^{C} \delta_{i} \sum_{k = 1}^{n} v_{ik}^{2} - \sum_{i = 1}^{C} \lambda_{i} \left( \sum_{k = 1}^{n} v_{ik} - 1 \right) \end{matrix}$ (11)

(4) The membership values of SFFD can be described as:

$u_{rs} = \frac{\varepsilon_{I}}{2 v_{rk} d_{rsk}^{2}} - \frac{\alpha \left( \sum_{(x_{r}, x_{j}) \in M} \sum_{l = 1, l \neq s}^{C} u_{jl} + \sum_{(x_{r}, x_{j}) \in \zeta} u_{js} \right)}{2 v_{rk} d_{rsk}^{2}}$ (12)

$\varepsilon_{I} = \frac{2}{\sum_{k = 1}^{C} \frac{1}{v_{rk} d_{rsk}^{2}}} + \alpha \frac{\sum_{k = 1}^{C} \frac{\left( \sum_{(x_{r}, x_{j}) \in M} \sum_{l = 1, l \neq s}^{C} u_{jl} + \sum_{(x_{r}, x_{j}) \in \zeta} u_{js} \right)}{v_{rk} d_{rsk}^{2}}}{\sum_{k = 1}^{C} \frac{1}{v_{rk} d_{rsk}^{2}}}$ (13)

where M denotes the set of ‘must-link’ constraints and ζ denotes the set of ‘cannot-link’ constraints.
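The role of the pairwise constraints in the objective function of Equation (10) can be illustrated by computing the penalty term on its own. This is only a sketch of that one term, not of SFFD: memberships are stored as `u[instance][cluster]`, the inner sum of the must-link part is read as running over cluster pairs with l different from p, and the function name `constraint_penalty` and the toy data are hypothetical.

```python
def constraint_penalty(u, must_link, cannot_link, alpha=1.0):
    """Penalty added to the objective for violated pairwise constraints.

    u: membership matrix, u[i][p] = membership of instance i in cluster p.
    must_link / cannot_link: lists of (i, j) instance index pairs.
    """
    n_clusters = len(u[0])
    # Must-link pairs are penalized when their mass sits in different clusters.
    ml = sum(u[i][p] * u[j][l]
             for i, j in must_link
             for p in range(n_clusters)
             for l in range(n_clusters) if l != p)
    # Cannot-link pairs are penalized when their mass sits in the same cluster.
    cl = sum(u[i][p] * u[j][p]
             for i, j in cannot_link
             for p in range(n_clusters))
    return alpha * (ml + cl)

# Three instances, two clusters: instances 0 and 1 in cluster 0, instance 2 in cluster 1.
u = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
satisfied = constraint_penalty(u, must_link=[(0, 1)], cannot_link=[(0, 2)])
violated = constraint_penalty(u, must_link=[(0, 2)], cannot_link=[(0, 1)])
```

With crisp memberships the penalty is zero when every constraint is satisfied and grows by one (times α) for each violated pair, which is how the constraints steer the partition during optimization.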

4 Methods and experiment

To study how to partition the images of various leaves into groups and to explore which of the multiple features are the key factors, several experiments were performed using a total of 192 digital images of ten species (see Table 1), with 16 or 32 images each (see Fig. 6). The images used in this study were collected in Wollaton Park, Nottingham, UK. The raw images are color images containing 4608 × 3456 pixels each. As a preprocessing step, we resized each of the images to 300 × 400 pixels. Then, a series of experimental studies were performed to estimate the validity of the whole framework with 192 leaves of 10 species.

Our experiment comprises three steps: construction of the multiple feature datasets, SFFD clustering and performance analysis. (1) After data preprocessing, we obtained 160 images of 300 × 400 pixels. A series of image preprocessing operations (conversion to gray-scale image, conversion to binary image, image segmentation and image smoothing) was then carried out, and three 160 × 64 feature matrices were extracted: the margin feature dataset, the shape feature dataset and the texture feature dataset. Taking all their columns together yields a 160 × 192 matrix, the combination feature dataset. Four 160 × 8 datasets for classification are then acquired through PCA dimensionality reduction (see Fig. 7). (2) With the datasets of multiple features extracted from the images, we use the SFFD clustering algorithm to partition the input matrices into groups. The classification process was performed with up to 30% labeled data. (3) Performance analysis covering four aspects (weights analysis, accuracy comparisons, feature identification and NMI values) was used to obtain a comprehensive understanding of our algorithm.
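The PCA reduction from a 160 × 64 feature matrix to a 160 × 8 dataset can be sketched with NumPy. The sketch computes PCA through a singular value decomposition of the centered data, which is one standard way to do it and not necessarily the routine used in the experiments; the random matrix stands in for a real feature dataset and the function name `pca_reduce` is hypothetical.

```python
import numpy as np

def pca_reduce(X, k=8):
    """Project the rows of X onto the first k principal components."""
    Xc = X - X.mean(axis=0)  # center each feature column
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Placeholder data with the shape of one feature dataset (160 images, 64 features).
rng = np.random.default_rng(0)
X = rng.random((160, 64))
Z = pca_reduce(X)  # shape (160, 8)
```

The same function applied to the 160 × 192 combination matrix would also return a 160 × 8 dataset, so all four inputs to the clustering step end up with the same dimensionality.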

5 Performance analysis

Since SFFD is a weighted clustering algorithm, eight samples were chosen from the multiple features (2 samples each) to examine how the curves of their membership weights vary in different iterations of the clustering. The result shows that different features have different effects on weights (see Fig. 8). For instance, the curve of the texture feature data set is relatively weaker than the rest of the features in separating the sample from one class to another. Further, the curves for the margin feature data set and combination feature data set achieve the best performance.

Comparisons of the accuracy of the SFFD method with a PNN based method [39], an RBF based method [40] and an incremental classification algorithm based method [12] were performed to evaluate the relative performance of SFFD (see Table 2), and especially to identify the relationship among the features used, the number of features and the classification accuracy. The PNN based method [39] is a leaf classification algorithm using shape, vein, colour and texture features based on a probabilistic neural network (PNN). The RBF based method [40] utilizes edge and texture fusion for plant leaf classification based on a radial basis function (RBF), while the incremental classification algorithm based method [12] adopts this approach for shape and texture based leaf classification.

Table 2 implies that utilizing complex classifiers with more features may lead to higher classification accuracies. For example, the PNN based method [39] with 4 features achieves the maximum performance of 93.75%, while algorithms with fewer features achieve a lower performance. However, the performance of the SFFD algorithm with different features demonstrates that the features contribute differently to classification accuracy. For instance, the SFFD runs with the margin, shape and texture features each use a single feature for leaf identification, but their classification accuracies differ: the margin feature achieves the best result and the texture feature the worst.

To investigate which is the most important feature for leaf classification among multiple features under various clustering algorithms, we adopt three classical clustering algorithms (FCM, Gustafson-Kessel and K-means) and SFFD. Furthermore, in order to have a comprehensive view of the performance of SFFD and to identify the most important feature (in terms of contribution to classification performance), two popular measures, accuracy and NMI, were utilized to analyze performance during the whole process. Hence, classification accuracies and NMI values were obtained for every feature (see Fig. 9). Figure 9 implies that a better choice of feature may lead to a better performance based on both accuracy and the NMI value. For example, based on either accuracy or the NMI value, the results of the margin feature, the shape feature and the combination feature are all better than those of the texture feature for all clustering algorithms.
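For readers unfamiliar with NMI, the measure can be computed from two label assignments as in the sketch below. Several normalizations of mutual information are in use; this sketch divides by the geometric mean of the two entropies, which is an assumption since the variant used in the experiments is not stated. The function name `nmi` is hypothetical.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings (geometric-mean variant)."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # Mutual information from the joint and marginal label frequencies.
    mi = sum((nij / n) * math.log(nij * n / (count_true[a] * count_pred[b]))
             for (a, b), nij in joint.items())
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())
    if h_true == 0 or h_pred == 0:
        # Degenerate case: at least one labeling has a single group.
        return 1.0 if h_true == h_pred else 0.0
    return mi / math.sqrt(h_true * h_pred)

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # cluster names swapped only
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])       # clusters cut across the classes
```

Because NMI compares partitions rather than label names, it does not require matching cluster indices to species, which is one reason it complements accuracy as an evaluation measure.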

Because the number of pairwise constraints is potentially important for enhancing the classification accuracy and NMI value, we varied the number of constraints from 4 to 24 to test the effect of the labeled data (see Fig. 10). In Fig. 10, it can be seen that the classification accuracy of the features reaches more than 80% with eight pairwise constraints (except for the texture feature), and the maximum increases to nearly 90% with only 30% labeled data. Their NMI values exceed 0.75 in the course of clustering. However, the texture feature proves to have the least value in clustering, as its highest classification accuracy is only 73% and the maximum value of its NMI is always below 0.75.

Figure 10 also demonstrates that different evaluation measures may not lead to the same conclusions. For example, the combination feature data set obtains a better classification accuracy than the shape feature data set in the range of 4 to 16 pairwise constraints, whereas the latter obtains a higher NMI value under the same constraints. From this, we can infer that it is necessary to consider alternative evaluation approaches.

6 Discussion and conclusion

This paper constructs a framework for leaf classification using a novel semi-supervised fuzzy clustering algorithm with integrated feature discrimination (SFFD). The multiple feature datasets extracted from digital images serve as input matrices. The SFFD approach is useful in this context as it includes a feature weighting process that not only speeds up the classification process but also improves the quality of classification. The experimental results indicate that our framework is workable with a small number of constraints for every multiple feature matrix on ten plant species. Further, these results suggest that under the same preprocessing method the margin feature and the combination feature, and to a lesser extent the shape feature, may be the best choice for leaf classification according to the specimens we utilized. However, experiments with different specimens may lead to different results, as specimens may differ in one feature but be similar in another. For example, leaves from different species may have very different shapes but similar textures. In this case, the shape feature may be the best choice to identify them.

This paper explores a real-world leaf classification problem based on our proposed SFFD method, by taking multiple features into account. SFFD is a semi-supervised clustering algorithm that applies an effective feature selection procedure by weighting the partition matrix and adopts the pairwise constraints provided by the user to guide the clustering process. We construct a framework not only for identifying the most important features in leaf classification but also for testing the performance of SFFD when applied to classification problems. The result shows that under the same algorithm with the same conditions, the margin feature or the shape feature should be the better choice for classification among the features extracted from the leaf image, and the overall classification rate of the SFFD algorithm reaches 83.59% with limited constraints.

Multiclass classification is always a challenge. As many of the leaves have very different features and the species belong to a number of different genera, the identification of leaves is a difficult problem because of the high intra-species variability and low inter-species variation. Besides improving the recognition ability of the classifier, the preprocessing method that segments the leaves from their background is also an important factor for the identification rate. The approach adopted in this work demonstrates the importance of both the combination of various preprocessing methods and the choice of the classification approach to maintain or improve the identification accuracy.

This paper is focused on describing a novel algorithm and performing some initial evaluation. We have therefore not attempted to also carry out detailed image preprocessing modeling to extract multiple leaf features. However, classification accuracy depends in part on the difficulty of the classification problem, such as which species are considered, the number of specimens and which feature is taken into account. In future work, we will therefore consider how to obtain more effective multiple features by optimizing the preprocessing steps of the original images, and assess the classification effectiveness with different numbers of specimens.

Acknowledgments

This work is partially supported by the National High-tech Research and Development Program (863 program: 2013AA10230402), the National Natural Science Foundation (61402374), the Chinese Scholarship Council and the Research Foundation of Shaanxi Polytechnic Institute under Grant No. ZK11-34.

References

10.

11.

12.

13.

14.

15.

16.

17.

18.

19.

20.

21.

22.

23.

24.

25.

26.

27.

28.

29.

30.

31.

32.

33.

34.

35.

36.

37.

38.

39.

40.

41.

42.