Joint graph regularized extreme learning machine for multi-label image classification

Abstract

Extreme learning machine (ELM) has proven to be an efficient and effective machine learning method for pattern classification and regression. However, ELM is mainly applied to traditional supervised learning problems and is not commonly used in multi-label image classification. In this paper, we propose a joint graph regularized extreme learning machine (JGELM) that simultaneously considers the feature information and the label correlation of the data. Specifically, we exploit the feature distance and label correlation in the local neighborhood. To this end, a joint graph regularizer, based on a newly designed graph Laplacian that characterizes both properties, is formulated and incorporated into the ELM objective. Four popular multi-label image data sets are employed to test the proposed method. The experimental results show that JGELM is competitive with state-of-the-art multi-label classification algorithms in terms of accuracy and efficiency.

Keywords

Extreme learning machine, feature distance, label correlation, multi-label image classification

1. Introduction

Owing to the development of the Internet and visual data sharing websites, the number of available image databases has increased dramatically in the last decade. Providing efficient solutions to image classification has always been a major focus in computer vision [1, 2, 3, 4, 5, 6]. Recent state-of-the-art image classification methods mainly include the support vector machine (SVM), spatial pyramid matching (SPM), locality constrained linear coding (LLC) and so on.

The support vector machine (SVM) has been popular over the last two decades and was designed to overcome the drawbacks of the back propagation neural network (BPNN). However, an SVM model is usually large for a large training data set because the number of selected support vectors increases with the size of the training data set. In addition, an SVM model with more support vectors takes longer to execute, so SVM may not meet the current requirements of a mathematical engine model.

One particular extension of the bag of features (BoF) model, called spatial pyramid matching (SPM) [7], has achieved remarkable success on a range of image classification benchmarks such as Caltech-101 [8] and Caltech-256 [3]. It has been found empirically that, to obtain good performance, both BoF and SPM must be applied together with a particular type of nonlinear Mercer kernel, e.g. the intersection kernel or the Chi-square kernel. Accordingly, the nonlinear SVM has to pay a computational complexity of O( $n^{3}$ ) and a memory complexity of O( $n^{2}$ ) in the training phase, where $n$ is the size of the training data set. Furthermore, since the number of support vectors grows linearly with $n$ , the computational complexity in testing is O( $n$ ). This poor scalability implies a severe limitation: it is nontrivial to apply these methods to real-world applications, whose training size is typically far beyond thousands.

The locality constrained linear coding (LLC) algorithm [9] is an efficient local coordinate linear coding method, which projects each descriptor into a local constraint system to obtain an effective codebook or dictionary. It has been demonstrated to be a promising image representation method, and experimental results for image classification on several well-known data sets validate its good performance. However, because LLC encodes descriptors by minimizing the reconstruction error under a shift-invariance constraint, the resulting codes may contain negative elements. A large difference between the negative and the positive elements of a code leads to instability of the coding.

Extreme learning machine (ELM) [10, 11] has attracted the attention of more and more researchers in recent years [12, 13, 14]. Compared with traditional machine learning methods such as the support vector machine (SVM) and spatial pyramid matching (SPM), it provides better generalization performance at a much faster learning speed and with the least human intervention [15]. Although ELM has been well researched as a single classifier, ELM ensembles are less explored for pattern classification tasks. Recently, however, such a demand has been raised by so-called big data. This term refers to a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools [17].

In this paper, based on the idea that similar samples should share similar properties, we propose a joint graph regularized extreme learning machine (JGELM). In JGELM, a constraint imposed on the output weights enforces that the outputs of the samples respect both their feature distance similarity and their label correlation. The constraint is formulated as a regularization term added to the objective of the basic ELM model, so the output weights can still be solved analytically. We evaluate the new method on multi-label classification benchmark data sets and compare the results with state-of-the-art multi-label classification methods.

The remainder of this paper is organized as follows. Section 2 describes the basic extreme learning machine model as well as its $l_{2}$ -norm regularized version. Section 3 introduces the proposed joint graph regularized extreme learning machine (JGELM), including its model formulation and optimization method. Section 4 gives detailed experiments that evaluate JGELM for multi-label classification on several widely used data sets. Conclusions are given in Section 5.

2. Extreme learning machine

In this section, we review the extreme learning machine (ELM) in detail. The extreme learning machine, proposed by Huang et al. [10, 11], is a simple learning machine for single-hidden-layer feedforward neural networks (SLFNs).

Given a training set $L=\left\{(x_{i},t_{i})\mid x_{i}\in R^{d},\ t_{i}\in R^{c},\ i=1,2,\ldots,N\right\}$ , where $x_{i}=(x_{i1},x_{i2},\ldots,x_{id})^{T}$ and $t_{i}=(t_{i1},t_{i2},\ldots,t_{ic})^{T}$ , $d$ is the dimension of the sample feature vector and $c$ is the number of label classes. In ELM, the network input weights $W\in R^{k\times d}$ and the hidden layer biases $b\in R^{k}$ are randomly generated. An ELM with $k$ hidden neurons and activation function $h$ is modeled as [10, 11]

$\sum\limits_{j=1}^{k}{\beta_{j}}h_{j}(x_{i})=\sum\limits_{j=1}^{k}{\beta_{j}}h(w_{j}\cdot x_{i}+b_{j})=o_{i},\quad i=1,2,\ldots,N$ (1)

where $w_{j}=(w_{j1},w_{j2},\ldots,w_{jd})$ is the input weight vector connecting the $j$ -th hidden node with the input nodes, $\beta_{j}=(\beta_{j1},\beta_{j2},\ldots,\beta_{jc})^{T}$ is the output weight vector connecting the $j$ -th hidden node with the output nodes, and $o_{i}=(o_{i1},o_{i2},\ldots,o_{ic})^{T}$ is the network output corresponding to sample $x_{i}$ . The ordinary ELM aims to minimize the objective

$\min\limits_{\beta}\left\|{\beta^{T}H-T}\right\|^{2}$ (2)

where $H$ is the hidden layer output matrix

$\displaystyle H=\left[\begin{array}{ccc}h(w_{1}\cdot x_{1}+b_{1})&\cdots&h(w_{1}\cdot x_{N}+b_{1})\\ h(w_{2}\cdot x_{1}+b_{2})&\cdots&h(w_{2}\cdot x_{N}+b_{2})\\ \vdots&\ddots&\vdots\\ h(w_{k}\cdot x_{1}+b_{k})&\cdots&h(w_{k}\cdot x_{N}+b_{k})\end{array}\right]_{k\times N}$ (3)

with $\beta=[\beta_{1},\beta_{2},\ldots,\beta_{k}]^{T}\in R^{k\times c}$ and $T=[t_{1},t_{2},\ldots,t_{N}]\in R^{c\times N}$ .

Therefore, the output weight matrix $\beta$ can be estimated analytically by

$\widehat{\beta}=\arg\min\limits_{\beta}\left\|{\beta^{T}H-T}\right\|_{2}^{2}=(H^{T})^{{\dagger}}T^{T}$ (4)

where $(H^{T})^{{\dagger}}$ is the Moore-Penrose generalized inverse of $H^{T}$ . If $\textit{HH}^{T}$ is nonsingular, Eq. (4) can be written as

$\displaystyle\widehat{\beta}=(\textit{HH}^{T})^{-1}\textit{HT}^{T}$ (5)
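The basic ELM fit described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid activation, the standard normal draws for the input weights and biases, and the function names `hidden_output` and `elm_fit` are all choices made here for the example. Samples are stored as rows of `X`, so `H` has shape $k\times N$ as in Eq. (3).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_output(X, W, b):
    """Hidden layer output matrix H of shape (k, N), as in Eq. (3).
    X: (N, d) samples as rows, W: (k, d) input weights, b: (k,) biases."""
    return sigmoid(W @ X.T + b[:, None])

def elm_fit(X, T, k, rng):
    """Fit a basic ELM. T has shape (c, N). Returns (W, b, beta)."""
    d = X.shape[1]
    W = rng.standard_normal((k, d))    # random input weights (assumed Gaussian)
    b = rng.standard_normal(k)         # random hidden biases (assumed Gaussian)
    H = hidden_output(X, W, b)
    # Least-squares solution of beta^T H = T, i.e. H^T beta = T^T
    beta = np.linalg.pinv(H.T) @ T.T   # shape (k, c)
    return W, b, beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))                        # toy data, N=50, d=5
T = rng.integers(0, 2, size=(3, 50)).astype(float)     # toy targets, c=3
W, b, beta = elm_fit(X, T, k=10, rng=rng)

# Closed form of Eq. (5), valid when H H^T is nonsingular
H = hidden_output(X, W, b)
beta_closed = np.linalg.solve(H @ H.T, H @ T.T)
```

On this toy problem the pseudoinverse solution and the closed form of Eq. (5) agree to numerical precision, since $k<N$ and $\textit{HH}^{T}$ is well conditioned.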

In order to improve the stability and generalization performance of the ordinary ELM, Huang et al. proposed the equality constrained optimization-based ELM. In this method, the solution of the regularized ELM can be expressed as

$\widehat{\beta}=\left(\textit{HH}^{T}+\frac{I}{\eta}\right)^{-1}\textit{HT}^{T}$ (6)

where $\eta$ is a constant and $I$ is the identity matrix.

The solution shown in Eq. (6) can be obtained by solving the following optimization problem.

$\min\limits_{\beta}\left\|{\beta^{T}H-T}\right\|_{2}^{2}+\frac{1}{\eta}\left\|\beta\right\|_{2}^{2}$ (7)

where $\left\|\beta\right\|_{2}^{2}=\sum\nolimits_{j=1}^{k}{\left\|{\beta_{j}}\right\|}_{2}^{2}$ is regarded as the regularization term and $\left\|{\beta_{j}}\right\|_{2}^{2}$ denotes the squared $l_{2}$ -norm of the vector $\beta_{j}$ . Moreover, $\eta$ is the regularization parameter that balances the influence of the error term and the model complexity.
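As a quick numerical illustration of the regularized solution, the sketch below solves the ridge-type linear system with the $k\times k$ matrix $\textit{HH}^{T}+I/\eta$ and checks that the gradient of the objective in Eq. (7) vanishes at the result. The function name `ridge_elm_solve` and the random toy matrices are assumptions made for the example only.

```python
import numpy as np

def ridge_elm_solve(H, T, eta):
    """Regularized ELM output weights: (H H^T + I/eta)^{-1} H T^T.
    H: (k, N) hidden outputs, T: (c, N) targets. Returns beta of shape (k, c)."""
    k = H.shape[0]
    return np.linalg.solve(H @ H.T + np.eye(k) / eta, H @ T.T)

rng = np.random.default_rng(1)
H = rng.random((8, 40))                                # toy hidden outputs
T = rng.integers(0, 2, size=(2, 40)).astype(float)    # toy targets
eta = 10.0
beta = ridge_elm_solve(H, T, eta)

# Gradient of Eq. (7) with respect to beta: 2 H (H^T beta - T^T) + (2/eta) beta
grad = 2.0 * H @ (H.T @ beta - T.T) + (2.0 / eta) * beta

# A smaller eta means a stronger penalty and therefore smaller weights
norm_strong = np.linalg.norm(ridge_elm_solve(H, T, 0.01))
norm_weak = np.linalg.norm(ridge_elm_solve(H, T, 100.0))
```

Solving the linear system directly avoids forming an explicit inverse, which is generally the more stable choice when $\textit{HH}^{T}$ is poorly conditioned.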

3. Joint feature neighbor graph and label relation graph regularized ELM

3.1 Feature neighbor graph

Given a set of $d$ -dimensional data points $x_{1},x_{2},\ldots,x_{m}$ , we can construct a nearest neighbor graph G with $m$ vertices, where each vertex represents the feature vector of an input sample. Let $W$ be the weight matrix of G. If $x_{i}$ is among the k-nearest neighbors of $x_{j}$ in terms of feature distance, or $x_{j}$ is among the k-nearest neighbors of $x_{i}$ , then $W_{ij}=1$ ; otherwise, $W_{ij}=0$ . We define the degree of $x_{i}$ as $d_{i}=\sum\nolimits_{j=1}^{m}{W_{ij}}$ , and $D=\textit{diag}(d_{1},d_{2},\ldots,d_{m})$ .
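The graph construction above can be sketched as follows. The text only speaks of a "feature distance"; the use of the Euclidean distance here is an assumption, as are the helper names `knn_graph` and `graph_laplacian` and the parameter name `n_neighbors`.

```python
import numpy as np

def knn_graph(X, n_neighbors):
    """Symmetric 0/1 nearest neighbor weight matrix for the rows of X (m, d).
    W[i, j] = 1 if x_i is among the nearest neighbors of x_j or vice versa."""
    m = X.shape[0]
    # Pairwise squared Euclidean distances (Euclidean metric is an assumption)
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(dist, np.inf)                     # a point is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :n_neighbors]    # nearest neighbors per row
    W = np.zeros((m, m))
    W[np.repeat(np.arange(m), n_neighbors), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                          # the "or" rule makes W symmetric

def graph_laplacian(W):
    """Return (L, D) with D the diagonal degree matrix and L = D - W."""
    D = np.diag(W.sum(axis=1))
    return D - W, D

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))    # toy data, m=20, d=4
W = knn_graph(X, n_neighbors=3)
L, D = graph_laplacian(W)
```

Because of the "or" rule, a vertex can end up with more than `n_neighbors` edges, so the degrees $d_{i}$ are not constant.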

Considering the problem of mapping the weighted graph G to the sparse representations $Y$ , a reasonable criterion for choosing a “good” map is to minimize the following objective function

$\min\frac{1}{2}\sum\limits_{i=1}^{m}{\sum\limits_{j=1}^{m}{\left\|{y_{i}-y_{j}}\right\|^{2}W_{ij}}}=\textit{Tr}(YLY^{T})$ (8)

where $L=D-W$ [18] is the Laplacian matrix, $y_{i}$ and $y_{j}$ are the predictions with respect to patterns $x_{i}$ and $x_{j}$ , respectively, and $Y=\beta^{T}H$ in the extreme learning machine setting.

As discussed in [18], instead of using $L$ directly, we can normalize it as $D^{-\frac{1}{2}}LD^{-\frac{1}{2}}$ , or replace it by $L^{p}$ (where $p$ is an integer), based on some prior knowledge.
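The relation between the pairwise smoothness penalty and the trace form can be checked numerically. In the sketch below the predictions are stored as the columns of $Y$ (matching $Y=\beta^{T}H$ ), so the trace is taken of $YLY^{T}$ . The random graph, the variable names, and the guard for isolated vertices in the normalization are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, c = 6, 3

# Random symmetric 0/1 weight matrix with zero diagonal (toy graph)
W = rng.integers(0, 2, size=(m, m)).astype(float)
W = np.triu(W, 1)
W = W + W.T
d = W.sum(axis=1)
L = np.diag(d) - W

Y = rng.standard_normal((c, m))    # column i is the prediction y_i

# Left-hand side: 1/2 * sum_ij ||y_i - y_j||^2 W_ij
pairwise = 0.5 * sum(
    W[i, j] * np.sum((Y[:, i] - Y[:, j]) ** 2)
    for i in range(m)
    for j in range(m)
)
# Right-hand side: Tr(Y L Y^T)
trace_form = np.trace(Y @ L @ Y.T)

# Normalized Laplacian D^{-1/2} L D^{-1/2}; isolated vertices (degree 0)
# are given a unit scale here to avoid division by zero (an assumption).
d_safe = np.where(d > 0, d, 1.0)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_safe))
L_norm = D_inv_sqrt @ L @ D_inv_sqrt
eigvals = np.linalg.eigvalsh(L_norm)
```

The eigenvalues of the normalized Laplacian lie in $[0,2]$ , which keeps the scale of the regularizer comparable across graphs with very different degrees.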

3.2 Label relation graph

Given a set of $d$ -dimensional data points $x_{1},x_{2},\ldots,x_{m}$ , we can construct a label relation graph. Without loss of generality, we assume that each training sample $x_{i}$ is labeled with a binary annotation vector $R_{i}=(r_{i1},\ldots,r_{ic})\in\{0,1\}^{c}$ , such that $R_{i}(l)=1$ if $x_{i}$ is annotated with the $l$ -th class and 0 otherwise, $\forall i=1,2,\ldots,m$ .

We utilize the following cosine similarity to calculate label affinity matrix

$A(i,j)=\cos(R_{i},R_{j})=\frac{\langle R_{i},R_{j}\rangle}{\left\|{R_{i}}\right\|\times\left\|{R_{j}}\right\|}$ (9)
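A small worked example of the cosine label affinity in Eq. (9) is given below. The function name `label_affinity` and the `eps` guard for samples without any label (which would otherwise give 0/0) are assumptions of this sketch; the text does not say how such samples are handled.

```python
import numpy as np

def label_affinity(R, eps=1e-12):
    """Cosine similarity between the rows of a binary label matrix R (m, c)."""
    R = np.asarray(R, dtype=float)
    norms = np.linalg.norm(R, axis=1)
    # eps avoids 0/0 for unlabeled samples (assumed handling)
    denom = np.maximum(np.outer(norms, norms), eps)
    return (R @ R.T) / denom

# Four samples, four classes
R = np.array([[1, 0, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0]])
A = label_affinity(R)
# A[0, 1] = 1.0   identical label sets
# A[0, 2] = 0.0   disjoint label sets
# A[0, 3] = 0.5   one shared label out of two each
```

Samples with identical label sets get affinity 1 and samples with disjoint label sets get affinity 0, so partial overlap is graded between the two.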

Similar to the feature neighbor graph G, a reasonable criterion for choosing a “good” map to the representations $Y$ is to minimize the following objective function [17]

$\min\frac{1}{2}\sum\limits_{i=1}^{m}{\sum\limits_{j=1}^{m}{\left\|{y_{i}-y_{j}}\right\|^{2}A(i,j)}}=\textit{Tr}(YAY^{T})$ (10)

where $y_{i}$ and $y_{j}$ are the predictions with respect to patterns $x_{i}$ and $x_{j}$ , respectively, and $Y=\beta^{T}H$ in the extreme learning machine setting. In the trace term, $A$ denotes the Laplacian matrix of the label relation graph built from the affinities in Eq. (9), as in Step 2 of Algorithm 1.

3.3 The proposed JGELM model

By modifying the ordinary ELM objective in Eq. (4), we formulate the proposed JGELM as:

$\displaystyle\min\limits_{\beta}\left\|{\beta^{T}H-T}\right\|_{2}^{2}+\lambda_{1}\textit{Tr}(\beta^{T}\textit{HLH}^{T}\beta)+\lambda_{2}\textit{Tr}(\beta^{T}\textit{HAH}^{T}\beta)+\lambda_{3}\left\|\beta\right\|_{2}^{2}$ (11)

where $\textit{Tr}\left({\beta^{T}\textit{HLH}^{T}\beta}\right)$ is the feature neighbor graph regularization term, $\textit{Tr}({\beta^{T}\textit{HAH}^{T}\beta})$ is the label relation graph regularization term, $\left\|\beta\right\|_{2}^{2}$ is the $l_{2}$ -norm regularization term, and $\lambda_{1}$ , $\lambda_{2}$ and $\lambda_{3}$ are regularization parameters that balance the impact of these three terms. If we set $\lambda_{2}=0$ , Eq. (11) becomes the formulation of the discriminative graph regularized extreme learning machine (GELM) [19].

Let $F=\left\|{\beta^{T}H-T}\right\|_{2}^{2}+\lambda_{1}\textit{Tr}(\beta^{T}\textit{HLH}^{T}\beta)+\lambda_{2}\textit{Tr}(\beta^{T}\textit{HAH}^{T}\beta)+\lambda_{3}\left\|\beta\right\|_{2}^{2}$ . We can obtain $\beta$ by setting the derivative of the objective function $F$ with respect to $\beta$ to zero as follows:

$\displaystyle\frac{\partial F}{\partial\beta}=\frac{\partial}{\partial\beta}\left\{\textit{Tr}[(\beta^{T}H-T)^{T}(\beta^{T}H-T)]+\lambda_{1}\textit{Tr}(\beta^{T}\textit{HLH}^{T}\beta)+\lambda_{2}\textit{Tr}(\beta^{T}\textit{HAH}^{T}\beta)+\lambda_{3}\left\|\beta\right\|_{2}^{2}\right\}=2\textit{HH}^{T}\beta-2\textit{HT}^{T}+2\lambda_{1}\textit{HLH}^{T}\beta+2\lambda_{2}\textit{HAH}^{T}\beta+2\lambda_{3}\beta=0$ (12)

As a result, we have

$\beta=(\textit{HH}^{T}+\lambda_{1}\textit{HLH}^{T}+\lambda_{2}\textit{HAH}^{T}+\lambda_{3}I)^{-1}\textit{HT}^{T}$ (13)

The algorithm description of our proposed JGELM is summarized in Algorithm 1.

Algorithm 1: The JGELM algorithm
INPUT:	The training set $\{(x_{i},r_{i})\mid x_{i}\in\mathbb{R}^{d},\ r_{i}\in\left\{{0,1}\right\}^{c},\ i=1,2,\ldots,N\}$ , activation function $h$ , number of hidden nodes $k$ , and regularization parameters $\lambda_{1},\lambda_{2}$ and $\lambda_{3}$ ;
OUTPUT:	Output weight matrix $\beta$ ;
Step 1:	Construct the feature neighbor graph Laplacian matrix L;
Step 2:	Construct the label relation graph Laplacian matrix A;
Step 3:	Randomly assign input weights $w_{j}$ and biases $b_{j},j=1,\ldots,k$ , and calculate the output matrix of the hidden neurons H;
Step 4:	Calculate the output weight matrix $\beta$ according to Eq. (13).
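The four steps of Algorithm 1 can be put together in a short NumPy sketch. It is an illustration under several assumptions rather than the authors' code: the Euclidean nearest neighbor graph, the sigmoid activation, the Gaussian random weights, and all function and parameter names (`jgelm_fit`, `jgelm_scores`, `n_neighbors`, `seed`) are introduced here. Following Step 2, the label term uses the Laplacian of the cosine affinity of Eq. (9).

```python
import numpy as np

def knn_weights(X, n_neighbors):
    """Symmetric 0/1 nearest neighbor graph on the rows of X (Euclidean, assumed)."""
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(dist, np.inf)
    idx = np.argsort(dist, axis=1)[:, :n_neighbors]
    W = np.zeros((X.shape[0], X.shape[0]))
    W[np.repeat(np.arange(X.shape[0]), n_neighbors), idx.ravel()] = 1.0
    return np.maximum(W, W.T)

def cosine_affinity(R, eps=1e-12):
    """Cosine similarity between label vectors (rows of R), Eq. (9)."""
    norms = np.linalg.norm(R, axis=1)
    return (R @ R.T) / np.maximum(np.outer(norms, norms), eps)

def laplacian(S):
    """Graph Laplacian D - S of a symmetric weight or affinity matrix."""
    return np.diag(S.sum(axis=1)) - S

def hidden(X, W_in, b):
    """Sigmoid hidden layer output of shape (k, N)."""
    return 1.0 / (1.0 + np.exp(-(W_in @ X.T + b[:, None])))

def jgelm_fit(X, R, k, lam1, lam2, lam3, n_neighbors=5, seed=0):
    """X: (N, d) features, R: (N, c) binary labels. Returns (W_in, b, beta)."""
    R = np.asarray(R, dtype=float)
    rng = np.random.default_rng(seed)
    L_feat = laplacian(knn_weights(X, n_neighbors))    # Step 1
    L_lab = laplacian(cosine_affinity(R))              # Step 2
    W_in = rng.standard_normal((k, X.shape[1]))        # Step 3: random weights
    b = rng.standard_normal(k)
    H = hidden(X, W_in, b)
    # Step 4: closed-form solution of Eq. (13); here T^T equals R
    M = (H @ H.T + lam1 * H @ L_feat @ H.T
         + lam2 * H @ L_lab @ H.T + lam3 * np.eye(k))
    beta = np.linalg.solve(M, H @ R)
    return W_in, b, beta

def jgelm_scores(X, W_in, b, beta):
    """Real-valued label scores of shape (N, c)."""
    return (beta.T @ hidden(X, W_in, b)).T

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))                    # toy features
R = (rng.random((60, 4)) < 0.4).astype(float)       # toy multi-label targets
W_in, b, beta = jgelm_fit(X, R, k=20, lam1=1.0, lam2=1.0, lam3=0.1)
scores = jgelm_scores(X, W_in, b, beta)
```

Only a $k\times k$ system is solved, so the cost of Step 4 is governed by the number of hidden nodes, while building the two $N\times N$ graphs dominates the memory use. Turning the real-valued scores into label sets (for example by thresholding) is not specified in the text and is left out here.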

4. Experimental results

4.1 Experiment data

We test the JGELM on four popular multi-label image data sets, which have been widely used for evaluating multi-label learning algorithms.

The Barcelona image data set is composed of urban scenes from Barcelona and consists of 139 urban scene images in “jpeg” format with a minimum resolution of $\mathrm{1600\times 1200}$ . The Barcelona data set has 4 overlapping labels: “Buildings”, “Flora”, “People” and “Sky”. Each image is represented by a 778-dimensional feature vector formed by concatenating LBP [9] and GIST [10].

The Nature scene data set [2] contains 2407 images, each represented by a 294-dimensional vector and labeled with 6 semantic concepts.

PASCAL VOC 2007 is a visual object recognition challenge data set that extends PASCAL VOC 2006. It has 9963 images with 4 groups of annotations, and each group can be further divided into the following classes. Person: person; Animal: bird, cat, cow, dog, horse, sheep; Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train; Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor. We download the 512-dimensional GIST feature and the 4096-dimensional RGB feature, extracted from all the images, as the image descriptors.

MIR FLICKR2008 is a public image data set used for the ACM sponsored image retrieval evaluation. It has 25000 images with 38 classes downloaded from the social photography site Flickr through its public API. After removing the most common annotations, i.e. colors, seasons and place names, the average number of annotations per image is 8.94. In the collection there are 1386 annotations which occur in at least 20 images. We download the 512-dimensional GIST image descriptor extracted from all the images.

The data sets are summarized in Table 1.

Table 1
Data sets summary

Data sets	Samples (n)	Features (d)	Labels (c)
BARCELONA	139	778 (GIST $+$ LBP)	4
SCENE	2407	294	6
PASCAL07	9963	4608 (GIST $+$ RGB)	20
MIRFLICKR08	25000	4608 (GIST $+$ RGB)	38

4.2 Experimental setup

In all experiments, we use 5-fold cross validation. Specifically, we split the data evenly into 5 folds, using 4 folds for training and the remaining fold for testing. In each training step, we further divide the training data into 5 parts, using 4 parts for training and the remaining part as the validation set to tune the regularization parameters. We repeat the above procedure 5 times and report the average classification results.
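The splitting protocol can be sketched as index bookkeeping. The random shuffling, the single inner split per outer fold, and the names `kfold_indices` and `nested_splits` are assumptions of this example; the sample count 139 is that of the BARCELONA data set in Table 1.

```python
import numpy as np

def kfold_indices(n, n_folds, rng):
    """Shuffle the indices 0..n-1 and split them into n_folds nearly equal folds."""
    return np.array_split(rng.permutation(n), n_folds)

def nested_splits(n, n_outer=5, n_inner=5, seed=0):
    """Yield (train_idx, val_idx, test_idx) for each outer fold.
    The outer training folds are split again into n_inner parts:
    one part is held out for validation, the rest are used for training."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(n, n_outer, rng)
    for i, test_idx in enumerate(folds):
        trainval = np.concatenate([f for j, f in enumerate(folds) if j != i])
        parts = np.array_split(rng.permutation(trainval), n_inner)
        val_idx = parts[0]
        train_idx = np.concatenate(parts[1:])
        yield train_idx, val_idx, test_idx

splits = list(nested_splits(139))    # 139 samples, as in BARCELONA
```

Each sample appears in exactly one test fold, and the validation part is always carved out of the training folds, so the test data never influence the choice of the regularization parameters.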

We evaluate the proposed JGELM on the four multi-label image data sets with different combinations of the parameters $\lambda_{1},\lambda_{2}$ and $\lambda_{3}$ while fixing the number of hidden nodes at $k\in\{100,500,1000\}$ . For example, JGELM achieves consistently good performance for $\lambda_{1}\in\{2^{-3},2^{-2},\ldots,2^{4}\}$ , $\lambda_{2}\in\{2^{-1},2^{0},\ldots,2^{3}\}$ and $\lambda_{3}\in\{10^{-5},10^{-4},\ldots,10^{-1}\}$ .
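The parameter ranges quoted above can be enumerated as a grid. Treating the ranges of good performance as the actual search grid is an assumption of this sketch, and the variable names are illustrative only.

```python
import itertools

hidden_sizes = [100, 500, 1000]                   # number of hidden nodes k
lam1_grid = [2.0 ** p for p in range(-3, 5)]      # 2^-3, ..., 2^4
lam2_grid = [2.0 ** p for p in range(-1, 4)]      # 2^-1, ..., 2^3
lam3_grid = [10.0 ** p for p in range(-5, 0)]     # 10^-5, ..., 10^-1

# Every combination (k, lambda_1, lambda_2, lambda_3)
grid = list(itertools.product(hidden_sizes, lam1_grid, lam2_grid, lam3_grid))
n_configs = len(grid)                             # 3 * 8 * 5 * 5 combinations
```

With 600 combinations per validation split, the closed-form solution of Eq. (13) matters in practice: for a fixed $k$ , the hidden layer output and the two graph matrices can be computed once and reused for every choice of the three regularization parameters.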

4.3 Multi-label classification results

We evaluate the performance of the algorithm by using the accuracy of multi-labeled image classification.

Table 2
Classification accuracy (%) comparison on the four multi-label image data sets

Data	SVM	LLC	ELM	GELM	JGELM
BARCELONA	74.76	88.23	87.37	89.82	93.41
SCENE	66.35	77.93	75.54	79.01	80.11
PASCAL07	64.77	75.92	70.88	71.67	73.49
MIRFLICKR08	58.89	71.61	66.73	68.96	67.28

Table 2 reports the accuracies of the five algorithms on the four data sets. The proposed JGELM achieves higher accuracy than the conventional ELM on all four data sets, and higher accuracy than GELM on all but MIRFLICKR08. JGELM performs best on the data sets with fewer samples and classes (BARCELONA and SCENE). On the larger data sets with more categories (PASCAL07 and MIRFLICKR08), its accuracy is lower than that of LLC.

5. Conclusions

In this paper, we have proposed JGELM to extend the traditional ELM to multi-label classification. The joint graph regularized extreme learning machine (JGELM) simultaneously considers the feature information and the label correlation of the data. Specifically, it exploits the feature distance and label correlation in the local neighborhood. Compared to existing multi-label algorithms, the proposed JGELM maintains almost all the advantages of ELM, such as remarkable training efficiency and direct implementation for multi-class classification problems. It also achieves results competitive with several state-of-the-art multi-label classification algorithms while requiring significantly less training time. JGELM is expected to greatly expand the applicability of ELM and to provide new insights into the extreme learning paradigm.

References

1. Kobayashi, BoF meets HOG: Feature extraction based on histograms of oriented pdf gradients for image classification, Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013), 747–754.

2. The PASCAL Visual Object Classes Challenge 2007 (VOC2007). http://www.pascal-network.org/challenges/VOC/voc2007/index.html.

3. Griffin, Holub and Perona, Caltech-256 object category dataset, Technical Report 7694, Caltech, 2007.

4. Robust classification of objects, faces, and flowers using natural image statistics, In CVPR (2010), 2472–2479.

5. Lazebnik, Schmid and Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, In CVPR (2006), 2169–2178.

6. L.-J. and Li F.-F., What, where and who? Classifying events by scene and object recognition, In ICCV (2007), 1–8.

7. Lazebnik, Schmid and Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, In CVPR (2006).

8. F.-F., Fergus and Perona, Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories, In CVPR Workshop on Generative-Model Based Vision (2004).

9. Wang, Yang, Huang and Gong, Locality constrained linear coding for image classification, In Proc IEEE Conf Comput Vis Pattern Recognit (Jun 2010), 3360–3367.

10. Huang G.B., Zhu Q.Y. and Siew C.K., Extreme learning machine: Theory and applications, Neurocomputing 70 (Dec 2006), 489–501.

11. Huang G.B., Zhou H.M., Ding X.J. and Zhang, Extreme learning machine for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (Apr 2012), 513–529.

12. Mehrizi and Yazdi H.S., Semi-supervised GSOM integrated with extreme learning machine, Intell Data Anal 20(5) (2016), 1115–1132.

13. Chen, Yang, Gao and Yu, Classification of imbalanced bioinformatics data by using boundary movement-based ELM, Bio-Medical Materials and Engineering 26 (2015), S1855–S1862.

14. Zhang and Zhou, Hematocrit estimation using online sequential extreme learning machine, Bio-Medical Materials and Engineering 26 (2015), S2025–S2032.

15. Huang G.B., Wang D.H. and Lan, Extreme learning machines: A survey, Int J Mach Learn Cybern 2(2) (2011), 107–122.

16. Huang G.B., Zhu Q.Y. and Siew C.K., Extreme learning machine: A new learning scheme of feed forward neural networks, In Proceedings of IEEE International Joint Conference on Neural Networks 2 (2004), 985–990.

17. Zheng W.B., Qian Y.T. and Lu H.J., Text categorization based on regularization extreme learning machine, Neural Comput Appl (2012), 1–10.

18. Chung F.R., Spectral graph theory, In CBMS Regional Conference Series in Mathematics 92.

19. Peng, Wang, Long and Lu B.-L., Discriminative graph regularized extreme learning machine and its application to face recognition, Neurocomputing 149 (2015), 340–353.
