Multiple rank multi-linear twin support matrix classification machine

Abstract

In this paper, a novel learning framework, the multiple rank multi-linear twin support matrix classification machine (MRMLTSMCM), is proposed as an extension of the twin support vector machine (TWSVM). Unlike TWSVM, MRMLTSMCM uses two pairs of projecting matrices to construct a pair of functions, from which the decision function is built. Compared with vector-based methods, the matrix-based method not only preserves the structure of matrix data but also reduces computational complexity. In addition, a regularization term is added to improve the performance of MRMLTSMCM, and a novel algorithm for solving MRMLTSMCM is introduced. Finally, experimental results demonstrate the effectiveness of the method in terms of classification accuracy, convergence behaviour and computation time.

Keywords

Classification; matrix learning; tensor learning

1 Introduction

Nowadays, tensors are increasingly common in real applications, for example visual recognition [1, 2], face recognition [3], photorealistic images of palms [4], medical images [5, 6] and so on. A matrix is a second-order tensor, which builds a bridge between the vector and higher-order tensors. Therefore, this paper mainly considers the classification problem with matrix inputs.

However, traditional classification methods are usually vector-based: the input is assumed to be a vector, not a matrix or a tensor. When such methods are applied to matrix data, each matrix has to be transformed into a vector. The most common way is to form a vector by concatenating the rows or columns of the matrix.

Although traditional classification methods have achieved satisfactory results in many cases, they may lack efficiency when matrix data are transformed into vectors [7]. There are several reasons: firstly, the structural information of the matrix data is destroyed; secondly, the computation time increases drastically because the dimensionality of the reformulated vector is often very high; thirdly, the spatial correlations within the matrix are lost when it is collapsed into a vector. Two main types of methods address these problems. The first type projects matrices into vectors to reduce dimensionality, such as principal component analysis (PCA) [8] and linear discriminant analysis (LDA) [9]. Nevertheless, since most of these methods are unsupervised, label information is lost when learning the subspace. The other type, the support tensor machine (STM) [10–14], is an extension of the support vector machine (SVM) [15–18]. STM classifies tensor inputs directly. The computation time is thereby greatly reduced; however, there may be some problems, such as large training error and underfitting. Recently, the multiple rank multi-linear SVM (MRMLSVM) [7] combined SVM with STM and introduced a novel matrix classification model that uses left and right multiple rank projections instead of STM's left and right projecting vectors. MRMLSVM achieves higher accuracy than SVM and STM and requires less computation time than SVM on many matrix data classification tasks. Nevertheless, the results of MRMLSVM on some matrix data, e.g., images of palms, are not satisfactory. As an extension of MRMLSVM, a nonlinear classifier for linearly inseparable multiple rank matrix data, based on a matrix kernel function and named multiple rank multi-linear kernel SVM (MRMLKSVM), was proposed in [19].
Different from MRMLKSVM, this paper focuses on the multiple rank multi-linear structure of matrix data and attempts to improve the performance of matrix data classification by seeking a pair of nonparallel hyperplanes, obtained by solving two smaller programming problems, such that each hyperplane is close to one of the two classes and far from the other. Specifically, this paper introduces a novel matrix classification model that uses left and right multiple rank projections instead of the left and right projecting vectors of twin support tensor machines (TSTM). It is called the multiple rank multi-linear twin support matrix classification machine (MRMLTSMCM). Similar to MRMLSVM, MRMLTSMCM works on matrix inputs, which prevents the structural information of matrix data from being destroyed, and it improves the performance of the model by choosing a suitable rank for the left and right projecting matrices. However, there are some differences: (1) MRMLTSMCM constructs two functions by solving two smaller programming problems; (2) in each objective function of MRMLTSMCM a regularization term is added, whose parameter acts as a weighting factor that determines the tradeoff between the regularization term and the empirical risk. On both vector and matrix inputs, MRMLTSMCM obtains the best classification accuracies among SVM, MRMLSVM, TWSVM, TSTM and MRMLTSMCM on different training sets from the UCI datasets [20].

The paper is organized as follows. Section 2 introduces the background and methods, including linear TWSVM and MRMLSVM, develops the model definition of MRMLTSMCM and the corresponding algorithm, and presents the motivation and an extension. Computational comparisons on UCI datasets are reported in Section 3. Section 4 gives some concluding remarks.

2 Background and methods

2.1 Background

In this section, two representative models related to our method, linear TWSVM and MRMLSVM, are first introduced. For convenience, the following definitions are given:

Definition 1. Vec(·) is the vectorization of a matrix obtained by stacking its columns. For a matrix $A = [A_{1}, A_{2}, \dots, A_{n}] \in R^{m \times n}$, $Vec (A) = [A_{1}^{T}, A_{2}^{T}, \dots, A_{n}^{T}]^{T} \in R^{mn},$ where $A_{i}$ is the i-th column of the matrix A.

Definition 2. Mat(·) is the matrixing of a vector. For a vector $a = [a_{1}^{T}, a_{2}^{T}, \dots, a_{n}^{T}]^{T} \in R^{mn}$, $Mat (a) = [a_{1}, a_{2}, \dots, a_{n}] \in R^{m \times n},$ where $a_{i} \in R^{m}$ is a vector, i = 1, 2, ⋯, n.
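Definitions 1 and 2 can be illustrated with a minimal NumPy sketch. The function names `vec` and `mat` are chosen here for illustration only; the column-wise ordering follows the definitions above.

```python
import numpy as np


def vec(A):
    """Stack the columns of A into one vector (Definition 1)."""
    return np.asarray(A).reshape(-1, order="F")


def mat(a, m, n):
    """Fold a vector of length m*n into an m-by-n matrix, column by column (Definition 2)."""
    a = np.asarray(a)
    if a.size != m * n:
        raise ValueError("length of a must equal m * n")
    return a.reshape((m, n), order="F")


A = np.array([[1, 2, 3],
              [4, 5, 6]])
v = vec(A)              # columns stacked: [1, 4, 2, 5, 3, 6]
A_back = mat(v, 2, 3)   # recovers A
```

With this ordering, `mat` is the inverse of `vec` once the target shape is given.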

Definition 3. Singular value decomposition (SVD). For a matrix $A \in R^{m \times n}$ of rank r, there exist two orthogonal matrices $U \in R^{m \times m}$ and $V \in R^{n \times n}$ such that $A = U \begin{pmatrix} Δ & 0 \\ 0 & 0 \end{pmatrix} V^{T},$ where $Δ = diag (σ_{1}, σ_{2}, \dots, σ_{r})$ and $σ_{1} \geq σ_{2} \geq \dots \geq σ_{r} > 0$ are the singular values of A.

2.2 Twin Support Vector Machine

Formally, let

$T = \{(x_{i}, y_{i}) \mid i = 1, 2, \dots, t\}$ (1)

be a set of vector training samples, where $x_{i} \in R^{n}$ is the input vector and $y_{i} \in \{-1, 1\}$ is the corresponding output; the first l inputs belong to the positive class and the remaining inputs belong to the negative class. Linear TWSVM [21] finds two non-parallel hyperplanes

$w_{1}^{T} x + b_{1} = 0, \quad w_{2}^{T} x + b_{2} = 0,$ (2)

where $w_{1}, w_{2} \in R^{n}$ are the normal vectors and $b_{1}, b_{2} \in R$ are the bias terms of the two hyperplanes, respectively. The two hyperplanes are obtained by solving the following pair of QP problems:

$\begin{array}{ll} \min_{w_{1}, b_{1}, ξ_{j}} & \frac{1}{2} \sum_{i = 1}^{l} (w_{1}^{T} x_{i} + b_{1})^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j}, \\ s.t. & - (w_{1}^{T} x_{j} + b_{1}) \geq 1 - ξ_{j}, \\ & ξ_{j} \geq 0, \; j = l + 1, l + 2, \dots, t, \end{array}$ (3)

and

$\begin{array}{ll} \min_{w_{2}, b_{2}, ξ_{i}} & \frac{1}{2} \sum_{j = l + 1}^{t} (w_{2}^{T} x_{j} + b_{2})^{2} + C_{2} \sum_{i = 1}^{l} ξ_{i}, \\ s.t. & w_{2}^{T} x_{i} + b_{2} \geq 1 - ξ_{i}, \\ & ξ_{i} \geq 0, \; i = 1, 2, \dots, l, \end{array}$ (4)

where $C_{1} > 0$ and $C_{2} > 0$ are the penalty parameters. The slack variables $ξ = (ξ_{1}, ξ_{2}, \dots, ξ_{t})$ play the same role as in SVM. See [21] for more details.

The decision function is

$f (x) = \arg \min_{r = 1, 2} \frac{| w_{r}^{T} x + b_{r} |}{∥ w_{r} ∥},$ (5)

where | · | is the absolute value, and r = 1, 2 corresponds to the labels {+1, -1} of the positive and negative class, respectively.

2.3 MRMLSVM

Let

$T = \{(X_{i}, y_{i}) \mid i = 1, 2, \dots, t\}$ (6)

be a set of matrix training samples, where $X_{i} \in R^{m \times n}$ is the matrix input and $y_{i} \in \{-1, 1\}$ is the corresponding output. To construct the decision function, MRMLSVM [7] employs multiple rank left and right projecting matrices, $U = [u_{1}, u_{2}, \dots, u_{K}]$ and $V = [v_{1}, v_{2}, \dots, v_{K}]$, instead of the single projecting vectors u and v of STM. Using the trace operator, MRMLSVM is formulated as the following optimization problem:

$\begin{array}{ll} \min_{U, V, b, ξ_{i}} & \frac{1}{2} Tr (U V^{T} V U^{T}) + C_{1} \sum_{i = 1}^{t} ξ_{i}, \\ s.t. & y_{i} (Tr (U^{T} X_{i} V) + b) \geq 1 - ξ_{i}, \\ & ξ_{i} \geq 0, \; i = 1, 2, \dots, t, \end{array}$ (7)

where $U = [u_{1}, u_{2}, \dots, u_{K}]$ and $V = [v_{1}, v_{2}, \dots, v_{K}]$ are the left and right projecting matrices. K is a parameter that indicates the rank of U and V and affects the performance of MRMLSVM. Compared with SVM, MRMLSVM has lower computational complexity because it has fewer variables. Compared with STM, MRMLSVM achieves higher accuracy by alleviating the underfitting problem. See [7] for more details.

2.4 MRMLTSMCM

Next, the multiple rank multi-linear twin support matrix classification machine, abbreviated as MRMLTSMCM, is developed to solve the binary classification problem with matrix inputs. For convenience, the training set (6) is rearranged in the following form

$T = \{(X_{1}, y_{1}), \dots, (X_{l}, y_{l}), \dots, (X_{t}, y_{t})\},$ (8)

where $X_{i} \in R^{m \times n}$ is the input and $y_{i} \in \{-1, 1\}$ is the corresponding output; the first l points belong to the positive class and the remaining points belong to the negative class, that is, $y_{i} = 1$ for i = 1, ⋯, l and $y_{i} = -1$ for i = l + 1, ⋯, t. The task is to find the two following functions:

$g_{1} (X) = Tr (U_{1}^{T} X V_{1}) + b_{1}, \quad g_{2} (X) = Tr (U_{2}^{T} X V_{2}) + b_{2},$ (9)

where $U_{1}, U_{2} \in R^{m \times K}$ and $V_{1}, V_{2} \in R^{n \times K}$ represent the left and right projecting matrices, respectively. In order to get the two functions, the two following optimization problems are constructed:

$\begin{array}{ll} \min_{U_{1}, V_{1}, b_{1}, ξ_{j}} & \frac{1}{2} \sum_{i = 1}^{l} (Tr (U_{1}^{T} X_{i} V_{1}) + b_{1})^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} (∥ U_{1} V_{1}^{T} ∥^{2} + b_{1}^{2}), \\ s.t. & - (Tr (U_{1}^{T} X_{j} V_{1}) + b_{1}) \geq 1 - ξ_{j}, \\ & ξ_{j} \geq 0, \; j = l + 1, l + 2, \dots, t, \end{array}$ (10)

and

$\begin{array}{ll} \min_{U_{2}, V_{2}, b_{2}, ξ_{i}} & \frac{1}{2} \sum_{j = l + 1}^{t} (Tr (U_{2}^{T} X_{j} V_{2}) + b_{2})^{2} + C_{2} \sum_{i = 1}^{l} ξ_{i} + \frac{C_{4}}{2} (∥ U_{2} V_{2}^{T} ∥^{2} + b_{2}^{2}), \\ s.t. & Tr (U_{2}^{T} X_{i} V_{2}) + b_{2} \geq 1 - ξ_{i}, \\ & ξ_{i} \geq 0, \; i = 1, 2, \dots, l, \end{array}$ (11)

where U ₁ = [u _1,1, u _1,2, ⋯, u _1,K], U ₂ = [u _2,1, u _2,2, ⋯, u _2,K] ∈R ^m×K, V ₁ = [v _1,1, v _1,2, ⋯, v _1,K], V ₂ = [v _2,1, v _2,2, ⋯, v _2,K] ∈R ^n×K, and u _1,k ∈ R ^m, u _2,k ∈ R ^m, v _1,k ∈ R ⁿ, v _2,k ∈ R ⁿ, k = 1, 2, ⋯, K. The parameter K is the rank of U ₁, U ₂ and V ₁, V ₂. C ₁, C ₂ are the penalty parameters and C ₃, C ₄ are the regularization term parameters.

The pair of optimization problems (10)–(11) is solved by an alternating optimization algorithm. Only problem (10) is considered below, because problem (11) can be solved in the same way.

For the optimization problem (10), firstly, $V_{1}$ is fixed and $U_{1}$ and $b_{1}$ are solved for. Denote

$\bar{f}_{1, i} = [v_{1, 1}^{T} X_{i}^{T}, v_{1, 2}^{T} X_{i}^{T}, \dots, v_{1, K}^{T} X_{i}^{T}]^{T} \in R^{mK},$ (12)

and

$\begin{array}{l} \bar{u}_{1} = [u_{1, 1}^{T}, u_{1, 2}^{T}, \dots, u_{1, K}^{T}]^{T}, \\ \bar{v}_{1} = [v_{1, 1}^{T}, v_{1, 2}^{T}, \dots, v_{1, K}^{T}]^{T}; \end{array}$ (13)

then the optimization problem (10) is transformed as follows:

$\begin{array}{ll} \min_{\bar{u}_{1}, b_{1}, ξ_{j}} & \frac{1}{2} \sum_{i = 1}^{l} (\bar{u}_{1}^{T} \bar{f}_{1, i} + b_{1})^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} (\bar{u}_{1}^{T} D_{1} \bar{u}_{1} + b_{1}^{2}), \\ s.t. & - (\bar{u}_{1}^{T} \bar{f}_{1, j} + b_{1}) \geq 1 - ξ_{j}, \\ & ξ_{j} \geq 0, \; j = l + 1, l + 2, \dots, t, \end{array}$ (14)

where

$D_{1} = (V_{1}^{T} V_{1}) \otimes I_{m \times m},$ (15)

since $∥ U_{1} V_{1}^{T} ∥^{2} = (Vec (U_{1}))^{T} ((V_{1}^{T} V_{1}) \otimes I_{m \times m}) Vec (U_{1}) = \bar{u}_{1}^{T} D_{1} \bar{u}_{1}$. Denote

$u_{1} = D_{1}^{1 / 2} \bar{u}_{1},$ (16)

$f_{1, i} = D_{1}^{- 1 / 2} \bar{f}_{1, i};$ (17)

then the optimization problem (14) is reformulated as follows:

$\begin{array}{ll} \min_{u_{1}, b_{1}, ξ_{j}} & \frac{1}{2} \sum_{i = 1}^{l} (u_{1}^{T} f_{1, i} + b_{1})^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} (∥ u_{1} ∥^{2} + b_{1}^{2}), \\ s.t. & - (u_{1}^{T} f_{1, j} + b_{1}) \geq 1 - ξ_{j}, \\ & ξ_{j} \geq 0, \; j = l + 1, l + 2, \dots, t. \end{array}$ (18)

Problem (18) can be written in matrix form as

$\begin{array}{ll} \min_{u_{1}, b_{1}, ξ} & \frac{1}{2} (A u_{1} + e_{1} b_{1})^{T} (A u_{1} + e_{1} b_{1}) + C_{1} ξ^{T} e_{2} + \frac{C_{3}}{2} (∥ u_{1} ∥^{2} + b_{1}^{2}), \\ s.t. & - (B u_{1} + e_{2} b_{1}) \geq e_{2} - ξ, \\ & ξ \geq 0, \end{array}$ (19)

where $A = [f_{1, 1}, f_{1, 2}, \dots, f_{1, l}]^{T} \in R^{l \times mK}$, $B = [f_{1, l + 1}, f_{1, l + 2}, \dots, f_{1, t}]^{T} \in R^{(t - l) \times mK}$, $e_{1}$ and $e_{2}$ are vectors of ones of appropriate dimensions, and $ξ = [ξ_{l + 1}, ξ_{l + 2}, \dots, ξ_{t}]^{T}$ is the slack vector. The Lagrange function of the optimization problem (19) is given by

$\begin{array}{rl} L (u_{1}, b_{1}, α, β, ξ) = & \frac{1}{2} (A u_{1} + e_{1} b_{1})^{T} (A u_{1} + e_{1} b_{1}) + C_{1} ξ^{T} e_{2} + \frac{C_{3}}{2} (∥ u_{1} ∥^{2} + b_{1}^{2}) \\ & - α^{T} (- B u_{1} - e_{2} b_{1} - e_{2} + ξ) - β^{T} ξ, \end{array}$ (20)

where $α = [α_{l + 1}, α_{l + 2}, \dots, α_{t}]^{T} \geq 0$ and $β = [β_{l + 1}, β_{l + 2}, \dots, β_{t}]^{T} \geq 0$ are the vectors of Lagrange multipliers. The Karush-Kuhn-Tucker (KKT) conditions are given by

$\nabla_{u_{1}} L = A^{T} (A u_{1} + e_{1} b_{1}) + C_{3} u_{1} + B^{T} α = 0,$ (21)

$\nabla_{b_{1}} L = e_{1}^{T} (A u_{1} + e_{1} b_{1}) + C_{3} b_{1} + e_{2}^{T} α = 0,$ (22)

$\nabla_{ξ} L = C_{1} e_{2} - β - α = 0,$ (23)

$- (B u_{1} + e_{2} b_{1}) \geq e_{2} - ξ, \quad ξ \geq 0, \quad α^{T} (- B u_{1} - e_{2} b_{1} - e_{2} + ξ) = 0, \quad β^{T} ξ = 0,$ (24)

$β \geq 0, \quad α \geq 0.$ (25)

From Eqs. (23) and (25) we have $0 \leq α \leq C_{1} e_{2}$. Combining Eqs. (21) and (22), we have

$([A \; e_{1}]^{T} [A \; e_{1}] + C_{3} I) [u_{1}^{T}, b_{1}]^{T} + [B \; e_{2}]^{T} α = 0,$ (26)

where I is an identity matrix of appropriate dimensions. Defining $H = [A \; e_{1}]$ and $G = [B \; e_{2}]$, Eq. (26) gives

$[u_{1}^{T}, b_{1}]^{T} = - (H^{T} H + C_{3} I)^{- 1} G^{T} α,$ (27)

where $C_{3}$ is the regularization parameter, which also keeps the matrix $H^{T} H + C_{3} I$ invertible. Substituting Eq. (27) into Eq. (20) and using Eqs. (21)–(25), we obtain the dual problem of the optimization problem (19):

$\begin{array}{ll} \min_{α} & \frac{1}{2} α^{T} G (H^{T} H + C_{3} I)^{- 1} G^{T} α - e_{2}^{T} α, \\ s.t. & e_{2}^{T} α \geq 0, \\ & 0 \leq α \leq C_{1} e_{2}. \end{array}$ (28)

The multiplier α is obtained by solving the QP problem (28), and $u_{1}$ and $b_{1}$ then follow from Eq. (27). From Eqs. (13), (15) and (16), we get

$U_{1} = Mat (((V_{1}^{T} V_{1}) \otimes I_{m \times m})^{- 1 / 2} u_{1}).$ (29)

Secondly, $U_{1}$ is fixed and $V_{1}$ and $b_{1}$ are optimized in the same way as above. Here we only give the resulting optimization problem when $U_{1}$ is fixed:

$\begin{array}{ll} \min_{\hat{α}} & \frac{1}{2} \hat{α}^{T} \hat{G} (\hat{H}^{T} \hat{H} + C_{3} I)^{- 1} \hat{G}^{T} \hat{α} - e_{2}^{T} \hat{α}, \\ s.t. & e_{2}^{T} \hat{α} \geq 0, \\ & 0 \leq \hat{α} \leq C_{1} e_{2}, \end{array}$ (30)

where the matrix $\hat{A} = [g_{1, 1}, g_{1, 2}, \dots, g_{1, l}]^{T} \in R^{l \times nK}$, the matrix $\hat{B} = [g_{1, l + 1}, g_{1, l + 2}, \dots, g_{1, t}]^{T} \in R^{(t - l) \times nK}$,

$g_{1, i} = \begin{bmatrix} X_{i}^{T} u_{1, 1} \\ X_{i}^{T} u_{1, 2} \\ \vdots \\ X_{i}^{T} u_{1, K} \end{bmatrix} \in R^{nK},$ (31)

and

$[v_{1}^{T}, b_{1}]^{T} = - (\hat{H}^{T} \hat{H} + C_{3} I)^{- 1} \hat{G}^{T} \hat{α},$ (32)

where $\hat{H} = [\hat{A} \; e_{1}]$ and $\hat{G} = [\hat{B} \; e_{2}]$. According to Eq. (32), we get

$V_{1} = Mat (((U_{1}^{T} U_{1}) \otimes I_{n \times n})^{- 1 / 2} v_{1}).$ (33)
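The change of variables in Eqs. (12)–(15) rests on two identities: $Tr (U_{1}^{T} X V_{1}) = \bar{u}_{1}^{T} \bar{f}_{1}$ and $∥ U_{1} V_{1}^{T} ∥^{2} = \bar{u}_{1}^{T} D_{1} \bar{u}_{1}$. The following sketch checks both numerically on random data; the sizes and the random inputs are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 4, 3, 2
X = rng.standard_normal((m, n))
U = rng.standard_normal((m, K))
V = rng.standard_normal((n, K))


def vec(A):
    """Column-wise vectorization, as in Definition 1."""
    return A.reshape(-1, order="F")


u_bar = vec(U)                       # Eq. (13): stacked columns of U
f_bar = vec(X @ V)                   # Eq. (12): [X v_1; ...; X v_K]
D = np.kron(V.T @ V, np.eye(m))      # Eq. (15)

lhs_trace = np.trace(U.T @ X @ V)    # Tr(U^T X V)
rhs_trace = u_bar @ f_bar
lhs_norm = np.linalg.norm(U @ V.T) ** 2   # squared Frobenius norm of U V^T
rhs_norm = u_bar @ D @ u_bar
```

Both pairs agree up to floating-point round-off, which is why problem (10) with $V_{1}$ fixed becomes the vector problem (14).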

Next, the above two steps are repeated until $V_{1}$ and $U_{1}$ converge, at which point $b_{1}$ converges as well. $U_{2}$, $V_{2}$ and $b_{2}$ are obtained in the same way by solving the optimization problem (11).

Finally, the decision function is constructed as

$f (X) = \arg \min_{r = 1, 2} \frac{| Tr (U_{r}^{T} X V_{r}) + b_{r} |}{∥ Vec (U_{r} V_{r}^{T}) ∥}.$ (34)
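As an illustration of the decision function (34), the sketch below assigns a matrix to the class whose function value is closer to zero after normalization. The function name `predict`, the tie-breaking rule (ties go to the positive class) and the toy projecting matrices are assumptions made for this example.

```python
import numpy as np


def predict(X, U1, V1, b1, U2, V2, b2):
    """Return +1 or -1 according to Eq. (34)."""
    # np.linalg.norm of a 2-D array is the Frobenius norm, i.e. ||Vec(U V^T)||.
    d1 = abs(np.trace(U1.T @ X @ V1) + b1) / np.linalg.norm(U1 @ V1.T)
    d2 = abs(np.trace(U2.T @ X @ V2) + b2) / np.linalg.norm(U2 @ V2.T)
    return 1 if d1 <= d2 else -1      # tie-breaking is an assumption


# Toy rank-one example: g1(X) = X[0, 0] and g2(X) = X[1, 1] - 1.
U1 = V1 = np.array([[1.0], [0.0]])
U2 = V2 = np.array([[0.0], [1.0]])
label_a = predict(np.array([[0.1, 0.0], [0.0, 5.0]]), U1, V1, 0.0, U2, V2, -1.0)
label_b = predict(np.array([[3.0, 0.0], [0.0, 1.1]]), U1, V1, 0.0, U2, V2, -1.0)
```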

The above procedure is summarized in the following algorithm:

Algorithm 1

The procedure of MRMLTSMCM.

1. For the training set (8) and the parameter K ≤ min {m, n}, select the penalty parameter $C_{1} > 0$ and the regularization parameter $C_{3} > 0$;

2. Let k = 0 and initialize $V_{1} \leftarrow V_{1}^{(k)}$;

3. Solve the QP problem (28) to get the solution $α^{*}$, and compute

$[(u_{1}^{(k)})^{T}, b_{1}^{(k)}]^{T} = - (H^{T} H + C_{3} I)^{- 1} G^{T} α^{*}$

and

$U_{1}^{(k)} = Mat ((((V_{1}^{(k)})^{T} V_{1}^{(k)}) \otimes I_{m \times m})^{- 1 / 2} u_{1}^{(k)});$

4. Set $U_{1} \leftarrow U_{1}^{(k)}$, solve the QP problem (30) to get the solution $\hat{α}^{*}$, and compute

$[(v_{1}^{(k + 1)})^{T}, b_{1}^{(k + 1)}]^{T} = - (\hat{H}^{T} \hat{H} + C_{3} I)^{- 1} \hat{G}^{T} \hat{α}^{*}$

and

$V_{1}^{(k + 1)} = Mat (((U_{1}^{T} U_{1}) \otimes I_{n \times n})^{- 1 / 2} v_{1}^{(k + 1)});$

5. If $∥ U_{1}^{(k + 1)} (V_{1}^{(k + 1)})^{T} - U_{1}^{(k)} (V_{1}^{(k)})^{T} ∥ \leq ε$, stop and set $U_{1} \leftarrow U_{1}^{(k + 1)}, V_{1} \leftarrow V_{1}^{(k + 1)}, b_{1} \leftarrow b_{1}^{(k + 1)}$; otherwise set k ← k + 1 and go to step 3;

6. Compute $U_{2}$, $V_{2}$, $b_{2}$ by solving the optimization problem (11) similarly;

7. Output the decision function: $f (X) = \arg \min_{r = 1, 2} \frac{| Tr (U_{r}^{T} X V_{r}) + b_{r} |}{∥ Vec (U_{r} V_{r}^{T}) ∥}$.
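The alternating procedure for problem (10) can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the authors' MATLAB implementation:

- The box-constrained dual (28) is solved with SciPy's general-purpose L-BFGS-B routine instead of a dedicated QP solver. The constraint $e_{2}^{T} α \geq 0$ is implied by $α \geq 0$ and is therefore not imposed separately.
- $V_{1}$ is initialized randomly.
- A small floor on the eigenvalues guards the inverse square root.
- The stopping test compares successive products $U_{1} V_{1}^{T}$ after each full sweep.
- The names `inv_sqrt_psd`, `half_step` and `fit_function` are hypothetical.

The $V_{1}$ update reuses the $U_{1}$ update on transposed inputs, since $Tr (U^{T} X V) = Tr (V^{T} X^{T} U)$.

```python
import numpy as np
from scipy.optimize import minimize


def inv_sqrt_psd(S, floor=1e-10):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(S)
    w = np.maximum(w, floor)              # safeguard (assumption): avoid division by zero
    return (Q / np.sqrt(w)) @ Q.T


def half_step(X_near, X_far, P, C1, C3):
    """Fix the right factor P, solve the dual (28), return the left factor and bias.

    X_near: matrices whose function value should be close to 0.
    X_far:  matrices whose function value should be at most -1.
    """
    rows, K = X_near[0].shape[0], P.shape[1]
    S = inv_sqrt_psd(P.T @ P)
    # f_i = D^{-1/2} f_bar_i equals Vec(X P (P^T P)^{-1/2}), cf. Eqs. (12), (15), (17).
    feat = lambda X: (X @ P @ S).reshape(-1, order="F")
    A = np.stack([feat(X) for X in X_near])
    B = np.stack([feat(X) for X in X_far])
    H = np.hstack([A, np.ones((len(A), 1))])
    G = np.hstack([B, np.ones((len(B), 1))])
    M = np.linalg.solve(H.T @ H + C3 * np.eye(H.shape[1]), G.T)
    Q = G @ M
    Q = 0.5 * (Q + Q.T)                   # symmetrize against round-off
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(), np.zeros(len(B)),
                   jac=lambda a: Q @ a - 1.0, method="L-BFGS-B",
                   bounds=[(0.0, C1)] * len(B))
    z = -M @ res.x                        # Eq. (27): stacked [u; b]
    left = z[:-1].reshape((rows, K), order="F") @ S   # Eq. (29)
    return left, z[-1]


def fit_function(X_near, X_far, K, C1, C3, max_iter=5, tol=1e-4, seed=0):
    """Alternate between the U-update and the V-update for problem (10)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((X_near[0].shape[1], K))  # random start (assumption)
    Xn_T = [X.T for X in X_near]
    Xf_T = [X.T for X in X_far]
    W_old = None
    for _ in range(max_iter):
        U, b = half_step(X_near, X_far, V, C1, C3)    # fix V, update U
        V, b = half_step(Xn_T, Xf_T, U, C1, C3)       # fix U, update V
        W = U @ V.T
        if W_old is not None and np.linalg.norm(W - W_old) <= tol:
            break
        W_old = W
    return U, V, b


# Toy data: the positive class is centred at the zero matrix, the negative class at all ones.
rng = np.random.default_rng(1)
pos = [0.1 * rng.standard_normal((4, 3)) for _ in range(20)]
neg = [1.0 + 0.1 * rng.standard_normal((4, 3)) for _ in range(20)]
U1, V1, b1 = fit_function(pos, neg, K=2, C1=1.0, C3=0.1)
```

Under the same assumptions, the second function can be obtained by calling `fit_function(neg, pos, ...)` and negating the returned left factor and bias. This turns the constraint of problem (10) into that of problem (11) and leaves the objective unchanged.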

2.5 Motivation of MRMLTSMCM

A number of motivations for MRMLTSMCM are outlined here.

2.5.1 Efficiency

For matrix data, MRMLTSMCM uses the left and right projecting matrices $U_{1}$, $V_{1}$ and $U_{2}$, $V_{2}$ instead of the traditional projecting vector w. Its efficiency shows in two respects. First, compared with TWSVM, the number of variables is reduced from 2 (mn + 1) to 2K (m + n) + 2, which is a reduction whenever K (m + n) < mn. Second, reducing the number of variables can help alleviate overfitting and improve classification accuracy.
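As a worked example of these counts, the sketch below evaluates both formulas for the 48 × 96 Pedestrian inputs of Table 1 with K = 2, the rank used in the experiments. The function names are illustrative.

```python
def n_variables_twsvm(m, n):
    """Variables of TWSVM on vectorized m-by-n inputs: two pairs (w, b)."""
    return 2 * (m * n + 1)


def n_variables_mrmltsmcm(m, n, K):
    """Variables of MRMLTSMCM: two triples (U, V, b) with U in R^{m x K}, V in R^{n x K}."""
    return 2 * K * (m + n) + 2


twsvm_count = n_variables_twsvm(48, 96)            # 2 * (4608 + 1) = 9218
matrix_count = n_variables_mrmltsmcm(48, 96, 2)    # 2 * 2 * 144 + 2 = 578
```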

2.5.2 Regularization

The significance of the regularization terms is discussed through a comparison with TWSVM. Without the regularization terms $\frac{C_{3}}{2} (∥ U_{1} V_{1}^{T} ∥^{2} + b_{1}^{2})$ and $\frac{C_{4}}{2} (∥ U_{2} V_{2}^{T} ∥^{2} + b_{2}^{2})$ in the optimization problems (10) and (11), the dual problems would be formulated as follows:

$\begin{array}{ll} \min_{α} & \frac{1}{2} α^{T} G (H^{T} H)^{- 1} G^{T} α - e_{2}^{T} α, \\ s.t. & e_{2}^{T} α \geq 0, \\ & 0 \leq α \leq C_{1} e_{2}, \end{array}$ (35)

and

$\begin{array}{ll} \min_{γ} & \frac{1}{2} γ^{T} H (G^{T} G)^{- 1} H^{T} γ - e_{1}^{T} γ, \\ s.t. & e_{1}^{T} γ \geq 0, \\ & 0 \leq γ \leq C_{2} e_{1}. \end{array}$ (36)

To ensure that the optimization problems (35)–(36) are solvable, the matrices $H^{T} H$ and $G^{T} G$ must not be singular. A usual way is to replace the inverse matrices $(H^{T} H)^{- 1}$ and $(G^{T} G)^{- 1}$ approximately by $(H^{T} H + εI)^{- 1}$ and $(G^{T} G + εI)^{- 1}$ respectively, where I is an identity matrix of appropriate dimensions and ε is a positive scalar, chosen small so as to keep the structure of the data. In the QP problem (28), the regularization term $\frac{C_{3}}{2} (∥ U_{1} V_{1}^{T} ∥^{2} + b_{1}^{2})$ serves the same purpose. Furthermore, there is an essential difference in significance [22]: the parameters $C_{3}$ and $C_{4}$ act as weighting factors that determine the tradeoff between the regularization term and the empirical risk. Hence, selecting suitable parameters $C_{3}$, $C_{4}$ reflects the structural risk minimization principle and, at the same time, improves the classification accuracy.
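The role of the regularization parameter in keeping the system solvable can be seen in a small numerical sketch. The sizes below (5 samples, 12 features) and the value of `C3` are arbitrary choices for illustration: with fewer samples than features, $H^{T} H$ is rank deficient, while $H^{T} H + C_{3} I$ has all eigenvalues bounded below by $C_{3}$.

```python
import numpy as np

rng = np.random.default_rng(0)
l, d = 5, 12                               # fewer samples than features
A = rng.standard_normal((l, d))
H = np.hstack([A, np.ones((l, 1))])        # H = [A e_1]
gram = H.T @ H                             # (d + 1) x (d + 1), rank at most l

C3 = 0.1
rank_plain = np.linalg.matrix_rank(gram)   # below d + 1, so gram is singular
min_eig_reg = np.linalg.eigvalsh(gram + C3 * np.eye(d + 1)).min()
```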

2.5.3 Convergence behaviour

Three propositions about the convergence behaviour are given below to support Algorithm 1. Propositions 1 and 2 are established in a similar way to [7]; see the Appendix for the detailed proofs. Proposition 3 leads to the stopping condition of Algorithm 1.

Proposition 1. When $V_{1}$ is fixed, the $U_{1}$ and $b_{1}$ obtained from the solution of the QP problem (19) are the global solution of the optimization problem (10). Similarly, when $U_{1}$ is fixed, the resulting $V_{1}$ and $b_{1}$ are the global solution of the optimization problem (10).

Proposition 2. The iterative procedure of Algorithm 1 finds a solution that satisfies the constraints of the optimization problem (10) and monotonically decreases the objective function of the optimization problem (10).

Proposition 3. The objective function of the optimization problem (10) converges when $U_{1} V_{1}^{T}$ converges.

Proof of Proposition 3. Since $Tr (U^{T} X_{i} V) = Tr (V U^{T} X_{i}) = (Vec (U V^{T}))^{T} Vec (X_{i}),$ $Vec (U_{1} V_{1}^{T})$ can be regarded as a projecting vector, and the optimization problem (10) is equivalent to the following traditional vector-based formulation:

$\begin{array}{ll} \min_{U_{1}, V_{1}, b_{1}, ξ_{j}} & \frac{1}{2} \sum_{i = 1}^{l} ((Vec (U_{1} V_{1}^{T}))^{T} Vec (X_{i}) + b_{1})^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} (∥ Vec (U_{1} V_{1}^{T}) ∥^{2} + b_{1}^{2}), \\ s.t. & - ((Vec (U_{1} V_{1}^{T}))^{T} Vec (X_{j}) + b_{1}) \geq 1 - ξ_{j}, \\ & ξ_{j} \geq 0, \; j = l + 1, l + 2, \dots, t. \end{array}$ (37)

Clearly, when $Vec (U_{1} V_{1}^{T})$ converges, the objective function of problem (37) converges; since the optimization problem (10) is equivalent to problem (37), the objective function of the optimization problem (10) converges too. The convergence speed of the objective function of the optimization problem (10) is closely related to the number of iterations. More details are analyzed in the experiments section.

2.6 Extension

In many cases, the input is more naturally represented as a matrix or a higher-order tensor. For example, colour images are naturally represented as third-order tensors. For an n-th order tensor of the form $\prod_{j = 1}^{n} \otimes u^{(j)}$, the F-norm and the mode product are defined below. Let

$T = \{(X_{1}, y_{1}), \dots, (X_{l}, y_{l}), \dots, (X_{t}, y_{t})\}$ (38)

be a set of tensor training samples, where $X_{i} \in R^{m_{1} \times m_{2} \times \dots \times m_{n}}$ is the input and $y_{i} \in \{-1, 1\}$ is the corresponding output; the first l points belong to the positive class and the remaining points belong to the negative class, that is, $y_{i} = 1$ for i = 1, ⋯, l and $y_{i} = -1$ for i = l + 1, ⋯, t.

The F-norm and the mode product are given by

$∥ \prod_{j = 1}^{n} \otimes u^{(j)} ∥^{2} = \sum_{p_{1} = 1}^{m_{1}} \sum_{p_{2} = 1}^{m_{2}} \dots \sum_{p_{n} = 1}^{m_{n}} (u_{p_{1}}^{(1)} u_{p_{2}}^{(2)} \dots u_{p_{n}}^{(n)})^{2} = ∥ Vec (\prod_{j = 1}^{n} \otimes u^{(j)}) ∥^{2}$

and

$X \times_{1} u^{(1)} \times_{2} u^{(2)} \dots \times_{n} u^{(n)} = \sum_{p_{1} = 1}^{m_{1}} \sum_{p_{2} = 1}^{m_{2}} \dots \sum_{p_{n} = 1}^{m_{n}} X_{p_{1}, p_{2}, \dots, p_{n}} u_{p_{1}}^{(1)} u_{p_{2}}^{(2)} \dots u_{p_{n}}^{(n)} = (Vec (\prod_{j = 1}^{n} \otimes u^{(j)}))^{T} Vec (X),$

where $\times_{j}$ is the mode-j product between a tensor and a vector and Vec (·) is the vectorization of a tensor. See [23] for more details.
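For a third-order tensor, the two expressions for the mode product can be checked numerically with `numpy.einsum`. The tensor size and the random data below are arbitrary choices for illustration. The last two lines use the additional standard fact that the squared F-norm of an outer product equals the product of the squared vector norms.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4))
u1 = rng.standard_normal(2)
u2 = rng.standard_normal(3)
u3 = rng.standard_normal(4)

# X x_1 u1 x_2 u2 x_3 u3: contract every mode of X with its vector.
contracted = np.einsum("pqr,p,q,r->", X, u1, u2, u3)

# Same value as the inner product of Vec(u1 (x) u2 (x) u3) and Vec(X).
outer = np.einsum("p,q,r->pqr", u1, u2, u3)
inner = outer.reshape(-1) @ X.reshape(-1)

norm_sq = np.sum(outer ** 2)                             # squared F-norm
norm_sq_factored = (u1 @ u1) * (u2 @ u2) * (u3 @ u3)     # product of squared norms
```

Both vectorizations use the same element order, so the inner product does not depend on whether that order is row-major or column-major.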

The classification models for the training set (38), corresponding to the optimization problems (10)–(11), are

$\begin{array}{ll} \min_{u_{1}, b_{1}, ξ_{j}} & \frac{1}{2} \sum_{i = 1}^{l} (X_{i} \times_{1} u_{1}^{(1)} \times_{2} u_{1}^{(2)} \dots \times_{n} u_{1}^{(n)} + b_{1})^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} (∥ \prod_{j = 1}^{n} \otimes u_{1}^{(j)} ∥^{2} + b_{1}^{2}), \\ s.t. & - (X_{j} \times_{1} u_{1}^{(1)} \times_{2} u_{1}^{(2)} \dots \times_{n} u_{1}^{(n)} + b_{1}) \geq 1 - ξ_{j}, \\ & ξ_{j} \geq 0, \; j = l + 1, l + 2, \dots, t, \end{array}$ (39)

and

$\begin{array}{ll} \min_{u_{2}, b_{2}, ξ_{i}} & \frac{1}{2} \sum_{j = l + 1}^{t} (X_{j} \times_{1} u_{2}^{(1)} \times_{2} u_{2}^{(2)} \dots \times_{n} u_{2}^{(n)} + b_{2})^{2} + C_{2} \sum_{i = 1}^{l} ξ_{i} + \frac{C_{4}}{2} (∥ \prod_{j = 1}^{n} \otimes u_{2}^{(j)} ∥^{2} + b_{2}^{2}), \\ s.t. & X_{i} \times_{1} u_{2}^{(1)} \times_{2} u_{2}^{(2)} \dots \times_{n} u_{2}^{(n)} + b_{2} \geq 1 - ξ_{i}, \\ & ξ_{i} \geq 0, \; i = 1, 2, \dots, l, \end{array}$ (40)

where $u_{1} = \{u_{1}^{(1)}, u_{1}^{(2)}, \dots, u_{1}^{(n)}\}$ and $u_{2} = \{u_{2}^{(1)}, u_{2}^{(2)}, \dots, u_{2}^{(n)}\}$. The decision function is

$f (X) = \arg \min_{r = 1, 2} \frac{| X \times_{1} u_{r}^{(1)} \times_{2} u_{r}^{(2)} \dots \times_{n} u_{r}^{(n)} + b_{r} |}{∥ \prod_{j = 1}^{n} \otimes u_{r}^{(j)} ∥}.$ (41)

3 Numerical Experiments

In this section, different benchmark data sets are used to train the algorithm introduced in Section 2, and MRMLTSMCM is compared with several other methods. Classification accuracy, convergence behaviour and CPU time are used to evaluate our method. Finally, some experimental results with different values of the parameter K are given.

3.1 Data Sets and Evaluation Index

Ten data sets from the UCI Repository of machine learning databases [20] are used: Sonar, CMC, Hill-vally, Ionosphere, Madelon, Pedestrian, Pollen, FingerDB, Binucleate and RGB. The inputs of the first five data sets are vectors, and those of the remaining five are matrices. In particular, Binucleate is a high dimensional matrix data set and RGB is an image data set with matrix structure. The details of the ten data sets are given in Table 1.

Table 1
Characteristics of the data sets

Dataset	#Size(t)	#Scale	#Class number
Sonar	208	60	2
CMC	1473	9	2
Hill-vally	606	100	2
Ionosphere	350	34	2
Madelon	2000	500	2
Pedestrian	2000	48 × 96	2
Pollen	630	25 × 25	7
FingerDB	80	300 × 300	10
Binucleate	40	1204 × 1280	2
RGB	443	720 × 960	40

#Size is the number of data points; #Scale is the dimension of the inputs; #Class number is the number of classes.

All methods are implemented in MATLAB R2013b on a PC with an Intel(R) processor (2.83 GHz) and 2 GB of RAM.

The ‘Accuracy’ used to evaluate the methods is defined as follows:

$Accuracy = (TP + TN) / (TP + FP + TN + FN),$ (42)

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. The classification accuracy of each method is measured by five-fold cross-validation. The training CPU time is used to evaluate the computational complexity of the methods. The parameters $C_{1}$, $C_{2}$, $C_{3}$, $C_{4}$ are selected from the set $\{10^{-i}, i = -7, -6, \dots, 7\}$, with $C_{1} = C_{2}$ and $C_{3} = C_{4}$.
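The evaluation protocol can be sketched as follows. The accuracy function follows Eq. (42) and the grid is the parameter set stated above. The fold construction is an assumption: the text only states that five-fold cross-validation is used, so the random shuffling and the helper name `five_fold_indices` are illustrative.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Eq. (42): (TP + TN) / (TP + FP + TN + FN) for labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    return (tp + tn) / (tp + fp + tn + fn)


def five_fold_indices(t, seed=0):
    """Split the indices 0..t-1 into five folds after shuffling (assumption)."""
    perm = np.random.default_rng(seed).permutation(t)
    return np.array_split(perm, 5)


# Candidate values 10^-7, ..., 10^7 for C1 = C2 and for C3 = C4.
param_grid = [10.0 ** i for i in range(-7, 8)]

example = accuracy([1, 1, -1, -1], [1, -1, -1, -1])   # 3 of 4 correct
folds = five_fold_indices(208)                         # e.g. the 208 Sonar samples
```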

3.2 Accuracy

In order to compare the classification accuracy of MRMLTSMCM with that of SVM, MRMLSVM, TWSVM and TSTM, two groups of data sets, vector data sets and matrix data sets, are chosen. The results of the numerical experiments are summarized in Tables 2 and 3, respectively. The inputs of the vector data sets are reformulated into matrix inputs by the function Mat (·) when MRMLSVM, TSTM and MRMLTSMCM are applied (the resulting matrix sizes for the vector data sets are listed in Table 6 in the Appendix).

Table 2
The classification accuracies for the vector data set, black font denotes best accuracy

Data set	L-SVM Accuracy % Time(s)	MRMLSVM Accuracy % Time(s)	TWSVM Accuracy % Time(s)	TWSTM Accuracy % Time(s)	MRMLTSMCM Accuracy % Time(s)
Sonar	51.88±0.1300	65.18±0.1400	74.28±0.0177	74.50±0.0317	77.08±0.0303
	2.356484	29.340054	0.811871	11.942495	13.804657
CMC	67.25±0.0584	67.78±0.0075	67.00±0.0034	67.31±0.0016	68.28±0.0021
	0.204572	2.054612	0.138147	0.719547	1.513576
Hill-vally	47.73±0.0696	48.27±0.0826	42.90±0.0347	59.40±0.0894	61.67±0.0562
	0.509951	6.0417437	0.143041	0.771425	2.851150
Ionosphere	57.17±0.0103	57.59±0.0273	57.09±0.0292	57.93±0.0117	61.05±0.0155
	0.287757	3.022703	0.154994	0.789471	1.333612
Madelon	49.97±0.0002	52.13±0.0825	49.80±0.0056	50.53±0.0451	52.53±0.0798
	2.401572	6.940356	1.389760	8.102103	14.079382

For each data set there are two rows: the first gives the classification accuracies of the different methods and the second gives the training time. L-SVM and TWSVM report the CPU time of a single run, while MRMLSVM, TSTM and MRMLTSMCM report the CPU time of 5 iterations over $U_{1}$, $U_{2}$, $V_{1}$, $V_{2}$ with K = 2. The same applies to Table 3.

Table 3

The classification accuracies for the matrix data set, black font denotes best accuracy

Data set	L-SVM Accuracy % Time(s)	MRMLSVM Accuracy % Time(s)	TWSVM Accuracy % Time(s)	TWSTM Accuracy % Time(s)	MRMLTSMCM Accuracy % Time(s)
Pedestrian	49.60±2.6400	52.95±0.0574	49.87±0.0682	56.27±0.0753	61.75±0.0963
	0.31114	7.736319	7.753787	5.118282	28.499343
Pollen(1vs2)	96.67±0.0556	96.87±0.0709	82.19±0.1200	96.61±0.1200	97.13±0.0869
	1.44638	17.430457	20.3303	10.58888	17.696126
Pollen(2vs7)	69.19±0.0328	75.31±1.0300	79.81±0.0299	88.54±0.0244	89.07±0.0130
	1.029090	18.173509	0.331786	10.882776	22.341175
FingerDB	53.44±0.3465	50.83±0.0694	□	93.79±0.0000	98.13±0.0820
	0.5167	8.578828	□	2.671822	26.159872
Binucleate	100±0.0000	100±0.0000	100±0.0000	100±0.0000	100±0.0000
	0.52220	48.8520	10.406616	5.743396	22.872993

For the Pollen data set, we choose the 1st class vs the 2nd class, and the 2nd class vs the 7th class. For the FingerDB data set, we choose the 1st class vs the 8th class. For the Binucleate data set, singular value decomposition is first applied to each data point to obtain the singular values σ ₁, σ ₂, ⋯; the new input is a diagonal matrix formed by the first 100 singular values. □ means that this experiment could not be run under the available computing conditions.
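The Binucleate preprocessing described in the note above can be sketched as follows. The function name `svd_diagonal_input` and the small test matrix are illustrative; `r=100` matches the number of singular values stated in the note.

```python
import numpy as np


def svd_diagonal_input(X, r=100):
    """Replace X by the diagonal matrix of its r largest singular values."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values in descending order
    return np.diag(s[:r])


# Small stand-in for one image; the real inputs are much larger (see Table 1).
X = np.diag([3.0, 1.0, 2.0])
X_new = svd_diagonal_input(X, r=2)           # 2 x 2 diagonal matrix
```

If a matrix has fewer than `r` singular values, the slice simply returns all of them, so the output is then smaller than r × r.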

Table 2 lists the classification results of the five methods on the five vector data sets, with the best accuracy shown in boldface. MRMLTSMCM achieves the highest accuracy on all five vector data sets. To investigate how the way of folding a vector into a matrix influences the accuracy, we choose two data sets, Sonar and Hill-vally, and train on each with different folding ways. The results are listed in Tables 4 and 5, respectively. Comparing the entries within each row of the two tables shows that the folding way affects the classification accuracy, while the best accuracies of the individual columns differ only slightly.

Table 4

The classification accuracies for Sonar in different folding ways

Folding way	4 × 15 Accuracy %	10 × 6 Accuracy %	12 × 5 Accuracy %
K = 1	70.41±0.0740	74.49±0.0317	76.66±0.0155
K = 2	76.86±0.0359	77.38±0.0303	77.17±0.0187
K = 3	76.44±0.0421	79.79±0.0355	77.55±0.0235
K = 4	74.55±0.0257	79.78±0.0229	76.77±0.0282
K = 5	–	79.80±0.0460	77.11±0.0389

– means that this experiment has not been done.

Table 5

The classification accuracies for Hill-vally in different folding ways

Folding way	10 × 10 Accuracy %	20 × 5 Accuracy %	25 × 4 Accuracy %
K = 1	59.40±0.0894	57.10±0.0400	57.00±0.0923
K = 2	61.07±0.0873	57.67±0.0700	43.33±0.1100
K = 3	61.67±0.0562	55.20±0.1200	63.63±0.0678
K = 4	62.27±0.0566	59.27±0.0300	60.50±0.0683

In Table 3, the accuracy of MRMLTSMCM is also at least as good as that of the other methods on all data sets, and we obtain a clearly better result than SVM and MRMLSVM for FingerDB. In particular, for Binucleate, we first perform a singular value decomposition of each data point to obtain the singular values σ ₁, σ ₂, ⋯; the new input is a diagonal matrix formed by the first 100 singular values. On this new data set, every method obtains 100% accuracy.
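The singular-value preprocessing described for Binucleate can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `svd_diag_input` is hypothetical, NumPy is assumed, and only the cut-off of 100 singular values comes from the text.

```python
import numpy as np

def svd_diag_input(X, k=100):
    """Replace a matrix sample by the diagonal matrix of its k largest singular values.

    numpy.linalg.svd returns singular values in descending order, so s[:k]
    keeps the leading ones. If X has fewer than k singular values, all are kept.
    """
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return np.diag(s[:k])

# Toy check on a matrix whose singular values are known by construction.
X_small = np.diag([3.0, 1.0, 2.0])
print(svd_diag_input(X_small, k=2))

# A random stand-in for one image (the size is illustrative only).
rng = np.random.default_rng(0)
Z = svd_diag_input(rng.normal(size=(120, 150)), k=100)
print(Z.shape)
```

The result is a square diagonal matrix, so every sample has the same shape regardless of the original image size, provided each image has at least k singular values.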

The training times on the data sets of different scales are listed in Tables 2 and 3. For L-SVM and TSVM, the CPU time of a single run is recorded, while for MRMLSVM, TSTM and MRMLTSMCM, the recorded CPU time covers 5 iterations of U ₁, U ₂, V ₁, V ₂ under the condition K = 2. The CPU times of MRMLSVM, TSTM and MRMLTSMCM are compared and analyzed as follows. In Table 2, MRMLSVM takes the most CPU time except on the Madelon data set; in Table 3, however, MRMLTSMCM takes more CPU time than MRMLSVM except on the Binucleate data set. Of the three, TSTM performs best in terms of CPU time.

Finally, the RGB data set is a set of 40 different classes of colour leaf images. When each image is digitized, a 3rd-order tensor is obtained. Unfolding the tensor yields three sub-matrices, and the first matrix is chosen as the training point. We then reduce the dimensions of each image by reducing its pixel size from 720 × 960 to 50 × 66. Seven classes of leaf images are selected for the experiments; one image from each class is displayed in turn in Fig. 1. They are the 11th (Acer palmaturu), the 18th (Papaver sp), the 20th (Pinus sp), the 31st (Podocarpus sp), the 34th (Pseudosasa japonica), the 35th (Magnolia grandiflora) and the 39th (Schinus terebinthifolius), respectively. The 11th class is an obviously broad leaf, whereas the other classes are not broad leaves, and they become gradually broader from the 20th class to the 39th class. The results are shown in Fig. 2. From the figure, it can be seen that as the difference between classes decreases, the accuracy also decreases, but it stays above 80%.

Fig.1

Seven different classes of colour leaf images: the 11th (Acer palmaturu), the 18th (Papaver sp), the 20th (Pinus sp), the 31st (Podocarpus sp), the 34th (Pseudosasa japonica), the 35th (Magnolia grandiflora) and the 39th (Schinus terebinthifolius), respectively.

Fig.2

The Accuracy-axis represents the classification accuracies for different kinds of leaf images from the RGB data set, where the 11th class is an obviously broad leaf and the other classes are needle-like, trending gradually toward the 18th class from the 20th class to the 39th class. The K-axis represents the rank of U ₁, U ₂ and V ₁, V ₂.

3.3 Convergence Behaviour

In this section, the convergence behaviour corresponding to the accuracy experiments is illustrated. From Proposition 3, we know that the objective function of the optimization problem (10) converges when $U_{1} V_{1}^{T}$ converges, and similarly for the optimization problem (11). To show the convergence behaviour of the optimization problems (10) and (11), ${ERR}_{1} = ∥ U_{1}^{(s + 1)} {V_{1}^{(s + 1)}}^{T} - U_{1}^{(s)} {V_{1}^{(s)}}^{T} ∥$ and ${ERR}_{2} = ∥ U_{2}^{(s + 1)} {V_{2}^{(s + 1)}}^{T} - U_{2}^{(s)} {V_{2}^{(s)}}^{T} ∥$ are used to reflect the convergence tendencies. There are twelve convergence experiments: the first ten correspond to the accuracy experiments listed in Tables 2 and 3, and the last two concern the RGB data set. All the experimental results are depicted in Figs. 3-5, grouped by kind of data set. The ERR-axis denotes the value of ERR ₁ or ERR ₂, with ‘+’ and ‘o’ denoting ERR ₁ and ERR ₂, respectively, and the s-axis denotes the number of iterations. From Figs. 3-5, it can be seen that all the iterations converge, and most have converged by about s = 5.
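How such an ERR curve is produced can be illustrated with a small sketch. It does not solve the SVM-type subproblems of (10); as a stand-in, each block update exactly minimizes a regularized least-squares surrogate with the same bilinear term Tr(Uᵀ X V) and the same regularizer on U Vᵀ. The Frobenius norm, the toy sizes, the random data and all function names are assumptions made for illustration only.

```python
import numpy as np

def objective(U, V, Xs, y, c):
    """Surrogate: 0.5*sum_i (Tr(U^T X_i V) - y_i)^2 + 0.5*c*||U V^T||_F^2 (not problem (10))."""
    preds = np.array([np.trace(U.T @ X @ V) for X in Xs])
    return 0.5 * np.sum((preds - y) ** 2) + 0.5 * c * np.linalg.norm(U @ V.T) ** 2

def solve_block(feats, other, y, c):
    """Exactly minimize the surrogate over one factor W while `other` is fixed.

    feats[i] has the shape of W and satisfies <W, feats[i]> = Tr(U^T X_i V).
    """
    rows, K = feats[0].shape
    A = np.stack([F.ravel() for F in feats])      # one row per sample
    G = other.T @ other                           # ||W other^T||_F^2 = Tr(W G W^T)
    H = A.T @ A + c * np.kron(np.eye(rows), G)    # normal-equation matrix
    w = np.linalg.lstsq(H, A.T @ y, rcond=None)[0]
    return w.reshape(rows, K)

def err(U_new, V_new, U_old, V_old):
    """ERR = ||U_new V_new^T - U_old V_old^T||, using the Frobenius norm (assumed)."""
    return np.linalg.norm(U_new @ V_new.T - U_old @ V_old.T)

rng = np.random.default_rng(0)
m, n, K, N, c = 6, 5, 2, 40, 0.1                  # toy sizes, for illustration only
Xs = rng.normal(size=(N, m, n))
y = rng.normal(size=N)
U = rng.normal(size=(m, K))
V = rng.normal(size=(n, K))

errs, objs = [], [objective(U, V, Xs, y, c)]
for s in range(10):
    U_old, V_old = U, V
    V = solve_block([X.T @ U for X in Xs], U, y, c)   # update V with U fixed
    U = solve_block([X @ V for X in Xs], V, y, c)     # update U with V fixed
    errs.append(err(U, V, U_old, V_old))
    objs.append(objective(U, V, Xs, y, c))

for s, (e, f) in enumerate(zip(errs, objs[1:]), start=1):
    print(f"s={s:2d}  ERR={e:.3e}  objective={f:.6f}")
```

Because each block update is an exact minimizer of its subproblem, the recorded objective values cannot increase from one iteration to the next, which mirrors the monotonicity argument in the Appendix. Monitoring the product U Vᵀ rather than U and V separately is convenient because the product is unchanged when U and V are rescaled in opposite directions.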

Fig.3

The convergence behaviours of $U_{1} V_{1}^{T}$ and $U_{2} V_{2}^{T}$ corresponding to the accuracy experiments for the CMC, Hill-vally, Ionosphere and Sonar data sets, respectively. The s-axis represents the number of iterations and the ERR-axis represents ${ERR}_{1} = ∥ U_{1}^{(s + 1)} {V_{1}^{(s + 1)}}^{T} - U_{1}^{(s)} {V_{1}^{(s)}}^{T} ∥$ and ${ERR}_{2} = ∥ U_{2}^{(s + 1)} {V_{2}^{(s + 1)}}^{T} - U_{2}^{(s)} {V_{2}^{(s)}}^{T} ∥$ , denoted by ‘+’ and ‘o’, respectively. The same conventions apply to Figs. 4 and 5.

Fig.4

The convergence behaviours, about $U_{1} V_{1}^{T}$ and $U_{2} V_{2}^{T}$ , corresponding to the accuracy experiments for Madelon, Pedestrian, Pollen (1vs2 and 2vs7) data set, respectively.

Fig.5

The convergence behaviours, about $U_{1} V_{1}^{T}$ and $U_{2} V_{2}^{T}$ , corresponding to the accuracy experiments for FingerDB, Binucleate, RGB ((5,9,11) vs (18,20,31) and (11 vs 18)) data set, respectively.

3.4 Parameter Selection

By the analysis in Section 2.5, the parameters K and C_i, i = 1, 2, 3, 4 are related to the performance of MRMLTSMCM. In the following, we briefly analyze the selection of the parameter K. Four different data sets, Hill-vally, Pedestrian, Pollen (1 vs 2) and RGB (18 vs 31), are selected for training. The parameter K is varied from 1 to 10. We use 5-fold cross validation to choose the parameters C_i, i = 1, 2, 3, 4 with the best classification accuracy. Finally, the average classification accuracies over 15 independent runs are shown in Fig. 6. As seen from the results in Fig. 6, the accuracy varies with the parameter K, and the values of K giving the best accuracy are 6, 3, 5 and 5, respectively. The results reflect the analysis in Section 6.1: a suitable parameter K reduces the number of variables, which can help ameliorate over-fitting and thus improve classification accuracy.
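The effect of K on the number of variables can be made concrete with a short count. The sketch assumes that, for m × n inputs, each projection pair consists of an m × K matrix U and an n × K matrix V (the shapes implied by the term Tr(Uᵀ X V)), and it ignores the bias term; the function names are illustrative.

```python
def lowrank_params(m, n, K):
    """Entries in one pair (U, V) with U of size m x K and V of size n x K."""
    return K * (m + n)

def full_params(m, n):
    """Entries in a weight vector for the vectorized m x n input."""
    return m * n

# The downsized leaf images are 50 x 66, and K is varied from 1 to 10.
m, n = 50, 66
for K in range(1, 11):
    low, full = lowrank_params(m, n, K), full_params(m, n)
    print(f"K={K:2d}: {low:5d} variables vs {full} for the vectorized input")
```

For 50 × 66 inputs, even K = 10 uses 1160 variables per projection pair against 3300 for a full weight vector, so K acts as a capacity control: too small a K may under-fit, while a larger K moves the model toward the vector-based one.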

Fig.6

Classification accuracies with different values of the parameter K on the Hill-vally, Pedestrian, Pollen (1 vs 2) and RGB (18 vs 31) data sets, respectively. The Accuracy-axis represents the classification accuracy and the K-axis represents the value of the parameter K.

4 Conclusion and Discussion

In this paper, a novel method for matrix data classification is proposed, called the multiple rank multi-linear twin support matrix classification machine (MRMLTSMCM), which is an extension of TWSVM. In addition, a regularization term is used to improve the performance of MRMLTSMCM. On different kinds of data sets, MRMLTSMCM achieves the best classification accuracy among the compared methods. Experimental results have also been presented to show the effectiveness of our method in terms of convergence behaviour, computation time and parameter determination. Nonlinear cases and tensor data classification will be studied in future work.

Appendix

Firstly, the proofs of Proposition 1 and Proposition 2 are given.

Proof of Proposition 1. When V ₁ is fixed, the optimization problem (10) is equivalent to the QP problem (19). Since (19) has the form of a traditional SVM, its global solution can be obtained by solving its dual problem. Similarly, when U ₁ is fixed, the global solution of the optimization problem (10) can be obtained by solving its dual problem.

Proof of Proposition 2. Assume that $U_{1}^{(s)}$ has been obtained in the s-th iteration. Updating V ₁, ξ and b ₁ by solving the optimization problem (10) gives

$$\begin{aligned}
T_{s} = \arg\min_{V_{1}, b_{1}, ξ_{j}} \quad & \frac{1}{2} \sum_{i = 1}^{l} \left(\mathrm{Tr}\left((U_{1}^{(s)})^{T} X_{i} V_{1}\right) + b_{1}\right)^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} \left(∥ U_{1}^{(s)} V_{1}^{T} ∥^{2} + b_{1}^{2}\right), \\
\mathrm{s.t.} \quad & - \left(\mathrm{Tr}\left((U_{1}^{(s)})^{T} X_{j} V_{1}\right) + b_{1}\right) \geq 1 - ξ_{j}, \quad ξ_{j} \geq 0, \quad j = l + 1, l + 2, \dots, t,
\end{aligned} \tag{43}$$

where $T_{s} = \{V_{1}^{(s)}, ξ_{j}^{(s)}, b_{1}^{(s, 2)}\}$. In the (s + 1)-th iteration, fixing V ₁ as $V_{1}^{(s)}$ and updating U ₁, ξ and b ₁ by solving the optimization problem (10) gives

$$\begin{aligned}
R_{s} = \arg\min_{U_{1}, b_{1}, ξ_{j}} \quad & \frac{1}{2} \sum_{i = 1}^{l} \left(\mathrm{Tr}\left(U_{1}^{T} X_{i} V_{1}^{(s)}\right) + b_{1}\right)^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} \left(∥ U_{1} (V_{1}^{(s)})^{T} ∥^{2} + b_{1}^{2}\right), \\
\mathrm{s.t.} \quad & - \left(\mathrm{Tr}\left(U_{1}^{T} X_{j} V_{1}^{(s)}\right) + b_{1}\right) \geq 1 - ξ_{j}, \quad ξ_{j} \geq 0, \quad j = l + 1, l + 2, \dots, t,
\end{aligned} \tag{44}$$

where $R_{s} = \{U_{1}^{(s + 1)}, ξ_{j}^{(s + 1)}, b_{1}^{(s + 1, 1)}\}$. Similarly, fixing U ₁ as $U_{1}^{(s + 1)}$ and optimizing V ₁ and b ₁ by solving the optimization problem (10) gives

$$\begin{aligned}
T_{s + 1} = \arg\min_{V_{1}, b_{1}, ξ_{j}} \quad & \frac{1}{2} \sum_{i = 1}^{l} \left(\mathrm{Tr}\left((U_{1}^{(s + 1)})^{T} X_{i} V_{1}\right) + b_{1}\right)^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} \left(∥ U_{1}^{(s + 1)} V_{1}^{T} ∥^{2} + b_{1}^{2}\right), \\
\mathrm{s.t.} \quad & - \left(\mathrm{Tr}\left((U_{1}^{(s + 1)})^{T} X_{j} V_{1}\right) + b_{1}\right) \geq 1 - ξ_{j}, \quad ξ_{j} \geq 0, \quad j = l + 1, l + 2, \dots, t.
\end{aligned} \tag{45}$$

From Proposition 1, $U_{1}^{(s)}, V_{1}^{(s)}, ξ_{j}^{(s)}$ and $U_{1}^{(s + 1)}, V_{1}^{(s + 1)}, ξ_{j}^{(s + 1)}$ are both global solutions of the corresponding subproblems of the optimization problem (10). Combining (43), (44) and (45), we have the following inequality

$$\begin{aligned}
& \frac{1}{2} \sum_{i = 1}^{l} \left(\mathrm{Tr}\left((U_{1}^{(s + 1)})^{T} X_{i} V_{1}^{(s + 1)}\right) + b_{1}^{(s + 1, 2)}\right)^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} \left(∥ U_{1}^{(s + 1)} (V_{1}^{(s + 1)})^{T} ∥^{2} + (b_{1}^{(s + 1, 2)})^{2}\right) \\
\leq {} & \frac{1}{2} \sum_{i = 1}^{l} \left(\mathrm{Tr}\left((U_{1}^{(s + 1)})^{T} X_{i} V_{1}^{(s)}\right) + b_{1}^{(s + 1, 1)}\right)^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} \left(∥ U_{1}^{(s + 1)} (V_{1}^{(s)})^{T} ∥^{2} + (b_{1}^{(s + 1, 1)})^{2}\right) \\
\leq {} & \frac{1}{2} \sum_{i = 1}^{l} \left(\mathrm{Tr}\left((U_{1}^{(s)})^{T} X_{i} V_{1}^{(s)}\right) + b_{1}^{(s, 2)}\right)^{2} + C_{1} \sum_{j = l + 1}^{t} ξ_{j} + \frac{C_{3}}{2} \left(∥ U_{1}^{(s)} (V_{1}^{(s)})^{T} ∥^{2} + (b_{1}^{(s, 2)})^{2}\right).
\end{aligned} \tag{46}$$

This shows that the objective function does not increase during the iterations.

Next, Table 6 lists how the five vector data sets are folded into matrices.

Table 6

The variation of Scale for the vector data sets

Dataset	#Prime Scale	#Matrix Scale
Sonar	60	10 × 6
CMC	9	3 × 3
Hill-vally	100	10 × 10
Ionosphere	34	7 × 5
Madelon	500	20 × 25

#Prime Scale is the dimension of vector data; #Matrix Scale is the Scale after folding.
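The foldings in Table 6 can be reproduced with a plain reshape. The sketch below assumes row-major filling (the text does not state the order) and uses placeholder feature values; `fold` is a hypothetical helper name. The Ionosphere entry is left out because 34 features do not fill a 7 × 5 matrix (35 cells), and the text does not specify how the remaining cell is handled.

```python
import numpy as np

def fold(x, shape):
    """Fold a 1-D feature vector into a matrix of the given shape (row-major order assumed)."""
    x = np.asarray(x)
    rows, cols = shape
    if x.size != rows * cols:
        raise ValueError(f"cannot fold {x.size} features into {rows} x {cols}")
    return x.reshape(rows, cols)

# Table 6 entries whose vector length equals rows * cols.
table6 = {
    "Sonar": (60, (10, 6)),
    "CMC": (9, (3, 3)),
    "Hill-vally": (100, (10, 10)),
    "Madelon": (500, (20, 25)),
}

for name, (dim, shape) in table6.items():
    X = fold(np.arange(dim), shape)   # placeholder features 0, 1, 2, ...
    print(name, X.shape)

# Alternative foldings of the 60 Sonar features, as compared in Table 4.
for shape in [(4, 15), (10, 6), (12, 5)]:
    print(shape, fold(np.arange(60), shape)[1, :3])
```

Different shapes place different features in the same row or column, which is one plausible reason why the folding affects accuracy in Tables 4 and 5.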

Declaration

The authors declare no competing financial interest.

References

1. Pirsiavash, Ramanan and Fowlkes C.C., Bilinear classifiers for visual recognition[C], Advances in Neural Information Processing Systems (2009), 1482–1490.

2. Russakovsky, Deng, Su et al., Imagenet large scale visual recognition challenge[J], International Journal of Computer Vision 115(3) (2015), 211–252.

3. Qiu, Chen, Liu et al., A fast ℓ1-solver and its applications to robust face recognition[J], Journal of Industrial and Management Optimization (JIMO) 8 (2012), 163–178.

4. Kong, Zhang and Kamel, A survey of palmprint recognition[J], Pattern Recognition 42(7) (2009), 1408–1418.

5. Koo J.J., Evans A.C. and Gross W.J., 3-D brain MRI tissue classification on FPGAs[J], IEEE Transactions on Image Processing 18(12) (2009), 2735–2746.

6. Madheswaran and Dhas D.A.S., Classification of brain MRI images using support vector machine with various Kernels[J], Biomedical Research 26(3) (2015).

7. Hou, Nie, Zhang et al., Multiple rank multi-linear SVM for matrix data classification[J], Pattern Recognition 47(1) (2014), 454–469.

8. Jolliffe I.T. and Cadima, Principal component analysis: a review and recent developments[J], Phil Trans R Soc A 374(2065) (2016), 20150202.

9. and Li, LDA/QR: An efficient and effective dimension reduction algorithm and its theoretical foundation[J], Pattern Recognition 37(4) (2004), 851–854.

10. Tao, Li, Hu et al., Supervised tensor learning[C], Data Mining, Fifth IEEE International Conference on, IEEE, 2005, p. 8.

11. Tao, Li, Wu et al., Tensor rank one discriminant analysis - a convergent method for discriminative multilinear subspace selection[J], Neurocomputing 71(10) (2008), 1866–1882.

12. Hao, He, Chen et al., A linear support higher-order tensor machine for classification[J], IEEE Transactions on Image Processing 22(7) (2013), 2911–2920.

13. Guo, Huang, Zhang et al., Support tensor machines for classification of hyperspectral remote sensing imagery[J], IEEE Transactions on Geoscience and Remote Sensing 54(6) (2016), 3248–3264.

14. Wang Q.Z., Kang W.J. and Wang Y.J., Support tensor machine image classification algorithm based on tensor principal component analysis[J], J Inf Hiding Multimed Signal Process 7(6) (2016), 1265–1273.

15. Cortes and Vapnik, Support-vector networks[J], Machine Learning 20(3) (1995), 273–297.

16. Vapnik, The nature of statistical learning. Springer, 2nd edition 1998.

17. Rafiee-Taghanaki, Arabloo, Chamkalani et al., Implementation of SVM framework to estimate PVT properties of reservoir oil[J], Fluid Phase Equilibria 346 (2013), 25–32.

18. Subasi, Classification of EMG signals using PSO optimized SVM for diagnosis of neuromuscular disorders[J], Computers in Biology and Medicine 43(5) (2013), 576–586.

19. Gao X.Z., Fan L.Y. and Xu H.T., Multiple rank multi-linear kernel support vector machine for matrix data classification[J], International Journal of Machine Learning and Cybernetics (2015), 1–11.

20. Blake C.L. and Merz C.J., UCI repository of machine learning databases. Technical Report, University of California, Department of Information and Computer Science, Irvine, CA, 1998. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html .

21. Khemchandani and Chandra, Twin support vector machines for pattern classification[J], IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5) (2007).

22. Shao Y.H., Zhang C.H., Wang X.B. et al., Improvements on twin support vector machines[J], IEEE Transactions on Neural Networks 22(6) (2011), 962–968.

23. Tao, Li, Wu et al., Supervised tensor learning[J], Knowledge And Information Systems (2007).