Transductive classification via patch alignment

Abstract

In this paper, a novel approach to transductive classification is proposed. Existing graph-based methods typically construct the Laplacian matrix directly from pairwise similarities to capture the data distribution. The proposed approach instead employs a linear transformation model to create a local patch for each data point and then unifies the patches in a single objective function to build the Laplacian matrix. Incorporating this Laplacian matrix into the transductive classification framework allows class labels to be assigned through a global optimization. Experimental results on toy data and real-world databases demonstrate that the proposed approach achieves more effective and stable performance than the compared methods and that it is insensitive to its parameters, which makes it well suited to practical applications.

Keywords

Transductive classification; Laplacian matrix; patch alignment

1. Introduction

Classification is one of the most fundamental research topics in machine learning, with the primary goal of effectively separating data into distinct classes. Over the past few decades, various classification methods have emerged. For instance, the Support Vector Machine (SVM) [6] constructs a hyperplane that maximizes the margin between two classes, Linear Discriminant Analysis (LDA) [21] identifies axes that maximize the separation between data points of different classes while keeping data from the same class close, and deep learning-based methods [16,24] minimize a classification loss through neural networks. In these supervised classification algorithms, the classifier is first trained on labeled samples and then used to predict class labels for unseen test data. However, in many scenarios a substantial amount of unlabeled data is also available. For instance, in text classification, users may have access to a vast database of documents of which only a fraction are manually labeled.

In order to leverage both labeled and unlabeled data, researchers have proposed semi-supervised classification methods [22], including the Expectation-Maximization algorithm for semi-supervised generative mixture models [13], Self-training [28], Co-training [3,30], the Transductive Support Vector Machine (TSVM) [10,27], and graph-based transductive classification methods [7,25]. These methods can be broadly categorized into two classes: inductive methods and transductive methods. An inductive method aims to induce a decision function with a low classification error rate across the entire sample space, while a transductive method directly estimates labels for the given unlabeled data. TSVM [1] can be seen as a straightforward extension of the maximum margin principle of SVM to unlabeled data. It incorporates two distinct principles: regularization through margin maximization on labeled samples and the cluster assumption through margin maximization on unlabeled samples. The existing graph-based transductive classification methods can be formulated as a quadratic optimization problem, in which the Laplacian matrix [2] is constructed to estimate the data distribution while the supervised label information is taken into account. Gaussian Laplacian based classification (Lap) [19] is a popular transductive classification method; however, its classification performance is highly sensitive to the bandwidth parameter [23] of the Gaussian function used to construct the Laplacian matrix. Another related graph-based method is transductive classification via the Local Learning algorithm (LL) [26], which applies the concept of local learning to classify each data point based on its neighbors.

In graph-based transductive classification methods, the construction of the Laplacian matrix, which estimates the distribution of data samples, plays a crucial role in enhancing performance. The patch alignment framework, introduced in [31], unifies existing data distribution measurements into a comprehensive framework comprising two essential components: local patch construction and global alignment. Drawing inspiration from this framework, this paper introduces a novel approach to transductive classification through patch alignment. Distinguishing itself from Lap, which directly employs the Gaussian kernel to compute the Laplacian matrix, the proposed method utilizes a unified objective function to train a linear transformation model for constructing local patches. Consequently, the Laplacian matrix is constructed through a global alignment process that combines all these patches. In contrast to LL, the proposed approach features a unified objective function, enabling the direct acquisition of the Laplacian matrix in a single step. Classification experiments conducted on toy data and six real-world databases showcase the superiority of the proposed method over LL. Furthermore, compared to Lap, the proposed approach demonstrates robustness to related parameters, rendering it more practical for real-world applications.

The rest of this paper is organized as follows. Section 2 presents some related works. The algorithm of transductive classification via patch alignment is proposed in Section 3. Section 4 gives the experiments and discussions, and Section 5 is the conclusion.

2. Related works

2.1. Transductive classification

According to the way unlabeled data are processed, semi-supervised learning is divided into inductive and transductive methods [22]. Graph-based transductive classification methods have received particular attention owing to their effectiveness in practice [7]. A general formulation of the transductive classification problem is given as follows.

Given a set of n data points $X = {[x_{1}, \dots, x_{l}, x_{l + 1}, \dots, x_{n}]}^{T} \in R^{n \times d}$ with d-dimensional features, we assume that the class labels of the first l points $x_{i}$ $(1 ⩽ i ⩽ l)$ are $\{q_{1}, \dots, q_{l}\}$ with each $q_{i} \in R^{c \times 1}$ , where c is the number of classes in the data set. In our multi-class setting, exactly one entry of $q_{i}$ equals 1 and the others are 0: if the hth entry $(1 ⩽ h ⩽ c)$ of $q_{i}$ is 1, then $x_{i}$ belongs to the hth class. Graph-based transductive classification algorithms can be formulated as, or are closely related to, the following quadratic optimization problem: $\begin{matrix} (1) & \min_{F \in R^{n \times c}} tr [F^{T} L F + {(F - Q)}^{T} U (F - Q)], \end{matrix}$ where $L \in R^{n \times n}$ is the Laplacian matrix, $Q = {[q_{1}, \dots, q_{l}, 0, \dots, 0]}^{T} \in R^{n \times c}$ , $F = {[f_{1}, \dots, f_{n}]}^{T} \in R^{n \times c}$ is the real-valued label matrix to be estimated, and $U \in R^{n \times n}$ is a diagonal matrix. Setting the derivative of Eq. (1) with respect to F to zero gives the closed-form solution: $\begin{matrix} (2) & F = {(L + U)}^{- 1} U Q. \end{matrix}$ As elaborated in [26], the ith diagonal element $u_{i}$ of U is assigned a large value $10^{4}$ for labeled data $(1 ⩽ i ⩽ l)$ and a small value $10^{- 6}$ for unlabeled data $(l + 1 ⩽ i ⩽ n)$ . Consequently, $f_{i}$ is forced to be nearly equal to $q_{i}$ for $1 ⩽ i ⩽ l$ , while the fitting term places almost no constraint on $f_{i}$ for $l + 1 ⩽ i ⩽ n$ . After F is computed, the class labels of the unlabeled data $x_{i}$ $(l + 1 ⩽ i ⩽ n)$ are predicted as follows: if the hth entry of $f_{i}$ is the largest, then $x_{i}$ is classified into the hth class, i.e., the hth entry of $q_{i}$ is set to 1 and the others to 0. The key component in solving the transductive classification problem is the Laplacian matrix in the first term of Eq. (1), which specifies the desirable properties of F.
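To make Eq. (2) concrete, the following minimal sketch solves it with NumPy on a small hand-made graph. The function name `transductive_labels`, the six-node graph and its edge weights are illustrative assumptions, not part of the original experiments; only the values $10^{4}$ and $10^{- 6}$ for U are taken from the text.

```python
import numpy as np

def transductive_labels(L, Q, n_labeled, u_labeled=1e4, u_unlabeled=1e-6):
    """Solve Eq. (2), F = (L + U)^{-1} U Q. Labeled points are assumed to come first."""
    n = L.shape[0]
    u = np.full(n, u_unlabeled)
    u[:n_labeled] = u_labeled
    U = np.diag(u)
    # Solve the linear system instead of forming the inverse explicitly.
    F = np.linalg.solve(L + U, U @ Q)
    return F, F.argmax(axis=1)

# Hypothetical toy graph: nodes {0, 2, 3} and {1, 4, 5} form two triangles
# joined by a single weak edge (3, 4). Nodes 0 and 1 are the labeled ones.
W = np.zeros((6, 6))
edges = [(0, 2, 1.0), (0, 3, 1.0), (2, 3, 1.0),
         (1, 4, 1.0), (1, 5, 1.0), (4, 5, 1.0), (3, 4, 0.1)]
for i, j, w in edges:
    W[i, j] = W[j, i] = w
L = np.diag(W.sum(axis=1)) - W      # graph Laplacian L = D - W

Q = np.zeros((6, 2))
Q[0, 0] = 1.0                       # node 0 belongs to class 0
Q[1, 1] = 1.0                       # node 1 belongs to class 1

F, labels = transductive_labels(L, Q, n_labeled=2)
print(labels)
```

In this sketch each unlabeled node inherits the class of the labeled node in its own triangle, because the weak cross edge contributes little to the first term of Eq. (1).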

To compute the Laplacian matrix L in Lap [19], a weighted k-nearest neighbor graph of n nodes is built, and the corresponding Laplacian matrix is computed as: $\begin{matrix} (3) & L = D - W, \end{matrix}$ where the adjacency matrix W is calculated by the Gaussian kernel as: $\begin{matrix} (4) & W_{i j} = \begin{cases} \exp (- \frac{1}{t} ‖ x_{i} - x_{j} ‖^{2}), & \text{if } x_{i} \text{ and } x_{j} \text{ are connected} \\ 0, & \text{otherwise,} \end{cases} \end{matrix}$ where t is the bandwidth parameter. Nodes $x_{i}$ and $x_{j}$ are connected if $x_{i}$ is among the k-nearest neighbors of $x_{j}$ or $x_{j}$ is among the k-nearest neighbors of $x_{i}$ . In Eq. (3), $D \in R^{n \times n}$ is a diagonal matrix with $D_{i i} = \sum_{j} W_{i j}$ . The Normalized Laplacian matrix (NLap) [33] is also widely used in transductive classification algorithms, and it is calculated as: $\begin{matrix} (5) & L_{n} = I - D^{- \frac{1}{2}} W D^{- \frac{1}{2}}, \end{matrix}$ where I is the identity matrix, while W and D are the same as in Eq. (3).
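Eqs. (3) to (5) can be sketched in a few lines of NumPy. The function name `knn_gaussian_laplacians` and the random test data are assumptions made for illustration; the sketch uses a dense pairwise distance matrix, which is only reasonable for small n.

```python
import numpy as np

def knn_gaussian_laplacians(X, k=5, t=1000.0):
    """Return W (Eq. 4), L = D - W (Eq. 3) and L_n (Eq. 5) for the rows of X."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # k nearest neighbours of every point, with the point itself excluded.
    d = sq.copy()
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    # "Or" rule: i and j are connected if either is a neighbour of the other.
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = True
    A = A | A.T
    W = np.where(A, np.exp(-sq / t), 0.0)
    deg = W.sum(axis=1)
    L = np.diag(deg) - W
    # Note: a very small t can make exp() underflow to 0 and deg vanish.
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_n = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return W, L, L_n

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
W, L, L_n = knn_gaussian_laplacians(X, k=5, t=1.0)
print(np.abs(L.sum(axis=1)).max())   # row sums of L vanish up to round-off
```

The comment on underflow hints at why the bandwidth t matters: as t shrinks, the weights in Eq. (4) decay towards zero and the graph carries less and less usable similarity information.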

Another transductive classification method, the LL algorithm [26], formulates a Laplacian matrix that leads to a solution in which the label of each data point can be well predicted from its neighbors and their labels. This method learns the Laplacian matrix in two steps. In the first step, a local model is learnt; in the second step, the sum of the local prediction errors over all the data is minimized to compute the Laplacian matrix. Accordingly, Lap, NLap and LL are selected as the comparison algorithms in this paper.

2.2. Patch alignment framework

The above discussion shows that the Laplacian matrix, which reveals the intrinsic structure of the data distribution, is critical for the performance of transductive classification algorithms. Many research works have demonstrated that the local manifold structure is more important than the global one [14] when estimating the data distribution, and some works [4,26,29] report that local learning algorithms often outperform global ones. However, the global information [33] should also be considered when estimating the data structure.

In order to explicitly explain the relationship between the “Local” structure and the “Global” structure, the patch alignment framework [31] was proposed to unify the existing data distribution measurements into two parts: patch optimization and whole alignment. The local patch can be constructed differently according to the specific application, while the whole alignment is shared by all methods. The framework has been used to unify various spectral analysis-based dimensionality reduction algorithms, such as ISOMAP [20], LPP [9] and LTSA [32]. Additionally, in [29], a ranking algorithm based on local regression and global alignment is proposed for cross-media retrieval. In contrast, the proposed method addresses the problem of transductive classification, and it builds on the patch alignment strategy for dimension reduction [31].

Given a data set X with n data samples $x_{i} \in R^{d \times 1}$ , each data sample $x_{i}$ and its k related data $x_{i_{1}}, \dots, x_{i_{k}}$ form the matrix $X_{i} = [x_{i}, x_{i_{1}}, \dots, x_{i_{k}}] \in R^{d \times (k + 1)}$ , which denotes the local patch for $x_{i}$ . For $X_{i}$ , we have a patch mapping $X_{i} \rightarrow Y_{i}$ , where $Y_{i} = [y_{i}, y_{i_{1}}, \dots, y_{i_{k}}] \in R^{d \times (k + 1)}$ represents the coordinates of the data in the new feature space. The patch optimization is defined as in [31]: $\begin{matrix} (6) & \arg\min_{Y_{i}} tr (Y_{i} L_{i} Y_{i}^{T}) . \end{matrix}$

Since different data distribution measurements have their own mapping rules, $L_{i}$ varies between algorithms. All the $Y_{i}$ can be unified as a whole by assuming that the coordinates of the ith patch $Y_{i} = [y_{i}, y_{i_{1}}, \dots, y_{i_{k}}]$ are selected from the global coordinates $Y = [y_{1}, \dots, y_{n}]$ , such that: $\begin{matrix} (7) & Y_{i} = Y S_{i}, \end{matrix}$ where $S_{i}$ is the selection matrix [31]. The whole alignment is conducted by substituting Eq. (7) into Eq. (6) and summing over all the patch optimizations, which constructs the Laplacian matrix L: $\begin{matrix} (8) & \arg\min_{Y} \sum_{i = 1}^{n} tr (Y S_{i} L_{i} S_{i}^{T} Y^{T}) = \arg\min_{Y} tr (Y L Y^{T}), \end{matrix}$ where $L = \sum_{i = 1}^{n} S_{i} L_{i} S_{i}^{T}$ .
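The equivalence behind Eqs. (7) and (8) can be checked numerically. The sketch below is illustrative only: the helper `selection_matrix`, the cyclic index sets and the random symmetric matrices standing in for $L_{i}$ are all assumptions, and $S_{i}$ is taken to have a single 1 in each column at the row given by the corresponding patch index.

```python
import numpy as np

def selection_matrix(n, v):
    """S_i of shape (n, len(v)) with (S_i)[v[q], q] = 1, so that Y @ S_i = Y[:, v]."""
    S = np.zeros((n, len(v)))
    S[v, np.arange(len(v))] = 1.0
    return S

rng = np.random.default_rng(0)
n, d, k = 6, 2, 2
Y = rng.normal(size=(d, n))                      # global coordinates, one column per point
patches = [[i, (i + 1) % n, (i + 2) % n] for i in range(n)]   # hypothetical index sets

L = np.zeros((n, n))
local_sum = 0.0
for v in patches:
    B = rng.normal(size=(k + 1, k + 1))
    Li = B @ B.T                                 # arbitrary symmetric stand-in for L_i
    S = selection_matrix(n, v)
    Yi = Y @ S                                   # Eq. (7)
    local_sum += np.trace(Yi @ Li @ Yi.T)        # patch objective, Eq. (6)
    L += S @ Li @ S.T                            # aligned Laplacian

global_value = np.trace(Y @ L @ Y.T)             # right-hand side of Eq. (8)
print(np.isclose(local_sum, global_value))
```

The check holds for any choice of $L_{i}$, which is why the alignment step can be shared by all methods while only the patch construction changes.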

3. Transductive classification via patch alignment

In this section, we propose a novel transductive classification method, in which the Laplacian matrix is constructed according to the patch alignment framework [31].

3.1. Local patch construction

Given $X = {[x_{1}, \dots, x_{n}]}^{T} \in R^{n \times d}$ , for each data point $x_{i} \in X$ , a local patch $X_{i}$ can be constructed from $x_{i}$ and the set of its k-nearest neighbors as: $X_{i} = [x_{i}, x_{i_{1}}, x_{i_{2}}, \dots, x_{i_{k}}] \in R^{d \times (k + 1)}$ , in which the neighbors are selected according to the Euclidean distance between data points. A linear transformation model $o_{i} (x_{j}) = a_{i}^{T} x_{j} + b_{i}$ is adopted as the rule that maps a data point $x_{j} \in R^{d \times 1}$ in the patch of $x_{i}$ into the new feature representation $y_{j} \in R^{m \times 1}$ . For the model $o_{i} (\cdot)$ , the subscript i means that the model is trained on the local patch of $x_{i}$ ; $a_{i} \in R^{d \times m}$ is the local projection matrix and $b_{i} \in R^{m \times 1}$ is the bias term. We adopt the linear transformation model because a linear model is fast and therefore suitable for practical applications, and because the local structure of a manifold is approximately linear [14].

In order to learn the linear transformation model for each patch, the local prediction error of the model with respect to a single data $x_{j} \in X_{i}$ is given by: $\begin{matrix} (9) & {‖ a_{i}^{T} x_{j} + b_{i} - y_{j} ‖}^{2} . \end{matrix}$

The local patch error is obtained by summing the local prediction errors over all the data in $X_{i}$ and adding a regularization term: $\begin{matrix} (10) & \sum_{x_{j} \in X_{i}} {‖ a_{i}^{T} x_{j} + b_{i} - y_{j} ‖}^{2} + μ {‖ a_{i} ‖}^{2}, \end{matrix}$ where the second term is the regularization term, which is introduced to avoid overfitting, and the norm of a matrix is its Frobenius norm. The model is learned by minimizing the local patch error, written in matrix form as: $\begin{matrix} (11) & \arg\min_{Y_{i}, b_{i}, a_{i}} {‖ X_{i}^{T} a_{i} + 1_{k + 1} b_{i}^{T} - Y_{i} ‖}^{2} + μ {‖ a_{i} ‖}^{2} . \end{matrix}$

The calculated coordinates of all the data in $X_{i}$ are stored in $Y_{i} = {[y_{i}, y_{i_{1}}, y_{i_{2}}, \dots, y_{i_{k}}]}^{T} \in R^{(k + 1) \times m}$ , and $1_{k + 1} \in R^{(k + 1) \times 1}$ is the all-ones vector.

In order to assign an optimal $Y_{i}$ to each data point, we align all the linear models by summing Eq. (11) over all the data in the training set, which is inspired by [29]. This gives the unified objective function: $\begin{matrix} (12) & \arg\min_{Y_{i}, b_{i}, a_{i}} \sum_{i = 1}^{n} ({‖ X_{i}^{T} a_{i} + 1_{k + 1} b_{i}^{T} - Y_{i} ‖}^{2} + μ {‖ a_{i} ‖}^{2}) . \end{matrix}$

Setting the derivative of Eq. (12) with respect to $b_{i}$ to zero gives: $\begin{matrix} (13) & b_{i} = \frac{1}{k + 1} (Y_{i}^{T} 1_{k + 1} - a_{i}^{T} X_{i} 1_{k + 1}) . \end{matrix}$ Similarly, setting the derivative of Eq. (12) with respect to $a_{i}$ to zero gives: $\begin{matrix} (14) & a_{i} = {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P Y_{i}, \end{matrix}$ where $P = I - \frac{1}{k + 1} 1_{k + 1} 1_{k + 1}^{T} \in R^{(k + 1) \times (k + 1)}$ is the centering matrix. Note that $P = P^{T} = P P^{T}$ .
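The intermediate steps behind Eqs. (13) and (14) can be sketched as follows, reading the norms as Frobenius norms and writing $E_{i} = X_{i}^{T} a_{i} + 1_{k + 1} b_{i}^{T} - Y_{i}$ for the residual of the ith patch ( $E_{i}$ is shorthand introduced here for illustration only): $\begin{aligned} \frac{\partial}{\partial b_{i}} {‖ E_{i} ‖}^{2} = 2 E_{i}^{T} 1_{k + 1} = 0 \; & \Rightarrow \; (k + 1) b_{i} = Y_{i}^{T} 1_{k + 1} - a_{i}^{T} X_{i} 1_{k + 1}, \\ \text{substituting } b_{i} \; & \Rightarrow \; E_{i} = P X_{i}^{T} a_{i} - P Y_{i}, \\ \frac{\partial}{\partial a_{i}} ({‖ P X_{i}^{T} a_{i} - P Y_{i} ‖}^{2} + μ {‖ a_{i} ‖}^{2}) = 2 X_{i} P (P X_{i}^{T} a_{i} - P Y_{i}) + 2 μ a_{i} = 0 \; & \Rightarrow \; (X_{i} P X_{i}^{T} + μ I) a_{i} = X_{i} P Y_{i} . \end{aligned}$ The last implication uses $P = P^{T} = P P^{T}$ , and the matrix $X_{i} P X_{i}^{T} + μ I$ is invertible for any $μ > 0$ because $X_{i} P X_{i}^{T}$ is positive semidefinite.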

After calculating $a_{i}$ and $b_{i}$ , the linear transformation model for each patch is determined. By substituting $a_{i}$ and $b_{i}$ , the objective function in Eq. (12) becomes: $\begin{matrix} (15) & \arg\min_{Y_{i}} \sum_{i = 1}^{n} [{‖ P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P Y_{i} - P Y_{i} ‖}^{2} + μ \, tr (Y_{i}^{T} P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 2} X_{i} P Y_{i})], \end{matrix}$ and the $L_{i}$ for each local patch $X_{i}$ can be constructed through the following simplification:
$\begin{aligned} & \arg\min_{Y_{i}} \sum_{i = 1}^{n} [{‖ P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P Y_{i} - P Y_{i} ‖}^{2} + μ \, tr (Y_{i}^{T} P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 2} X_{i} P Y_{i})] \\ & = \arg\min_{Y_{i}} \sum_{i = 1}^{n} tr [Y_{i}^{T} {(P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P - P)}^{2} Y_{i} + μ Y_{i}^{T} P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 2} X_{i} P Y_{i}] \\ & = \arg\min_{Y_{i}} \sum_{i = 1}^{n} tr [Y_{i}^{T} (P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P - 2 P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P + P) Y_{i} \\ & \qquad + μ Y_{i}^{T} P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 2} X_{i} P Y_{i}] \\ & = \arg\min_{Y_{i}} \sum_{i = 1}^{n} tr [Y_{i}^{T} (P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P - 2 P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P + P) Y_{i}] \\ (16) \quad & = \arg\min_{Y_{i}} \sum_{i = 1}^{n} tr [Y_{i}^{T} (P - P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P) Y_{i}], \end{aligned}$
where $L_{i} = P - P X_{i}^{T} {(X_{i} P X_{i}^{T} + μ I)}^{- 1} X_{i} P$ . The fourth line combines the quadratic term and the regularization term through the identity ${(X_{i} P X_{i}^{T} + μ I)}^{- 1} (X_{i} P X_{i}^{T} + μ I) {(X_{i} P X_{i}^{T} + μ I)}^{- 1} = {(X_{i} P X_{i}^{T} + μ I)}^{- 1}$ , together with $P P = P$ . Eq. (16) is therefore equivalent to the following objective function: $\begin{matrix} (17) & \arg\min_{Y_{i}} \sum_{i = 1}^{n} tr (Y_{i}^{T} L_{i} Y_{i}) . \end{matrix}$
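The simplification leading to Eq. (16) can be checked numerically for a single patch. In the sketch below, the function names `patch_laplacian` and `patch_error` and the random patch are assumptions made for illustration; the formulas themselves follow Eqs. (11), (13), (14) and the definition of $L_{i}$.

```python
import numpy as np

def patch_laplacian(Xi, mu):
    """L_i = P - P X_i^T (X_i P X_i^T + mu I)^{-1} X_i P for a patch X_i of shape (d, k+1)."""
    d, m = Xi.shape
    P = np.eye(m) - np.ones((m, m)) / m                  # centering matrix
    M = Xi @ P @ Xi.T + mu * np.eye(d)
    return P - P @ Xi.T @ np.linalg.solve(M, Xi @ P)

def patch_error(Xi, Yi, mu):
    """Value of Eq. (11) at the optimal a_i and b_i for a fixed Y_i of shape (k+1, m)."""
    d, m = Xi.shape
    ones = np.ones((m, 1))
    P = np.eye(m) - ones @ ones.T / m
    a = np.linalg.solve(Xi @ P @ Xi.T + mu * np.eye(d), Xi @ P @ Yi)   # Eq. (14)
    b = (Yi.T @ ones - a.T @ Xi @ ones) / m                            # Eq. (13)
    resid = Xi.T @ a + ones @ b.T - Yi
    return np.sum(resid ** 2) + mu * np.sum(a ** 2)

rng = np.random.default_rng(0)
Xi = rng.normal(size=(3, 6))     # hypothetical patch: d = 3 features, k + 1 = 6 points
Yi = rng.normal(size=(6, 2))     # hypothetical coordinates with m = 2
mu = 0.5

Li = patch_laplacian(Xi, mu)
lhs = patch_error(Xi, Yi, mu)
rhs = np.trace(Yi.T @ Li @ Yi)
print(np.isclose(lhs, rhs))
```

Two properties of $L_{i}$ are also visible in this sketch: it is symmetric, and it annihilates the all-ones vector because $P 1_{k + 1} = 0$ , so constant label assignments within a patch incur no cost.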

3.2. Whole alignment

Following the whole alignment steps presented in Eq. (7) and Eq. (8), with the coordinates now stored as rows so that $Y_{i} = S_{i}^{T} Y$ , the objective function in Eq. (17) becomes $\arg\min_{Y} tr (Y^{T} L Y)$ , in which $L = \sum_{i = 1}^{n} S_{i} L_{i} S_{i}^{T} \in R^{n \times n}$ . The selection matrix $S_{i} \in R^{n \times (k + 1)}$ is constructed as: $\begin{matrix} (18) & {(S_{i})}_{p q} = \begin{cases} 1, & \text{if } p = {(v_{i})}_{q} \\ 0, & \text{otherwise,} \end{cases} \end{matrix}$ where $v_{i} = [i, i_{1}, i_{2}, \dots, i_{k}]$ denotes the set of indices of the local patch $X_{i}$ . The Laplacian matrix is obtained by the iterative procedure: $\begin{matrix} (19) & L (v_{i}, v_{i}) \leftarrow L (v_{i}, v_{i}) + L_{i}, \end{matrix}$ for $i = 1, \dots, n$ with the initialization $L = 0$ , where $L (v_{i}, v_{i})$ is the submatrix of L formed by the rows and columns indexed by $v_{i}$ . Applying this Laplacian matrix to Eq. (1) yields a novel transductive classification algorithm via patch alignment (TCPA).
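Putting the pieces together, the procedure of Eq. (19) followed by Eq. (2) can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `tcpa_laplacian` and `tcpa_predict`, the brute-force neighbor search and the two-line toy data set are all hypothetical choices, and labeled points are assumed to be stored first.

```python
import numpy as np

def tcpa_laplacian(X, k=5, mu=1000.0):
    """Accumulate L via Eq. (19) from patch Laplacians L_i; X has shape (n, d)."""
    n, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)                 # exclude each point from its own neighbours
    nbrs = np.argsort(sq, axis=1)[:, :k]
    m = k + 1
    P = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    L = np.zeros((n, n))
    for i in range(n):
        v = np.concatenate(([i], nbrs[i]))       # index set v_i
        Xi = X[v].T                              # patch of shape (d, k + 1)
        M = Xi @ P @ Xi.T + mu * np.eye(d)
        Li = P - P @ Xi.T @ np.linalg.solve(M, Xi @ P)
        L[np.ix_(v, v)] += Li                    # Eq. (19)
    return L

def tcpa_predict(X, Q, n_labeled, k=5, mu=1000.0):
    """Plug the aligned Laplacian into Eq. (2) and return the predicted class indices."""
    L = tcpa_laplacian(X, k, mu)
    u = np.full(X.shape[0], 1e-6)
    u[:n_labeled] = 1e4
    F = np.linalg.solve(L + np.diag(u), u[:, None] * Q)
    return F.argmax(axis=1)

# Hypothetical toy data: two vertical line segments, one labeled point on each.
ys = 0.1 * np.arange(10)
A = np.column_stack([np.zeros(10), ys])
B = np.column_stack([np.full(10, 5.0), ys])
X = np.vstack([A[:1], B[:1], A[1:], B[1:]])      # labeled points first
Q = np.zeros((20, 2))
Q[0, 0] = 1.0
Q[1, 1] = 1.0

labels = tcpa_predict(X, Q, n_labeled=2, k=3)
print(labels)
```

On this toy set each segment is labeled with the class of its single labeled point, since no patch spans both segments and the aligned Laplacian therefore propagates labels only within a segment.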

3.3. Comparative analysis with the existing method

Connections to Patch Alignment: the patch alignment framework provides an efficient tool for constructing the Laplacian matrix in accordance with specific applications. $L_{i}$ encodes the specified mapping rule for the ith patch. By summing the $L_{i}$ of all patches through global alignment, the Laplacian matrix L is obtained. Inspired by this framework, we adopt the linear transformation model as the mapping rule for each local patch. When this model is calculated, not only the data of the local patch (Eq. (11)) but also the data of the whole set (Eq. (12)) contribute to the calculation. This indicates that our method incorporates both local and global information into the patch construction. The Laplacian matrix is then constructed through the whole alignment step.

Connections to LL: in LL, for each data point $x_{i}$ , another local linear model $o_{i}^{'} (x) = a_{i}^{' T} (x - x_{i}) + b_{i}^{'}$ is learnt, where $a_{i}^{' T}$ is the weight vector and $b_{i}^{'}$ is the bias term. Firstly, each local linear model $o_{i}^{'} (\cdot)$ is learnt by minimizing an objective function similar to Eq. (11), and it is applied to predict the value $o_{i}^{'} (x_{i})$ for the sample $x_{i}$ only. Secondly, the sum of the local prediction errors over all the data is minimized to compute the Laplacian matrix. This Laplacian matrix differs from the one used in TCPA in the following aspects. First, LL adopts a two-step approach to learn the Laplacian matrix and does not have a unified objective function for optimization. In contrast, TCPA has a unified objective function (Eq. (12)), and the Laplacian matrix can be obtained directly through the whole alignment step (Eq. (8)). Second, in LL, each local model is applied to a single data point only to obtain the local prediction error. In contrast, as shown in Eq. (12), we minimize the sum of all the local patch prediction errors. We argue that this error better characterizes the capability of the local linear model, because it counts the errors over all the data in each patch rather than at a single data point. The following experiments demonstrate that TCPA outperforms LL for data classification.

Connections to Lap and NLap: the popular Lap and NLap explicitly emphasize the pairwise similarities [26]. They are usually regarded as putting smoothness constraints on the solution. Here we show that these two methods can also be investigated from the local learning point of view. For Lap, the objective function [34] is: $\begin{matrix} (20) & \arg\min_{Y} \sum_{i = 1}^{n} \sum_{j = 1}^{n} {‖ y_{i} - y_{j} ‖}^{2} W_{i j} = \arg\min_{Y} tr (Y^{T} L Y), \end{matrix}$ where $W_{i j}$ and L are calculated by Eq. (4) and Eq. (3). By setting the gradient $\frac{\partial}{\partial Y} tr (Y^{T} L Y)$ to 0, it can be seen that the optimal Y minimizing Eq. (20) must satisfy the following harmonic property [34]: $\begin{matrix} (21) & y_{i} = \frac{\sum_{x_{j} \in N_{i}} W_{i j} y_{j}}{\sum_{x_{j} \in N_{i}} W_{i j}}, \end{matrix}$ where $N_{i}$ includes the k-nearest neighbors of $x_{i}$ . Equation (21) means that Lap expects $y_{i}$ to equal a convex combination of the $y_{j}$ for $x_{j} \in N_{i}$ , where the weight of $y_{j}$ is proportional to $W_{i j}$ , which measures the similarity between $x_{i}$ and $x_{j}$ . Similarly, for NLap, we have: $\begin{matrix} (22) & y_{i} = \sum_{x_{j} \in N_{i}} \frac{W_{i j}}{\sqrt{D_{i i} D_{j j}}} y_{j} . \end{matrix}$
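The step from the zero-gradient condition to the harmonic property can be spelled out as follows, assuming that $N_{i}$ is the set of nodes connected to $x_{i}$ in the graph, so that $W_{i j} = 0$ for $x_{j} \notin N_{i}$ : $\begin{aligned} \frac{\partial}{\partial Y} tr (Y^{T} L Y) = 2 L Y = 0 \; & \Rightarrow \; (D - W) Y = 0 \\ & \Rightarrow \; D_{i i} y_{i} = \sum_{j} W_{i j} y_{j} \\ & \Rightarrow \; y_{i} = \frac{\sum_{x_{j} \in N_{i}} W_{i j} y_{j}}{\sum_{x_{j} \in N_{i}} W_{i j}}, \end{aligned}$ where the first equality uses the symmetry of L and the last step uses $D_{i i} = \sum_{j} W_{i j}$ . Repeating the argument with $L_{n} = I - D^{- \frac{1}{2}} W D^{- \frac{1}{2}}$ in place of L gives $y_{i} = \sum_{j} \frac{W_{i j}}{\sqrt{D_{i i} D_{j j}}} y_{j}$ , which is the NLap counterpart of the harmonic property.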

Therefore, from the local learning point of view, Lap, NLap and TCPA differ from each other mainly in their answers to the local learning problem. Their final classification performance relies heavily on whether the corresponding local models can properly estimate $y_{i}$ based on ${\{(x_{j}, y_{j})\}}_{x_{j} \in N_{i}}$ . This is examined in the experimental results.

4. Experiment and discussions

4.1. Databases

Table 1
Databases descriptions

Databases	Classes	Dims	Points
Toy data	3	2	1500
COIL-20	20	1024	1440
3D Model	53	500	1811
ORL	40	1024	400
Yale-B	38	1024	2414
PIE	22	1024	3740
USPS	10	256	4000

Fig. 1.

Sample data from 6 databases used in the experiment.

In order to assess the effectiveness of the algorithms, we compare their generalization performance on toy data and six real world databases with different properties. The toy data contain three classes, and each class has 500 samples. The real world databases consist of multi-class problems. In COIL-20 [11], the data are gray-scale images of 20 different objects taken from different angles. The Princeton 3D Model database [17] contains 1814 models collected from the World Wide Web. For each 3D model, there is an Object File Format (.off) file with the polygonal geometry of the model, and the models can be classified into 53 categories according to their semantics. The ORL database [15] contains 10 different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, with varying lighting, facial expressions and facial details. The Yale-B face database [8] contains face images showing varying facial expressions and configurations. The CMU Pose, Illumination, and Expression (PIE) database [18] contains facial images taken under 13 different poses, 43 different illumination conditions, and with 4 different expressions. The USPS database contains the well known data on handwritten digit recognition. For the COIL-20, ORL, Yale-B, PIE and USPS databases, the images are treated as vectors of intensity values. For the 3D model database, the D2 shape distribution descriptor [12] is used to build the vectors. The details of the databases used in our experiments are presented in Table 1, and sample images are shown in Fig. 1.

4.2. Experiments configurations

The parameter configurations for the experiments are listed in Table 2. We first use the toy data to compare the performance of Lap, NLap, LL and TCPA. For Lap and NLap, the performance is influenced by three parameters: the number of nearest neighbors (k), the bandwidth parameter (t) and the number of labeled samples (l). LL and TCPA also have three parameters: the number of nearest neighbors in the local patch (k), the regularization parameter (μ) and the number of labeled samples (l). The experiments are separated into three parts, and the results are presented in Figs. 2, 4 and 6, in which the points in three concentric circles represent the three classes in the toy data, and the classification results are shown by the red, green and blue points.

Table 2
Parameters configurations of the transductive methods in experiments

Experiment	k	$t (μ)$	l
Toy data
Experiments of k	[ $1, \dots, 25$ ]	1000	91
Experiments of $t (μ)$	5	[ $0.001, \dots, 1000$ ]	91
Experiments of l	5	1000	[ $1, \dots, 181$ ]
Coil-20
Experiments of k	[ $1, \dots, 33$ ]	1000	36
Experiments of $t (μ)$	5	[ $0.001, \dots, 1000$ ]	36
Experiments of l	5	1000	[ $4, \dots, 36$ ]
3D Model
Experiments of k	[ $1, \dots, 19$ ]	1000	20
Experiments of $t (μ)$	5	[ $0.001, \dots, 1000$ ]	20
Experiments of l	5	1000	[ $2, \dots, 20$ ]
ORL
Experiments of k	[ $1, \dots, 10$ ]	1000	10
Experiments of $t (μ)$	5	[ $0.001, \dots, 1000$ ]	10
Experiments of l	5	1000	[ $2, \dots, 10$ ]
Yale-B
Experiments of k	[ $1, \dots, 37$ ]	1000	40
Experiments of $t (μ)$	5	[ $0.001, \dots, 1000$ ]	40
Experiments of l	5	1000	[ $4, \dots, 40$ ]
PIE
Experiments of k	[ $1, \dots, 37$ ]	1000	80
Experiments of $t (μ)$	5	[ $0.001, \dots, 1000$ ]	80
Experiments of l	5	1000	[ $8, \dots, 80$ ]
USPS
Experiments of k	[ $1, \dots, 19$ ]	1000	20
Experiments of $t (μ)$	5	[ $0.001, \dots, 1000$ ]	20
Experiments of l	5	1000	[ $2, \dots, 20$ ]

Similarly, the experiments on real world databases are divided into three parts. As shown in Table 2, in each experiment we fix two parameters and vary the third. For example, when evaluating how the classification performance is influenced by l, a random subset with l samples per object (e.g. for COIL-20, $l = [4, \dots, 36]$ ) is taken as labeled data, while k and $t (μ)$ are fixed at 5 and 1000, respectively. For each given l, the accuracy is computed on each split, and the classification results (ARs) are averaged over 20 random splits. Besides the four transductive classification algorithms (TCPA, LL, Lap and NLap), the following classification methods are included in the comparison:

In the graph-based Transductive Support Vector Machine (G-TSVM) [12], the pairwise distances computed by the Gaussian kernel (Eq. (4)) are used to reflect the cluster assumption. According to the experiments in [5], the bandwidth parameter t of the Gaussian kernel can be set to 1000, which leads only to a minor loss in accuracy. The two penalty parameters p and $p^{*}$ for the labeled and unlabeled samples are determined by an iterative procedure that searches for their optimal values;

Linear Discriminant Analysis (LDA) searches for the projection axes on which data of different classes are far from each other while data of the same class are close to each other;

Principal Component Analysis (PCA), which maximizes the mutual information between the original high-dimensional data and their low-dimensional projection, is a linear and unsupervised dimension reduction method. It can be used for classification in the same way as LDA;

Locality Preserving Projections (LPP) [9] builds a graph model that reflects the intrinsic geometrical structure of the data space and finds a projection that respects this graph structure. Since LPP also adopts the Gaussian kernel (Eq. (4)) to construct the adjacency matrix that preserves the graph structure, its two parameters k and t are also fixed at 5 and 1000 in this experiment.

It should be noted that for the dimension reduction methods (LDA, PCA, LPP), the classification is performed in the low-dimensional space. After the projection matrix is obtained, the original data are projected into the low-dimensional space, and the classification is conducted with a K-NN classifier. Since the reduced dimension affects the classification results, we report the highest ARs that the dimension reduction methods achieve.
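The averaging protocol described in this subsection (l labeled samples per class, accuracy on the remaining data, mean over 20 random splits) can be sketched as follows. The function `average_accuracy`, the nearest-labeled-point classifier used as a stand-in for a transductive method, and the synthetic two-class data are all hypothetical and serve only to illustrate the bookkeeping.

```python
import numpy as np

def nearest_labeled_classifier(X, y_labeled, n_labeled):
    """Hypothetical stand-in for a transductive method: copy the label of the
    nearest labeled point. The first n_labeled rows of X are the labeled data."""
    d = np.sum((X[n_labeled:, None, :] - X[None, :n_labeled, :]) ** 2, axis=-1)
    return y_labeled[d.argmin(axis=1)]

def average_accuracy(X, y, l, classify, n_splits=20, seed=0):
    """Mean accuracy on the unlabeled data over random splits with l labels per class."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_splits):
        labeled = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=l, replace=False)
            for c in np.unique(y)
        ])
        unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
        order = np.concatenate([labeled, unlabeled])     # labeled points first
        pred = classify(X[order], y[labeled], len(labeled))
        accs.append(np.mean(pred == y[unlabeled]))
    return float(np.mean(accs))

# Synthetic data: two well separated classes with bounded noise.
rng = np.random.default_rng(1)
X = np.vstack([rng.uniform(-1, 1, size=(30, 2)),
               10 + rng.uniform(-1, 1, size=(30, 2))])
y = np.repeat([0, 1], 30)

ar = average_accuracy(X, y, l=3, classify=nearest_labeled_classifier)
print(ar)
```

Any of the compared methods could be passed as `classify` as long as it accepts the data with the labeled points first and returns predictions for the remaining points in the same order.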

4.3. Performance evaluations

In order to discuss the influence of the parameters on these four methods and to compare their performance on toy data and real world databases, several experiments are conducted in this section. Three parameters are discussed: k, $t (μ)$ and l.

Fig. 2.

Performance comparison of the transductive classification methods on toy data by changing k from 1 to 25. (a) for TCPA, (b) for LL, (c) for Lap, (d) for NLap.

4.3.1. Influence of the parameter k

First, in order to analyse the influence of the parameter k, a series of experiments is conducted in which k is increased from 1 to 25, while $t (μ)$ and l are set to 1000 and 91. The experimental results on toy data are shown in Fig. 2. We observe that when k increases from 1 to 5, the performance of the four methods is sensitive to k. When k increases from 5 to 25, they obtain stable performance. This indicates that the parameter k has a similar influence on the four methods.

In addition, to evaluate the performance of the proposed method on real world databases, the classification performance of the four methods with different values of k on six datasets is shown in Fig. 3. It can be observed that on the Princeton 3D Model, ORL, Yale-B and PIE datasets, TCPA achieves the best classification performance across the values of k, and its ARs remain high and stable. In Fig. 3(a), the classification performances of the four methods are quite similar on the COIL-20 dataset. Additionally, in Fig. 3(f), when k is between 1 and 7, TCPA surpasses the other three methods on the USPS dataset.

Fig. 3.

Performance comparison of the transductive classification methods on real world databases. This figure shows the ARs vs. parameter k variations (a) for COIL-20 (b) for Princeton 3D Model (c) for ORL (d) for Yale-B (e) for PIE (f) for USPS.

4.3.2. Influence of the parameter $t (μ)$

In order to discuss the influence of $t (μ)$ , the parameters k and l are fixed at 5 and 91. The experimental results on toy data are shown in Fig. 4. It can be observed that TCPA works well when μ is varied from 0.001 to 1000; TCPA is therefore insensitive to μ. Compared with TCPA, LL performs worse on classification, though its performance is also insensitive to μ. Figure 4(c) and (d) show the classification results of Lap and NLap. From these two subfigures, it can be observed that when $t = 0.001$ and $t = 0.01$ , the classification results are poor, whereas when t is increased to 1000, the classification results become much better. This indicates that the classification results of Lap and NLap are sensitive to the bandwidth parameter t.

Fig. 4.

Performance comparison of the transductive classification methods on toy data by changing $t (μ)$ from 0.001 to 1000. (a) for TCPA, (b) for LL, (c) for Lap, (d) for NLap.

For real world data, the classification results of the four methods for different values of $t (μ)$ are shown in Fig. 5. Across the six subfigures, the performance of TCPA and LL is more stable than that of Lap and NLap over a broad range of $t (μ)$, and TCPA outperforms LL. This indicates that TCPA is better suited to classification than the other three transductive algorithms.

Fig. 5.

Performance comparison of the transductive classification methods on real world databases. This figure shows the ARs vs. parameter $t (μ)$ variations (a) for COIL-20 (b) for Princeton 3D Model (c) for ORL (d) for Yale-B (e) for PIE (f) for USPS.

4.3.3. Influence of the parameter l

To examine the influence of l, the parameters k and $t (μ)$ are fixed at 5 and 1000, respectively, and l is increased from 1 to 181. The experimental results on toy data are shown in Fig. 6. When l is 1, all four methods classify poorly. For the other values of l, Lap, NLap and TCPA classify well and stably, whereas LL does not achieve good classification until l reaches 181.
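A simple counting argument is consistent with the poor results at l = 1. Table 1 lists three classes for the toy data, so a single labeled sample can represent at most one of them. The sketch below assumes that l counts labeled samples in total, that AR is the accuracy rate measured on the unlabeled points, and that a method can only output classes seen among the labeled points. The function names and the 12-point ground truth are hypothetical.

```python
def accuracy_rate(predicted, truth, labeled_idx):
    """Fraction of unlabeled points whose predicted class matches the ground truth."""
    labeled = set(labeled_idx)
    unlabeled = [i for i in range(len(truth)) if i not in labeled]
    correct = sum(1 for i in unlabeled if predicted[i] == truth[i])
    return correct / len(unlabeled)

def best_possible_ar(truth, labeled_idx):
    """Upper bound on AR when only classes present among the labeled points can be predicted."""
    seen = {truth[i] for i in labeled_idx}
    # An oracle that is right whenever the true class was seen, and wrong otherwise.
    oracle = [c if c in seen else None for c in truth]
    return accuracy_rate(oracle, truth, labeled_idx)

# Hypothetical 3-class ground truth with four points per class.
truth = [0] * 4 + [1] * 4 + [2] * 4

print(best_possible_ar(truth, [0]))        # one labeled point: only class 0 is seen
print(best_possible_ar(truth, [0, 4, 8]))  # one labeled point per class
```

With one labeled point the bound stays far below 1 regardless of the method, which matches the observation that all four methods fail in that setting.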

Fig. 6.

Performance comparison of the transductive classification methods on toy data by changing l from 1 to 181. (a) for TCPA, (b) for LL, (c) for Lap, (d) for NLap.

Similarly, Fig. 7 presents the ARs versus the number of labeled samples l on real world databases. In Fig. 7(a), (b), (c) and (e), TCPA achieves the best classification results in all cases, and its ARs keep increasing as the number of labeled samples grows. In addition, in Fig. 7(a), (b), (e) and (f), the ARs of LL are evidently lower than those of TCPA. This indicates that the patch alignment process strengthens the classification performance of TCPA.

Fig. 7.

Performance comparison of the transductive classification methods on real world databases. This figure shows the ARs vs. parameter l variations (a) for COIL-20 (b) for Princeton 3D Model (c) for ORL (d) for Yale-B (e) for PIE (f) for USPS.

In summary, Fig. 4 shows that the performance of TCPA is stable with respect to the parameter μ, whereas the bandwidth parameter t strongly influences the performance of Lap and NLap. In addition, the results in Fig. 7 show that TCPA has higher ARs than Lap and NLap in most cases. These two observations suggest that the local model used in TCPA is better than the models used in Lap and NLap (Eq. (21) and Eq. (22)). Compared with LL, the results in Figs. 2, 4, 5, 6 and 7 show that TCPA classifies more stably and accurately. This suggests that summing all the local patch prediction errors (Eq. (12)), as TCPA does, assigns the values more accurately than obtaining the local prediction error from a single data point, as LL does.

Additionally, the six subfigures of Fig. 7 show that the ARs of the three dimensionality reduction methods (LDA, PCA, LPP) are lower than those of the transductive classification algorithms. PCA and LDA construct the low dimensional space without considering the local geometry, so they cannot discover the nonlinear structure in the high dimensional space. LPP preserves the local geometry by using the Gaussian Laplacian, but it is an unsupervised method. These reasons explain why the three methods perform worse than TCPA on all six real world databases.

G-TSVM obtains high ARs in Fig. 7(a) and (f), but in Fig. 7(b), (c) and (e) its performance is similar to that of the dimensionality reduction methods, and in Fig. 7(d) it has the lowest ARs. This indicates that, in comparison with TCPA, G-TSVM classifies unstably across datasets.

5. Conclusions

In this paper, a novel transductive classification algorithm via patch alignment (TCPA) is proposed. For each data point, the algorithm employs a linear transformation model to construct a local patch that predicts the values of its neighboring data, and it adopts a unified objective function to align the local patches and learn the Laplacian matrix. The experiments on toy data and six real world databases show that, compared with the existing transductive classification algorithms (Lap, NLap, LL, G-TSVM), TCPA not only achieves high and stable classification performance but is also insensitive to the parameters. The experimental results also show that the transductive classification algorithms generally outperform the three dimensionality reduction methods (PCA, LDA, LPP), since taking both the labeled and unlabeled data into account enhances their classification ability.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable and insightful comments on an earlier version of this manuscript. This research was partially supported by JiangSu Collaborative Innovation Center for Building Energy Saving and Construction Technology Projects (No. SJXTBZ2110 and No. SJXTZD2103). It is also supported by the Xuzhou industrial key technology research & development projects (No. KC21108 and No. KC21335).

References

1. Astorino and Fuduli, Nonsmooth optimization techniques for semisupervised classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 29(12) (2007), 2135–2142. doi:10.1109/TPAMI.2007.1102.

2. Belkin and Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15(6) (2003), 1373–1396. doi:10.1162/089976603321780317.

3. Blum and Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT’ 98, Association for Computing Machinery, New York, NY, USA, 1998, pp. 92–100. ISBN 1581130570. doi:10.1145/279943.279962.

4. Bottou and Vapnik, Local learning algorithms, Neural Computation 4(6) (1992), 888–900. doi:10.1162/neco.1992.4.6.888.

5. Chapelle and Zien, Semi-supervised classification by low density separation, in: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, R.G. Cowell and Ghahramani, eds, Proceedings of Machine Learning Research, Vol. R5, PMLR, 2005, pp. 57–64. Reissued by PMLR on 30 March 2021, https://proceedings.mlr.press/r5/chapelle05b.html.

6. V.K. Chauhan, Dahiya and Sharma, Problem formulations and solvers in linear SVM: A review, Artificial Intelligence Review 52 (2019), 803–855. doi:10.1007/s10462-018-9614-6.

7. Chong, Ding, Yan and Pan, Graph-based semi-supervised learning: A review, Neurocomputing 408 (2020), 216–230. https://www.sciencedirect.com/science/article/pii/S0925231220304938. doi:10.1016/j.neucom.2019.12.130.

8. A.S. Georghiades, P.N. Belhumeur and D.J. Kriegman, From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6) (2001), 643–660. doi:10.1109/34.927464.

9. He and Niyogi, Locality preserving projections, in: Advances in Neural Information Processing Systems, Thrun, Saul and Schölkopf, eds, Vol. 16, MIT Press, 2003, https://proceedings.neurips.cc/paper/2003/file/d69116f8b0140cdeb1f99a4d5096ffe4-Paper.pdf.

10. Li, Wang, Bi and Jiang, Revisiting transductive support vector machines with margin distribution embedding, Knowledge-Based Systems 152 (2018), 200–214. https://www.sciencedirect.com/science/article/pii/S095070511830176X. doi:10.1016/j.knosys.2018.04.017.

11. S.A. Nene, S.K. Nayar and Murase, Columbia Object Image Library (COIL-20), Technical report, 1996.

12. Osada, Funkhouser, Chazelle and Dobkin, Matching 3D models with shape distributions, in: Proceedings International Conference on Shape Modeling and Applications, 2001, pp. 154–166. doi:10.1109/SMA.2001.923386.

13. Pande and S.P. Awate, Generative deep-neural-network mixture modeling with semi-supervised MinMax+EM learning, in: 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 5666–5673. doi:10.1109/ICPR48806.2021.9412739.

14. S.T. Roweis and L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290(5500) (2000), 2323–2326. doi:10.1126/science.290.5500.2323.

15. F.S. Samaria and A.C. Harter, Parameterisation of a stochastic model for human face identification, in: Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142. doi:10.1109/ACV.1994.341300.

16. Shi, H.-B. Zhang, Li, J.-X. Du, Lei and J.-H. Liu, Shuffle-invariant network for action recognition in videos, ACM Trans. Multimedia Comput. Commun. Appl. 18(3) (2022). doi:10.1145/3485665.

17. Shilane, Min, Kazhdan and Funkhouser, The Princeton shape benchmark, in: Proceedings Shape Modeling Applications, 2004, pp. 167–178. doi:10.1109/SMI.2004.1314504.

18. Sim, Baker and Bsat, The CMU pose, illumination, and expression database, IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12) (2003), 1615–1618. doi:10.1109/TPAMI.2003.1251154.

19. Tang, X.-S. Hua, Wang, Gu, G.-J. Qi and Wu, Correlative linear neighborhood propagation for video annotation, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(2) (2009), 409–416. doi:10.1109/TSMCB.2008.2006045.

20. J.B. Tenenbaum, de Silva and J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290(5500) (2000), 2319–2323. doi:10.1126/science.290.5500.2319.

21. Tharwat, Gaber, Ibrahim and A.E. Hassanien, Linear discriminant analysis: A detailed tutorial, AI Communications 30(2) (2017), 169–190. doi:10.3233/AIC-170729.

22. J.E. Van Engelen and H.H. Hoos, A survey on semi-supervised learning, Machine Learning 109(2) (2020), 373–440. doi:10.1007/s10994-019-05855-6.

23. Wang and Zhang, Label propagation through linear neighborhoods, IEEE Transactions on Knowledge and Data Engineering 20(1) (2008), 55–67. doi:10.1109/TKDE.2007.190672.

24. Wang, Fan and Wang, Comparative analysis of image classification algorithms based on traditional machine learning and deep learning, Pattern Recognition Letters 141 (2021), 61–67. https://www.sciencedirect.com/science/article/pii/S0167865520302981. doi:10.1016/j.patrec.2020.07.042.

25. Widmann and Verberne, Graph-based semi-supervised learning for text classification, in: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ’17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 59–66. ISBN 9781450344906. doi:10.1145/3121050.3121055.

26. Wu and Schölkopf, Transductive classification via local learning regularization, in: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, Meila and Shen, eds, Proceedings of Machine Learning Research, Vol. 2, PMLR, San Juan, Puerto Rico, 2007, pp. 628–635. https://proceedings.mlr.press/v2/wu07a.html.

27. Xiao, Feng and Liu, A new transductive learning method with universum data, Applied Intelligence 51 (2021), 5571–5583. doi:10.1007/s10489-020-02113-4.

28. Yang, Wei, Wang, X.-S. Hua and Zhang, Interactive self-training with mean teachers for semi-supervised object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 5941–5950.

29. Yang, Xu, Nie, Luo and Zhuang, Ranking with local regression and global alignment for cross media retrieval, in: Proceedings of the 17th ACM International Conference on Multimedia, MM ’09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 175–184. ISBN 9781605586083. doi:10.1145/1631272.1631298.

30. H.-B. Zhang, Zhong, Lei, J.-X. Du, Peng, Chen and Ke, Sparse representation-based semi-supervised regression for people counting, ACM Trans. Multimedia Comput. Commun. Appl. 13(4) (2017). doi:10.1145/3106156.

31. Zhang, Tao, Li and Yang, Patch alignment for dimensionality reduction, IEEE Transactions on Knowledge and Data Engineering 21(9) (2009), 1299–1313. doi:10.1109/TKDE.2008.212.

32. Zhang and Zha, Principal manifolds and nonlinear dimensionality reduction via tangent space alignment, SIAM Journal on Scientific Computing 26(1) (2004), 313–338. doi:10.1137/S1064827502419154.

33. Zhou, Bousquet, Lal, Weston and Schölkopf, Learning with local and global consistency, in: Advances in Neural Information Processing Systems, Thrun, Saul and Schölkopf, eds, Vol. 16, MIT Press, 2003, https://proceedings.neurips.cc/paper/2003/file/87682805257e619d49b8e0dfdc14affa-Paper.pdf.

34. Zhu, Ghahramani and Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML’03, AAAI Press, 2003, pp. 912–919. ISBN 1577351894.

Table 1
Database descriptions

Databases   Classes   Dims   Points
Toy data    3         2      1500
COIL-20     20        1024   1440
3D Model    53        500    1811
ORL         40        1024   400
Yale-B      38        1024   2414
PIE         22        1024   3740
USPS        10        256    4000