A new end-to-end semi-supervised deep learning framework for mastering robot-written character identification

Abstract

This paper studies the robot-written character identification problem under an end-to-end semi-supervised deep learning framework consisting of semi-supervised learning and deep learning modules. The learning framework allows a deep neural network to be trained on labeled and pseudo-labeled samples, where pseudo-labeled samples are samples whose labels are predicted by the semi-supervised learning module. Moreover, to guarantee the feasibility of the learning framework, a two-stage strategy is proposed for training the deep neural network. Specifically, the two-stage training strategy first trains the deep neural network on pseudo-labeled samples and then refines it using labeled samples. As a result, more samples can be used for training, which is significant for improving the performance of a deep neural network when labeled samples are inadequate. More importantly, the deep neural networks trained under the proposed learning framework perform better than several well-known deep neural networks in a robot-written character identification experiment.

Keywords

Deep learning; semi-supervised learning; robot-written character; neural networks

1 Introduction

Industrial products are often marked by robots with a series of characters on their surfaces for identification. However, the characters written by robots are unique in font and shape and differ from handwritten characters. The images of the characters written by a robot may lie on a complicated low-dimensional manifold. A feasible technical road map for recognizing industrial products is to learn the patterns of the character images and fit the complicated manifold of the images with a model. Therefore, high-capacity models, such as deep neural networks, have been applied to the recognition of industrial products. Deep neural networks, especially convolutional neural networks, are well suited to learning the patterns of images [1–5]. By stacking a large number of neurons layer by layer, a deep neural network can tackle many complicated problems, such as image classification and speech recognition [6–9]. The progress in computing capability since the 1990s has made it possible to build large-scale neural networks and train them with the aid of a GPU [10]. Over the past decades, a great number of well-known deep network models have been proposed. For example, Simonyan et al. [11] designed a deep convolutional network, called VGG, for large-scale image recognition, where VGG consists of 16–19 layers with very small (3×3) convolution filters. Chollet found that the Inception modules in convolutional neural networks can be interpreted as an intermediate step between regular convolution and depthwise separable convolution. Inspired by Inception, Chollet [12] then proposed a new deep convolutional neural network, called Xception, where Inception modules are replaced with depthwise separable convolutions. He et al. [13] trained deeper neural networks by reformulating the layers as learning residual functions with reference to the layer inputs. A plain network becomes its residual version by inserting shortcut connections. Computational efficiency and low parameter count are the main concerns in various application scenarios, such as mobile vision and big data. Szegedy et al. explained the benefit of the Inception modules of convolutional neural networks in terms of computational cost. Huang et al. [14] proposed the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. DenseNets have several advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. Tan et al. [15] proposed a new approach for uniformly scaling the depth, width, and resolution of convolutional neural networks, and evaluated it by scaling up MobileNets and ResNet. Deep neural networks have been applied to a variety of industrial fields. For example, Chen et al. [34] proposed a semi-supervised recurrent convolutional attention model for human activity recognition. Luo et al. [35] proposed an adaptive semi-supervised feature analysis method for video semantic recognition. Zhang et al. [36] applied spatio-temporal preserving representations to EEG-based human intention recognition. Liu et al. [37], Fang et al. [38], He et al. [39], and Du et al. [40] studied the scene text detection problem with semi-supervised learning theory. Bhunia et al. [41] studied a handwritten text recognition problem with a novel meta-learning framework.

Although deep neural networks show great superiority over traditional machine learning models, a huge number of labeled samples are required for training a deep neural network [16]. When labeled samples are adequate for model training, deep neural networks have proven superior to traditional supervised models in real-world applications, such as image identification and speech recognition. However, labeled samples are inadequate in most real-world application scenarios. As a result, if unlabeled samples can play an important role in model training, a deep neural network can get a performance boost. Allowing a deep learning framework to take advantage of unlabeled samples for model training is a challenging problem. A possible solution is to assign pseudo labels to unlabeled samples and use the pseudo-labeled samples for model training.

To improve the learning ability of a deep neural network in the case of inadequate labeled samples, we construct a learning framework such that a deep neural network can get well-trained with the aid of unlabeled samples. The deep learning framework contains a semi-supervised learning module that is responsible for predicting the true labels of unlabeled samples. After unlabeled samples become pseudo-labeled samples, we then achieve the purpose of training sample set extension by constructing a training sample set of labeled and pseudo-labeled samples. More training samples allow a neural network to avoid the over-fitting problem. The innovations and contributions of this paper are summarized as follows:

1) The first contribution of this paper is to construct a deep learning framework consisting of two modules: a semi-supervised learning module and a deep neural network module. The learning framework allows a deep neural network to avoid the overfitting problem by extending the training sample set with unlabeled samples.

2) The learning framework consists of a semi-supervised learning module for predicting the labels of unlabeled samples so that unlabeled samples can become pseudo-labeled samples. Then, we obtain an extended training sample set of labeled and pseudo-labeled samples, and a deep neural network is allowed to be trained on the extended training sample set.

3) To guarantee the feasibility of the learning framework, a two-stage training strategy is proposed for training a deep neural network. The two-stage strategy trains the deep neural network on the pseudo-labeled samples first and then on the labeled samples. In this way, the deep neural network learns roughly from the pseudo-labeled samples and is then refined using the labeled samples.

4) We implemented an experiment on robot-written character identification to demonstrate the advantages of the proposed learning framework. The experimental results showed that the deep neural network trained under the proposed learning framework outperforms a couple of well-known deep neural networks in terms of identification accuracy.

The remainder of this article is organized as follows. Section 2 will introduce the learning strategy taken by the proposed learning framework and two important modules of the learning framework. In Section 3, an experiment will be implemented to evaluate the performance of the proposed learning framework. At last, conclusions are drawn.

2 The proposed end-to-end semi-supervised deep learning framework

There are two important modules in the proposed learning framework: a semi-supervised learning module and a deep neural network module. The layout of the proposed deep learning framework is depicted in Fig. 1. Under the deep learning framework, the whole sample set is split into two parts: a group of labeled samples and a group of unlabeled samples. Then, the semi-supervised learning module is used to predict the labels of unlabeled samples so that unlabeled samples become pseudo-labeled samples. Finally, the deep neural network is trained on both pseudo-labeled and labeled samples with the aid of a two-stage training strategy.

Fig. 1

Illustration of the proposed semi-supervised deep learning framework.

The purpose of supervised learning frameworks is to infer the labels of unlabeled samples using a model that is well-trained on labeled samples. However, inferring the labels of unlabeled samples is not the ultimate goal of the proposed deep learning framework; rather, it serves to extend the training sample set so that a deep neural network can be trained on more samples. The two-module learning framework first allows a deep neural network to take advantage of pseudo-labeled samples for model training. Then, labeled samples are used to refine the deep neural network.

2.1 The semi-supervised learning module of the proposed learning framework

The semi-supervised learning module of the proposed deep learning framework is important to the overall performance of a deep neural network. We propose a high-performance semi-supervised learning module based on manifold learning and kernel learning theories. Suppose that samples are drawn from a probability distribution P on $X \times R$ and that the conditional probabilities P (y|x_i) and P (y|x_j) are as similar as possible if x_i and x_j are close to each other on a manifold. The semi-supervised learning module aims to learn a function f so that, given a set of labeled samples {(x₁, y₁) , ⋯ , (x_l, y_l)} and a set of unlabeled samples {x_{l+1}, ⋯ , x_n} from k categories, the labels of unlabeled samples can be inferred based on the labeled samples. In general, the function f can be given by [17–19]

$\begin{matrix} f = arg min_{f \in H} \frac{1}{l} \sum_{j = 1}^{l} L (x_{j}, y_{j}, f) + γ_{0} ∥ f ∥_{H}^{2} \\ + γ_{1} ∥ f ∥_{M}^{2} \end{matrix}$ (1) where L (· , · , ·) represents a loss function, y_j is the label of sample x_j for j = 1, ⋯ , l, $H$ represents the reproducing kernel Hilbert space induced by a Mercer kernel K (· , ·): $X \times X \mapsto R$ , $M$ represents the manifold of samples, γ₀ and γ₁ control the weights of the penalty terms $∥ f ∥_{H}^{2}$ and $∥ f ∥_{M}^{2}$ , respectively, and penalizing $∥ f ∥_{H}$ and $∥ f ∥_{M}$ results in a smooth function f.

According to the Representer theorem [17], the solution to optimization problem (1) exists in $H$ as follows

$\begin{matrix} f (x) = \sum_{i = 1}^{n} α_{i} K (x, x_{i}) \end{matrix}$ (2) where α_i is a scalar for i = 1, ⋯ , n, and K (· , ·) is a kernel function, such as Gaussian kernel

$\begin{matrix} K (x_{i}, x_{j}) = \exp (- \frac{∥ x_{i} - x_{j} ∥_{F}^{2}}{δ}) . \end{matrix}$ (3)

Inspired by the above semi-supervised learning theory, the proposed semi-supervised learning module is

$\begin{matrix} [\begin{matrix} Y_{l} \\ Y_{u} \end{matrix}] \approx [\begin{matrix} K_{l} \\ K_{u} \end{matrix}] C \end{matrix}$ (4) where $C \in R_{+}^{n \times k}$ is a nonnegative coefficient matrix, $Y_{l} \in R^{l \times k}$ and $Y_{u} \in R^{(n - l) \times k}$ are the label matrices of labeled and unlabeled samples, respectively, and $K_{l} \in R^{l \times n}$ and $K_{u} \in R^{(n - l) \times n}$ are two matrices whose elements are defined by

$\begin{matrix} K_{ij} = K (x_{i}, x_{j}) \end{matrix}$ (5) where K_ij represents the element of $K = [K_{l}^{T} K_{u}^{T}]^{T}$ at i-th row and j-th column.
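The kernel matrix in (3)–(5) can be illustrated with a minimal NumPy sketch. The function name `gaussian_kernel_matrix`, the toy samples, and the value of δ are assumptions made for illustration; the sketch also assumes that the labeled samples occupy the first l rows, so that K_l and K_u are the top and bottom row blocks of K.

```python
import numpy as np

def gaussian_kernel_matrix(X, delta):
    """Return K with K[i, j] = exp(-||x_i - x_j||^2 / delta), as in Eq. (3)."""
    X = np.asarray(X, dtype=float)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / delta)

# Toy data: four 2-D samples (hypothetical values), the first l = 2 are labeled.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
l = 2
K = gaussian_kernel_matrix(X, delta=2.0)
K_l, K_u = K[:l], K[l:]  # row blocks of K = [K_l^T K_u^T]^T
```

With this layout, K_l has l rows and K_u has n - l rows, and both have n columns, which is what the product with the n × k matrix C in (4) requires.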

Because the labels of labeled samples are already known, the label matrix Y_l is given by $\begin{matrix} {(Y_{l})}_{ij} = \left\{ \begin{matrix} 1, & \text{if the } i \text{-th sample is from the } j \text{-th category} \\ 0, & \text{otherwise} \end{matrix} \right. \end{matrix}$ This strategy for initializing Y_l is known as one-hot encoding. As a result, the column index of the maximal element of each row of matrix $Y = [Y_{l}^{T} Y_{u}^{T}]^{T} \in R^{n \times k}$ indicates the category of a sample.
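As a small illustration of the encoding and of how a category is read back from a row of Y, consider the following sketch. The helper name `one_hot` and the example labels are hypothetical, and categories are indexed from 0 here for convenience.

```python
import numpy as np

def one_hot(labels, k):
    """Build Y_l: row i has a 1 in the column of sample i's category."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, k))
    Y[np.arange(labels.size), labels] = 1.0
    return Y

labels = [2, 0, 1, 2]              # hypothetical category indices (0-based)
Y_l = one_hot(labels, k=3)
decoded = np.argmax(Y_l, axis=1)   # column index of the row-wise maximum
```

The same `argmax` step applies to the rows of Y_u once they have been estimated, which is how pseudo-labels are obtained from a real-valued label matrix.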

Because K is a nonnegative matrix, matrix C should also be nonnegative so that the product of K and C results in a nonnegative label matrix Y_u. In conclusion, the purpose of the proposed semi-supervised learning module is to find the unknown label matrix Y_u and the unknown coefficient matrix C by solving the following optimization problem

$\begin{matrix} \{ Y_{u}, C \} = arg min_{Y_{u} \geq 0, C \geq 0} \frac{1}{2} {∥ [\begin{matrix} Y_{l} \\ Y_{u} \end{matrix}] - [\begin{matrix} K_{l} \\ K_{u} \end{matrix}] C ∥}_{F}^{2} \\ + \frac{1}{2} ∥ C ∥_{F}^{2} + \frac{1}{2} trace (Y^{T} LY) \end{matrix}$ (6) where ∥ · ∥ _F represents the Frobenius norm of a matrix, trace (·) is the trace of a matrix, the following relations

$\begin{matrix} ∥ f ∥_{H}^{2} = ∥ C ∥_{F}^{2} \end{matrix}$ (7) and

$\begin{matrix} ∥ f ∥_{M}^{2} = trace (Y^{T} LY) \end{matrix}$ (8) were used, L = D - W is a Laplacian matrix, W is an affinity matrix with

$\begin{matrix} W_{ij} = \exp (- \frac{∥ x_{i} - x_{j} ∥_{F}^{2}}{δ}), \end{matrix}$ (9) and D_ii = ∑_jW_ij.
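The Laplacian construction and the smoothness term trace (Y^T LY) can be sketched as follows. The function name `graph_laplacian`, the toy samples, and the label matrix are assumptions for illustration. The sketch also evaluates the well-known identity trace (Y^T LY) = ½ ∑_ij W_ij ∥Y_i: − Y_j:∥², which shows why the term penalizes label rows that differ between nearby samples.

```python
import numpy as np

def graph_laplacian(X, delta):
    """Return (L, W, D) with W from Eq. (9), D_ii = sum_j W_ij, and L = D - W."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    W = np.exp(-np.sum(diff ** 2, axis=2) / delta)
    D = np.diag(W.sum(axis=1))
    return D - W, W, D

# Toy samples and a label matrix (hypothetical values).
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
Y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
L, W, D = graph_laplacian(X, delta=1.0)

smoothness = np.trace(Y.T @ L @ Y)

# Pairwise form of the same quantity.
diff_Y = Y[:, None, :] - Y[None, :, :]
pairwise = 0.5 * np.sum(W * np.sum(diff_Y ** 2, axis=2))
```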

For optimization problem (6), we define the following objective function to be minimized

$\begin{matrix} min J (Y_{u}, C) = \frac{1}{2} (∥ Y_{l} - K_{l} C ∥_{F}^{2} + ∥ Y_{u} - K_{u} C ∥_{F}^{2}) \\ + \frac{1}{2} ∥ C ∥_{F}^{2} + \frac{1}{2} trace (Y^{T} LY) \\ = \frac{1}{2} (∥ Y_{l} - K_{l} C ∥_{F}^{2} + ∥ Y_{u} - K_{u} C ∥_{F}^{2} + ∥ C ∥_{F}^{2}) \\ + \frac{1}{2} trace ([\begin{matrix} L_{(ll)} & L_{(lu)} \\ L_{(ul)} & L_{(uu)} \end{matrix}] [\begin{matrix} Y_{l} Y_{l}^{T} & Y_{l} Y_{u}^{T} \\ Y_{u} Y_{l}^{T} & Y_{u} Y_{u}^{T} \end{matrix}]) \\ = \frac{1}{2} (∥ Y_{l} - K_{l} C ∥_{F}^{2} + ∥ Y_{u} - K_{u} C ∥_{F}^{2} + ∥ C ∥_{F}^{2}) \\ + \frac{1}{2} trace (L_{(ll)} Y_{l} Y_{l}^{T} + 2 L_{(lu)} Y_{u} Y_{l}^{T} + L_{(uu)} Y_{u} Y_{u}^{T}) \\ subject to Y_{u} \geq 0, C \geq 0 \end{matrix}$ (10)

where L_(ll), L_(lu), L_(ul), and L_(uu) are sub-blocks of matrix L with appropriate dimensions.
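To make the objective concrete, the sketch below evaluates J (Y_u, C) and compares the block expansion of the Laplacian term in (10) with trace (Y^T LY) numerically. The function names and the random toy data are assumptions for illustration. The block form takes the trace of each block product separately, because the blocks have different sizes when l ≠ n − l, and it relies on L being symmetric.

```python
import numpy as np

def objective(Y_l, Y_u, K, C, L):
    """J(Y_u, C): fit term + ||C||_F^2 / 2 + trace(Y^T L Y) / 2."""
    l = Y_l.shape[0]
    K_l, K_u = K[:l], K[l:]
    Y = np.vstack([Y_l, Y_u])
    fit = np.linalg.norm(Y_l - K_l @ C) ** 2 + np.linalg.norm(Y_u - K_u @ C) ** 2
    return 0.5 * (fit + np.linalg.norm(C) ** 2) + 0.5 * np.trace(Y.T @ L @ Y)

def laplacian_term_blocks(Y_l, Y_u, L):
    """Block expansion of trace(Y^T L Y); each block is traced separately."""
    l = Y_l.shape[0]
    L_ll, L_lu, L_uu = L[:l, :l], L[:l, l:], L[l:, l:]
    return (np.trace(L_ll @ Y_l @ Y_l.T)
            + 2.0 * np.trace(L_lu @ Y_u @ Y_l.T)
            + np.trace(L_uu @ Y_u @ Y_u.T))

# Random toy problem: n = 5 samples, l = 2 labeled, k = 2 categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
W = np.exp(-sq / 1.0)                  # delta = 1 is an arbitrary choice
L = np.diag(W.sum(axis=1)) - W
K = W                                  # the same Gaussian form is used for K here
Y_l = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_u = rng.random((3, 2))
C = rng.random((5, 2))

J = objective(Y_l, Y_u, K, C, L)
Y = np.vstack([Y_l, Y_u])
full_term = np.trace(Y.T @ L @ Y)
block_term = laplacian_term_blocks(Y_l, Y_u, L)
```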

For the above constrained optimization problem, a Lagrange function can be defined as follows

$\begin{matrix} L (Y_{u}, C, Φ, Ψ) \\ = \frac{1}{2} (∥ Y_{l} - K_{l} C ∥_{F}^{2} + ∥ Y_{u} - K_{u} C ∥_{F}^{2}) \\ + \frac{1}{2} trace (L_{(ll)} Y_{l} Y_{l}^{T} + 2 L_{(lu)} Y_{u} Y_{l}^{T} + L_{(uu)} Y_{u} Y_{u}^{T}) \\ + \frac{1}{2} ∥ C ∥_{F}^{2} - trace (Φ^{T} Y_{u}) - trace (Ψ^{T} C) \end{matrix}$ (11)

where Φ and Ψ are two matrices of Lagrangian multipliers.

The partial derivative of L (Y_u, C, Φ, Ψ) with respect to (Y_u) _ij is

$\begin{matrix} \frac{\partial L (Y_{u}, C, Φ, Ψ)}{\partial {(Y_{u})}_{ij}} = {(Y_{u})}_{ij} + {(L_{(uu)} Y_{u})}_{ij} \\ + {(L_{(lu)}^{T} Y_{l})}_{ij} - {(K_{u} C)}_{ij} - Φ_{ij} \end{matrix}$ (12)

According to the Karush-Kuhn-Tucker (KKT) condition Φ_ij (Y_u) _ij = 0, we then have

$\begin{matrix} {(Y_{u})}_{ij} \leftarrow {(Y_{u})}_{ij} \times \\ \sqrt{\frac{{(K_{u} C)}_{ij}}{{(Y_{u})}_{ij} + {(L_{(uu)} Y_{u})}_{ij} + {(L_{(lu)}^{T} Y_{l})}_{ij}}} \end{matrix}$ (13)

The partial derivative of L (Y_u, C, Φ, Ψ) with respect to C_ij is

$\begin{matrix} \frac{\partial L (Y_{u}, C, Φ, Ψ)}{\partial C_{ij}} = {(K_{l}^{T} K_{l} C)}_{ij} + {(K_{u}^{T} K_{u} C)}_{ij} \\ - {(K_{l}^{T} Y_{l})}_{ij} - {(K_{u}^{T} Y_{u})}_{ij} \\ + C_{ij} - Ψ_{ij} \end{matrix}$ (14)

According to the KKT condition Ψ_ijC_ij = 0, we then have

$\begin{matrix} C_{ij} \leftarrow C_{ij} \sqrt{\frac{{(K_{l}^{T} Y_{l})}_{ij} + {(K_{u}^{T} Y_{u})}_{ij}}{{(K_{l}^{T} K_{l} C)}_{ij} + {(K_{u}^{T} K_{u} C)}_{ij} + C_{ij}}} \end{matrix}$ (15)

The algorithm for finding the label matrix Y_u and the coefficient matrix C applies the update rules (13) and (15) alternately until the objective function J (Y_u, C) stops decreasing. Note that the update rules for C and Y_u are element-wise multiplicative update rules, which guarantee the nonnegativity of C and Y_u. Next, we discuss the convergence of the multiplicative update rules.
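The alternating scheme can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not a reference implementation. In particular, the sketch splits the Laplacian as L = D − W and moves the W terms of the gradient into the numerator of the Y_u step, so that every numerator and denominator is nonnegative; this splitting is an assumption of the sketch and differs from the form of (13) as printed, although both share the same stationarity condition. The function name, the iteration count, the small constant `eps`, and the toy data are likewise hypothetical.

```python
import numpy as np

def multiplicative_updates(K, W, Y_l, n_iter=300, eps=1e-12, seed=0):
    """Alternate nonnegative updates for Y_u and C (cf. rules (13) and (15)).

    Assumption: L = D - W is split so that ratios stay nonnegative.
    Labeled samples are assumed to occupy the first l rows of K and W.
    """
    n = K.shape[0]
    l, k = Y_l.shape
    K_l, K_u = K[:l], K[l:]
    d_u = W.sum(axis=1)[l:, None]        # diagonal of D for the unlabeled rows
    W_uu, W_ul = W[l:, l:], W[l:, :l]
    rng = np.random.default_rng(seed)
    Y_u = rng.random((n - l, k)) + eps   # random nonnegative initialization
    C = rng.random((n, k)) + eps
    for _ in range(n_iter):
        # Y_u step: gradient is Y_u + (D_uu - W_uu) Y_u - W_ul Y_l - K_u C
        num = K_u @ C + W_uu @ Y_u + W_ul @ Y_l
        den = Y_u + d_u * Y_u + eps
        Y_u = Y_u * np.sqrt(num / den)
        # C step: rule (15)
        num = K_l.T @ Y_l + K_u.T @ Y_u
        den = K_l.T @ K_l @ C + K_u.T @ K_u @ C + C + eps
        C = C * np.sqrt(num / den)
    return Y_u, C

# Toy problem: two well-separated groups, four labeled and three unlabeled samples.
X_l = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
Y_l = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
X_u = np.array([[0.1, 0.2], [2.9, 3.2], [0.3, 0.0]])
X = np.vstack([X_l, X_u])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
W = np.exp(-sq / 1.0)                    # delta = 1 is an arbitrary choice
K = W

Y_u, C = multiplicative_updates(K, W, Y_l)
pseudo_labels = np.argmax(Y_u, axis=1)   # category index per unlabeled sample
```

Because each step multiplies a nonnegative entry by a nonnegative factor, Y_u and C remain nonnegative throughout, which is the property the multiplicative form is chosen for.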

For the above algorithm, we have the following conclusions: 1) for a fixed C, the objective function J (Y_u, C) is non-increasing under the update rule for Y_u; 2) for a fixed Y_u, the objective function J (Y_u, C) is non-increasing under the update rule for C. Next, we use an auxiliary function-based method to prove the convergence of the above algorithm.

Given a nonnegative matrix C′, we construct

$\begin{matrix} G (C, C^{'}) = \frac{1}{2} trace (Y_{l}^{T} Y_{l}) + \frac{1}{2} trace (Y_{u}^{T} Y_{u}) \\ - \sum_{ij} {(K_{l}^{T} Y_{l})}_{ij} {C^{'}}_{ij} (1 + \log \frac{C_{ij}}{{C^{'}}_{ij}}) \\ - \sum_{ij} {(K_{u}^{T} Y_{u})}_{ij} {C^{'}}_{ij} (1 + \log \frac{C_{ij}}{{C^{'}}_{ij}}) \\ + \frac{1}{2} \sum_{ij} \frac{{(K_{u}^{T} K_{u} C^{'})}_{ij} C_{ij}^{2}}{{C^{'}}_{ij}} + \frac{1}{2} \sum_{ij} \frac{{(K_{l}^{T} K_{l} C^{'})}_{ij} C_{ij}^{2}}{{C^{'}}_{ij}} \\ + \frac{1}{2} ∥ C ∥_{F}^{2} + \frac{1}{2} trace (Y^{T} LY) \end{matrix}$ (16)

as an auxiliary function of J (Y_u, C). Because

$\begin{matrix} J (Y_{u}, C^{'}) \leq G (C, C^{'}), \end{matrix}$ (17) it follows from the definition of an auxiliary function [33] that G (C, C′) is an auxiliary function of J (Y_u, C′).

If the nonnegative matrix C′ is given by

$\begin{matrix} C^{'} = arg min_{C} G (C, C^{'}), \end{matrix}$ (18) then we have the following inequalities

$\begin{matrix} J (Y_{u}, C^{'}) \leq G (C, C^{'}) \leq G (C, C) = J (Y_{u}, C) . \end{matrix}$ (19) The above inequalities demonstrate that the objective function J (Y_u, C′) is non-increasing under the update rule for C′ derived from (18).

The inequalities in (19) inspire us to compute the partial derivative of G (C, C′) with respect to C_ij as follows

$\begin{matrix} \frac{\partial G (C, C^{'})}{\partial C_{ij}} = - {(K_{l}^{T} Y_{l})}_{ij} \frac{{C^{'}}_{ij}}{C_{ij}} - {(K_{u}^{T} Y_{u})}_{ij} \frac{{C^{'}}_{ij}}{C_{ij}} \\ + \frac{{(K_{u}^{T} K_{u} C^{'})}_{ij} C_{ij}}{{C^{'}}_{ij}} + \frac{{(K_{l}^{T} K_{l} C^{'})}_{ij} C_{ij}}{{C^{'}}_{ij}} + C_{ij}, \end{matrix}$ (20) then $\frac{\partial G (C, C^{'})}{\partial C_{ij}} = 0$ results in an update rule for C_ij that is identical to (15). Therefore, the objective function J (Y_u, C′) is non-increasing under the update rule (15).
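As a worked step, the stationarity condition of (20) can be solved for C_ij explicitly. The shorthand A and B below is introduced only for this derivation and is not part of the original notation; C′ plays the role of the current iterate and C that of the updated one.

```latex
\begin{aligned}
A &= K_{l}^{T} Y_{l} + K_{u}^{T} Y_{u}, \qquad
B = K_{l}^{T} K_{l} + K_{u}^{T} K_{u}, \\
0 &= - A_{ij} \frac{C'_{ij}}{C_{ij}} + \frac{(B C')_{ij} \, C_{ij}}{C'_{ij}} + C_{ij}, \\
A_{ij} \, {C'_{ij}}^{2} &= C_{ij}^{2} \left( (B C')_{ij} + C'_{ij} \right)
  \quad \text{(multiplying by } C_{ij} C'_{ij} \text{)}, \\
C_{ij} &= C'_{ij} \sqrt{\frac{A_{ij}}{(B C')_{ij} + C'_{ij}}} .
\end{aligned}
```

Expanding A and B recovers the numerator and denominator of the multiplicative rule (15).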

Similarly, given a nonnegative matrix Y′_u, we construct

$\begin{matrix} F (Y_{u}, {Y^{'}}_{u}) = \frac{1}{2} trace (Y_{u}^{T} Y_{u}) \\ + \frac{1}{2} trace (C^{T} K_{u}^{T} K_{u} C) \\ - \sum_{ij} {(K_{u} C)}_{ij} {({Y^{'}}_{u})}_{ij} (1 + \log \frac{{(Y_{u})}_{ij}}{{({Y^{'}}_{u})}_{ij}}) \\ + \frac{1}{2} \sum_{ij} \frac{{(L_{(uu)} {Y^{'}}_{u})}_{ij} {(Y_{u})}_{ij}^{2}}{{({Y^{'}}_{u})}_{ij}} \\ + \frac{1}{2} \sum_{ij} {(L_{(lu)}^{T} Y_{l})}_{ij} \frac{{(Y_{u})}_{ij}^{2} + {({Y^{'}}_{u})}_{ij}^{2}}{{({Y^{'}}_{u})}_{ij}} \\ + \frac{1}{2} ∥ Y_{l} - K_{l} C ∥_{F}^{2} + \frac{1}{2} ∥ C ∥_{F}^{2} + \frac{1}{2} trace (L_{(ll)} Y_{l} Y_{l}^{T}) \end{matrix}$ (21)

as an auxiliary function of J (Y_u, C).

Obviously, the following inequality holds

$\begin{matrix} J (Y_{u}, C) \leq F (Y_{u}, {Y^{'}}_{u}), \end{matrix}$ (22) i.e., F (Y_u, Y′_u) is an auxiliary function of J (Y_u, C′).

If the nonnegative matrix Y′_u is given by

$\begin{matrix} {Y^{'}}_{u} = arg min_{Y_{u}} F (Y_{u}, {Y^{'}}_{u}), \end{matrix}$ (23) then we have the following inequalities

$\begin{matrix} J (Y_{u}, C^{'}) \leq F (Y_{u}, {Y^{'}}_{u}) \\ \leq F (Y_{u}, Y_{u}) = J (Y_{u}, C) \end{matrix}$ (24)

The partial derivative of F (Y_u, Y′_u) with respect to (Y_u) _ij is

$\begin{matrix} \frac{\partial F (Y_{u}, {Y^{'}}_{u})}{\partial {(Y_{u})}_{ij}} = {(Y_{u})}_{ij} - {(K_{u} C)}_{ij} \frac{{({Y^{'}}_{u})}_{ij}}{{(Y_{u})}_{ij}} \\ + \frac{{(L_{(uu)} {Y^{'}}_{u})}_{ij} {(Y_{u})}_{ij}}{{({Y^{'}}_{u})}_{ij}} + {(L_{(lu)}^{T} Y_{l})}_{ij} \frac{{(Y_{u})}_{ij}}{{({Y^{'}}_{u})}_{ij}} \end{matrix}$ (25)

Then $\frac{\partial F (Y_{u}, {Y^{'}}_{u})}{\partial {(Y_{u})}_{ij}} = 0$ results in an update rule for (Y_u) _ij that is identical to (13). Therefore, the objective function J (Y_u, C′) is non-increasing under the update rule (13). As a result, the algorithm for finding Y_u and C is convergent.

In conclusion, the multiplicative update rules (13) and (15) guarantee not only the convergence but also the nonnegativity of matrices C and Y_u. Moreover, the pseudo-labels of unlabeled samples are available after Y_u converges. Finally, a set of labeled samples and a set of pseudo-labeled samples are available for training a deep neural network.

2.2 The deep neural network module of the proposed learning framework

A deep neural network can be competent for classification tasks only when it gets well-trained on a group of labeled training samples. Model training is to search a set of parameters using an optimization algorithm such that the deep neural network can output a correct label y for each training sample x. Meanwhile, model training should also pay attention to the generalization performance. This section will discuss the strategy for training a deep neural network.

In the proposed end-to-end semi-supervised learning framework, the output layer of the deep neural network is a soft-max activation layer. As a result, given an input sample, the output of the deep neural network is a label vector

$\begin{matrix} y_{i}^{dnn} = \frac{e^{z_{i}}}{\sum_{j} e^{z_{j}}} \end{matrix}$ (26) where $y_{i}^{dnn}$ is the i-th element of the label vector y^dnn, and z_i denotes the i-th input to the soft-max layer.
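A minimal sketch of the soft-max output in (26) is given below. The inputs to the soft-max layer are called logits here, a name that does not appear in the paper, and subtracting the maximum before exponentiation is a standard numerical-stability step that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Numerically stable soft-max over a list of pre-activation scores."""
    m = max(logits)                          # subtracting the max avoids overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])             # hypothetical scores for k = 3 categories
stable = softmax([1000.0, 1000.0])           # would overflow without the shift
```

The outputs are positive and sum to one, so the vector can be compared with a one-hot label through the cross entropy.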

The cross-entropy loss function defined as follows is used for model training

$\begin{matrix} loss = \sum_{j = 1}^{n - l} H ({(Y_{u})}_{j :} ∥ {(Y_{u}^{dnn})}_{j :}) \\ + \sum_{j = 1}^{l} H ({(Y_{l})}_{j :} ∥ {(Y_{l}^{dnn})}_{j :}) \end{matrix}$ (27) where H (· ∥ ·) denotes the cross entropy between two probability distributions, Y_l and Y_u represent the label matrices of labeled samples and pseudo-labeled samples, respectively, $Y_{l}^{dnn}$ and $Y_{u}^{dnn}$ represent the matrices of the outputs of the deep neural network with respect to labeled and pseudo-labeled samples, respectively, and (Y_l) _j: and (Y_u) _j: represent the j-th rows of Y_l and Y_u, respectively.

Given a one-hot label vector y of sample x and the output y^dnn of the deep neural network with respect to x, the following relations hold

$\begin{matrix} H (y ∥ y^{dnn}) = H (y) + D_{KL} (y ∥ y^{dnn}) \\ = - \sum_{j} y_{j} \log (y_{j}) \\ + \sum_{j} y_{j} \log (\frac{y_{j}}{y_{j}^{dnn}}) \end{matrix}$ (28) where H (y) represents the entropy of vector y, and D_KL (y ∥ y^dnn) represents the Kullback-Leibler divergence between y and y^dnn. Because y is a one-hot vector, H (y) = 0. Further, we have the following relations

$\begin{matrix} H (y ∥ y^{dnn}) = D_{KL} (y ∥ y^{dnn}) \\ = \sum_{j} y_{j} \log (\frac{y_{j}}{y_{j}^{dnn}}) \\ = 1 \times \log (\frac{1}{y_{i}^{dnn}}) + \sum_{j \neq i} 0 \times \log (\frac{0}{y_{j}^{dnn}}) \\ = - \log (y_{i}^{dnn}) \end{matrix}$ (29) where y_j = 0 for j = 1, ⋯ , k except j = i.
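The reduction in (29) can be checked numerically with a short sketch. The function name and the example vectors are hypothetical; terms with y_j = 0 are skipped, which matches the convention 0 × log (·) = 0 used in the derivation.

```python
import math

def cross_entropy(y, y_dnn):
    """H(y || y_dnn) = -sum_j y_j log(y_dnn_j), skipping terms with y_j = 0."""
    return -sum(t * math.log(p) for t, p in zip(y, y_dnn) if t > 0)

y = [0.0, 1.0, 0.0]            # one-hot label: the sample belongs to category i = 2
y_dnn = [0.2, 0.7, 0.1]        # hypothetical network output
loss = cross_entropy(y, y_dnn)
expected = -math.log(0.7)      # -log of the output at the true category
```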

Applying (29) to (27), we can obtain a well-defined loss function for model training. Next, we will introduce the two-stage strategy for training a deep neural network using pseudo-labeled and labeled samples, separately. The pseudo-labeled samples are responsible for pre-training the deep neural network, while the labeled samples are responsible for fine-tuning the deep neural network.

The first stage of the two-stage training strategy is to minimize the loss function on pseudo-labeled samples using a numerical optimization algorithm as follows

$\begin{matrix} θ \leftarrow θ - lr \times \frac{\partial \sum_{j = 1}^{n - l} H ({(Y_{u})}_{j :} ∥ {(Y_{u}^{dnn})}_{j :})}{\partial θ} \end{matrix}$ (30) where θ is the trainable parameter set of the deep neural network, and lr represents the learning rate. The optimization algorithm used in this paper is Adam, a stochastic gradient-descent algorithm with an adaptive rather than fixed learning rate. The back-propagation algorithm is then used to update the model parameters with the end-to-end open-source machine learning platform TensorFlow [16].

The second stage of the two-stage training strategy is to train the deep neural network using labeled samples. In this stage, a numerical optimization algorithm as follows is adopted to search a set of model parameters

$\begin{matrix} θ \leftarrow θ - lr \times \frac{\partial \sum_{j = 1}^{l} H ({(Y_{l})}_{j :} ∥ {(Y_{l}^{dnn})}_{j :})}{\partial θ} \end{matrix}$ (31)
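The two stages in (30) and (31) can be illustrated with a deliberately simplified sketch. Two substitutions are made so that the example stays self-contained, and both are assumptions of the sketch rather than the setup of the paper: a linear soft-max classifier stands in for the deep neural network, and plain full-batch gradient descent on the mean cross entropy stands in for Adam. All names, the synthetic data, and the hyperparameters are hypothetical.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # stability shift
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def mean_cross_entropy(theta, X, Y):
    P = np.clip(softmax_rows(X @ theta), 1e-12, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def train_stage(theta, X, Y, lr=0.1, epochs=200):
    """theta <- theta - lr * d(loss)/d(theta), repeated for `epochs` steps."""
    for _ in range(epochs):
        P = softmax_rows(X @ theta)
        grad = X.T @ (P - Y) / X.shape[0]    # gradient of the mean cross entropy
        theta = theta - lr * grad
    return theta

rng = np.random.default_rng(0)

def make_cluster(center, m):
    pts = np.asarray(center) + 0.3 * rng.standard_normal((m, 2))
    return np.hstack([pts, np.ones((m, 1))])  # append a bias feature

# Pseudo-labeled set (larger) and labeled set (smaller); labels are one-hot.
X_u = np.vstack([make_cluster([-1.0, -1.0], 20), make_cluster([1.0, 1.0], 20)])
Y_u = np.repeat(np.eye(2), 20, axis=0)        # stand-in for the estimated pseudo-labels
X_l = np.vstack([make_cluster([-1.0, -1.0], 3), make_cluster([1.0, 1.0], 3)])
Y_l = np.repeat(np.eye(2), 3, axis=0)

theta0 = np.zeros((3, 2))
loss_init = mean_cross_entropy(theta0, X_u, Y_u)

# Stage 1: pre-train on the pseudo-labeled samples, as in (30).
theta1 = train_stage(theta0, X_u, Y_u)
loss_stage1 = mean_cross_entropy(theta1, X_u, Y_u)

# Stage 2: refine on the labeled samples, as in (31), starting from theta1.
loss_l_before = mean_cross_entropy(theta1, X_l, Y_l)
theta2 = train_stage(theta1, X_l, Y_l)
loss_l_after = mean_cross_entropy(theta2, X_l, Y_l)
```

The point of the sketch is the hand-over between the stages: the parameters produced by the first stage are the starting point of the second, so the labeled samples only have to refine a model that the pseudo-labeled samples have already shaped.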

After model training is finished, we obtain a well-trained deep neural network. The procedures for training a deep neural network are summarized in Table 1.

Table 1

The procedures for training a deep neural network

Steps	Calculations
1	Collect a set of samples from k categories
2	Split sample set into labeled , unlabeled, and test samples
3	Initialize matrix Y_l and parameter δ
4	Calculate matrix K, W, D, and L
5	Initialize Y_u and C with nonnegative matrices randomly
6	loop :
7	for i = 1 to n - l
8	for j = 1 to k
9	${(Y_{u})}_{ij} \leftarrow {(Y_{u})}_{ij} \sqrt{\frac{{(K_{u} C)}_{ij}}{{(Y_{u})}_{ij} + {(L_{(uu)} Y_{u})}_{ij} + {(L_{(lu)}^{T} Y_{l})}_{ij}}}$
10	end
11	end
12	for i = 1 to n
13	for j = 1 to k
14	$C_{ij} \leftarrow C_{ij} \sqrt{\frac{{(K_{l}^{T} Y_{l})}_{ij} + {(K_{u}^{T} Y_{u})}_{ij}}{{(K_{l}^{T} K_{l} C)}_{ij} + {(K_{u}^{T} K_{u} C)}_{ij} + C_{ij}}}$
15	end
16	end
17	end loop : until convergence
18	Specify the architecture of the deep neural network
19	Select an optimization algorithm for model training
20	Set the epochs for model training
21	for i = 1 to epochs
22	$θ \leftarrow θ - lr \times \frac{\partial \sum_{j = 1}^{n - l} H ({(Y_{u})}_{j :} ∥ {(Y_{u}^{dnn})}_{j :})}{\partial θ}$
23	end
24	for i = 1 to epochs
25	$θ \leftarrow θ - lr \times \frac{\partial \sum_{j = 1}^{l} H ({(Y_{l})}_{j :} ∥ {(Y_{l}^{dnn})}_{j :})}{\partial θ}$
26	end
27	After θ gets convergent, launch the deep neural network

3 An experiment on robot-written character identification

In this section, we carry out an experiment to evaluate the performance of the semi-supervised deep learning framework. Moreover, we compare the deep neural network trained under the semi-supervised learning framework with both traditional machine learning models (not deep neural networks) and well-known deep neural networks. Traditional machine learning models include decision tree (DT) [31], k-nearest neighbors (KNN) [27–30], support vector machine (SVM) [20–23], and kernel support vector machine (KSVM) [24–26]. Deep neural network models include Vgg16 [11], DenseNet [14], Xception [12], ResNet [13], and EfficientNet [15]. In this experiment, each model faces a robot-written character identification task where the characters were written by a robot on the surface of steel coils. The images of the working robot are shown in Fig. 2. The sample images of the characters written by the robot are shown in Fig. 3.

Fig. 2

Images of the working robot.

Fig. 3

Two sample images of the characters written by the robot.

We take several steps to solve the robot-written character identification problem: model selection, image preprocessing, model training, and model evaluation. The first step is to design the architecture of the deep neural network. The deep neural network for the character identification task is a convolutional neural network (CNN), as shown in Fig. 4. The CNN consists of three blocks, where the first and second blocks each include convolutional layers and max-pooling layers, and the third block consists of a set of fully-connected layers. Specifically, the convolutional layer in the first block has kernel-size=5, filters=9, and padding=same, the max-pooling layer in the first block has pool-size=2, and the first block also includes a batch normalization and a ReLU activation. The second block of the CNN consists of convolutional layers with kernel-size=5, filters=27, and padding=same, max-pooling layers with pool-size=2, a batch normalization, and a ReLU activation. The third block of the CNN consists of a flatten layer and three fully-connected layers, where the numbers of neurons in the fully-connected layers are 215, 75, and 15, respectively. Finally, the output of the CNN comes from the last fully-connected layer with a softmax activation.
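To make the architecture easier to follow, the sketch below traces the tensor shape through the three blocks and counts the trainable weights of the convolutional and fully-connected layers. Several details are assumptions of the sketch rather than statements of the paper: one convolutional layer per block, a single-channel input of height 40 and width 80 (the size of the preprocessed images), a pooling stride equal to the pool size, and batch-normalization parameters left out of the count. The helper names are hypothetical.

```python
def conv_same(shape, filters):
    """padding=same keeps height and width; channels become `filters`."""
    h, w, _ = shape
    return (h, w, filters)

def max_pool(shape, pool):
    """Non-overlapping pooling divides height and width by `pool`."""
    h, w, c = shape
    return (h // pool, w // pool, c)

def conv_params(kernel, in_channels, filters):
    return kernel * kernel * in_channels * filters + filters   # weights + biases

def dense_params(n_in, n_out):
    return n_in * n_out + n_out                                # weights + biases

shape = (40, 80, 1)                           # assumed input: height, width, channels

# Block 1: conv (kernel 5, 9 filters, same padding) followed by 2x2 max-pooling.
conv1 = conv_params(5, shape[2], 9)
shape = max_pool(conv_same(shape, 9), 2)
block1_shape = shape

# Block 2: conv (kernel 5, 27 filters, same padding) followed by 2x2 max-pooling.
conv2 = conv_params(5, shape[2], 27)
shape = max_pool(conv_same(shape, 27), 2)
block2_shape = shape

# Block 3: flatten, then fully-connected layers with 215, 75, and 15 neurons.
flat = shape[0] * shape[1] * shape[2]
sizes = [flat, 215, 75, 15]
dense = [dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
total = conv1 + conv2 + sum(dense)
```

Under these assumptions, almost all of the weights sit in the first fully-connected layer, since the flattened feature map is much larger than any later layer.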

Fig. 4

Illustration of the architecture of the CNN for robot-written character identification. The CNN will serve as a classifier whose input is an image, and output is a vector, i.e., classification result of the input image.

The second step is image preprocessing. The whole image set consists of a total of 285 images. Image preprocessing includes gray image transformation, image resizing, de-noising, and binary image transformation: gray image transformation converts the original images into grayscale images, image resizing rescales the grayscale images to 40×80 pixels, de-noising removes noise from the images with a median filter, and binary image transformation converts the images into binary images. The sample images after image preprocessing are shown in Fig. 5. Moreover, a two-dimensional visualization of these images by t-SNE [32] is given in Fig. 6.
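The four preprocessing operations can be sketched with NumPy alone. The specific choices below are assumptions made for illustration, since the paper does not state them: channel averaging for the grayscale conversion, nearest-neighbour resizing, a 3×3 median window, a fixed threshold of 127.5, and an output of height 40 and width 80. The input is a synthetic random image, not one of the robot-written character images.

```python
import numpy as np

def to_gray(rgb):
    """Average the colour channels (one simple grayscale conversion)."""
    return rgb.astype(float).mean(axis=2)

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize to (out_h, out_w)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols[None, :]]

def median_filter_3x3(img):
    """3x3 median filter with edge replication at the border."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    windows = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(windows), axis=0)

def binarize(img, threshold=127.5):
    return (img > threshold).astype(np.uint8)

def preprocess(rgb, out_h=40, out_w=80):
    gray = to_gray(rgb)
    small = resize_nearest(gray, out_h, out_w)
    clean = median_filter_3x3(small)
    return binarize(clean)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(120, 240, 3))   # synthetic stand-in image
binary = preprocess(image)

# A single bright pixel on a dark background is removed by the median filter.
speckle = np.zeros((5, 5))
speckle[2, 2] = 255.0
filtered = median_filter_3x3(speckle)
```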

Fig. 5

Preprocessed images of the characters.

Fig. 6

Visualization result of the preprocessed images by t-SNE. Each type of the processed character images is mapped onto a two-dimensional plane and represented by dots in different colors and sizes.

The third step is model training. In this step, we split the whole image set into three parts, including a set of labeled images, a set of unlabeled images, and a set of images for the test. Then we take the semi-supervised learning module to predict the labels of unlabeled images so that unlabeled images become pseudo-labeled images. Finally, we take the two-stage training strategy to train the CNN on the labeled and pseudo-labeled images, separately.

The last step is to launch the well-trained CNN and evaluate its performance on the test images. Moreover, the compared models were trained on the labeled images. The strategy taken by SVM and KSVM for this multi-class image identification task is one-vs-rest. To evaluate the performance of each model quantitatively, we selected the accuracy of image identification as the performance metric, where accuracy can be derived from the confusion matrices of the test images. Confusion matrices computed by the models are shown in Figs. 7 and 8 for the case of 20% labeled images. Comparing Figs. 7 and 8, one can conclude that the CNN performs better than the traditional models in terms of accuracy.
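Accuracy follows from a confusion matrix as the share of test samples on the diagonal, which the short sketch below illustrates. The counts are hypothetical and are not taken from Figs. 7 or 8.

```python
def accuracy_from_confusion(cm):
    """Accuracy = correctly identified samples (diagonal) / all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Rows: true category, columns: predicted category (hypothetical counts).
cm = [
    [5, 1, 0],
    [0, 4, 2],
    [0, 0, 6],
]
acc = accuracy_from_confusion(cm)
```

A confusion matrix with all mass on the diagonal gives an accuracy of 1, which is why a matrix closer to diagonal indicates a better classifier.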

Fig. 7

Confusion matrices of DT, KNN, SVM, and KSVM with 20% labeled training images, where a confusion matrix closer to a diagonal matrix indicates a higher accuracy.

Fig. 8

Confusion matrix of the CNN with 20% labeled training images. The confusion matrix in Fig. 8 is more like a diagonal matrix than the confusion matrices in Fig. 7. As a result, the CNN has a higher identification accuracy than its competitors.

We also evaluated the performance of each model with different numbers of labeled training images by increasing the proportion of labeled training images gradually and observing the performance change of each model. The performance of each model as the proportion of labeled images increases from 15% to 35% in steps of 5% is summarized in Table 2. Table 2 shows that the CNN trained under the semi-supervised deep learning framework performs best when the number of labeled images is small. Moreover, the performance of the CNN stays at a high level as the number of labeled images increases. In conclusion, the CNN trained under the new semi-supervised deep learning framework outperforms the traditional models in almost all cases, no matter how the number of labeled images changes, as shown in Fig. 9.

Fig. 9

Identification accuracy distributions of the traditional models with different numbers of labeled training images.

Table 2

Identification accuracies of the traditional models with different numbers of labeled images (LI)

Models	15% LI	20% LI	25% LI	30% LI	35% LI
DT	46.28%	52.63%	62.62%	72.00%	60.22%
KNN	61.16%	71.05%	82.24%	95.00%	93.55%
SVM	90.91%	92.98%	100.00%	100.00%	100.00%
KSVM	91.73%	90.35%	91.59%	99.00%	98.92%
CNN	95.87%	100.00%	100.00%	99.00%	100.00%
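The comparison in Table 2 can be checked mechanically. The sketch below copies the accuracies from Table 2 and lists the model or models with the highest accuracy at each proportion of labeled images. The dictionary layout and the helper `best_models` are illustrative choices, not part of the original experiment code.

```python
# Accuracies (%) copied from Table 2; columns are 15%, 20%, 25%, 30%, 35% LI.
table2 = {
    "DT":   [46.28, 52.63, 62.62, 72.00, 60.22],
    "KNN":  [61.16, 71.05, 82.24, 95.00, 93.55],
    "SVM":  [90.91, 92.98, 100.00, 100.00, 100.00],
    "KSVM": [91.73, 90.35, 91.59, 99.00, 98.92],
    "CNN":  [95.87, 100.00, 100.00, 99.00, 100.00],
}
fractions = [15, 20, 25, 30, 35]


def best_models(table, column):
    """Return the names of all models tied for the highest accuracy in a column."""
    top = max(acc[column] for acc in table.values())
    return sorted(name for name, acc in table.items() if acc[column] == top)


best = {f: best_models(table2, i) for i, f in enumerate(fractions)}
for f, names in best.items():
    print(f"{f}% LI: {', '.join(names)}")
```

The CNN is the sole best model at 15% and 20% LI and ties with SVM at 25% and 35% LI. At 30% LI, SVM is one percentage point ahead.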

The identification accuracies of the deep models with different numbers of labeled images are depicted in Fig. 10 and listed in Table 3. Compared with the CNN, the deep models in Table 3 generally yield lower identification accuracy and are not competent for image identification with a small number of labeled training images. This agrees with the conclusion drawn in the previous sections of this paper, i.e., the performance of deep neural networks depends heavily on the quantity of labeled training samples.

Fig. 10

Identification accuracy distributions of the deep models with different numbers of labeled training images.

Table 3

Identification accuracies of the deep models with different numbers of labeled images (LI)

Models	15% LI	20% LI	25% LI	30% LI	35% LI
VGG16	14.81%	67.54%	96.72%	72.00%	97.85%
DenseNet	13.58%	15.35%	28.50%	44.00%	55.91%
Xception	34.15%	50.44%	84.58%	87.50%	86.02%
ResNet	38.68%	85.96%	89.25%	98.99%	100.00%
EfficientNet	45.27%	53.07%	83.18%	100.00%	100.00%

To evaluate the computational efficiency of each model, we recorded the time each model spent identifying the test images. Tables 4 and 5 report the computational efficiency of the traditional models and the deep models, respectively. All timings were measured on a laptop with an i7 CPU and an RTX 2060 GPU.

Table 4

Computational efficiency of the traditional models for identifying test images

Models	15% LI	20% LI	25% LI	30% LI	35% LI
DT	0.002s	0.003s	0.002s	0.001s	0.001s
KNN	0.028s	0.032s	0.041s	0.042s	0.048s
SVM	0.109s	0.108s	0.116s	0.112s	0.108s
KSVM	0.183s	0.194s	0.211s	0.205s	0.200s
CNN	0.191s	0.138s	0.131s	0.230s	0.132s

Table 5

Computational efficiency of the deep models for identifying test images

Models	15% LI	20% LI	25% LI	30% LI	35% LI
VGG16	1.374s	1.256s	1.199s	1.166s	1.066s
DenseNet	2.344s	2.811s	2.440s	2.375s	4.033s
Xception	1.546s	1.440s	1.403s	1.340s	1.284s
ResNet	1.581s	1.455s	1.389s	1.362s	1.339s
EfficientNet	1.546s	1.849s	1.389s	1.432s	1.398s
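Inference times such as those in Tables 4 and 5 can be measured along the following lines. This is a sketch under assumptions: the text does not state how the timings were taken, `dummy_predict` is a hypothetical stand-in for a trained model's prediction call, and taking the best of several runs is one common convention rather than the procedure used here.

```python
import time


def time_inference(predict, inputs, repeats=5):
    """Return the best wall-clock time in seconds of predict(inputs) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(inputs)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def dummy_predict(batch):
    # Hypothetical stand-in for a trained model's prediction on the test images.
    return [sum(x) % 10 for x in batch]


test_batch = [[i, i + 1, i + 2] for i in range(1000)]
elapsed = time_inference(dummy_predict, test_batch)
print(f"{elapsed:.3f}s")
```

Timing only the prediction call, and not data loading or model construction, keeps the measurement comparable across models.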

2 Conclusions

We have presented an end-to-end semi-supervised deep learning framework for training a deep neural network when labeled samples are inadequate. By assigning pseudo labels to unlabeled samples, a deep neural network can be trained on an extended sample set consisting of pseudo-labeled and labeled samples. The proposed learning framework consists of two modules: a semi-supervised learning module and a deep neural network module. The semi-supervised learning module predicts the labels of unlabeled samples so that they become pseudo-labeled samples, and the deep neural network serves as the interface of the learning framework for out-of-sample identification. To guarantee the feasibility of the learning framework, we also developed a two-stage strategy for training the deep neural network.

We have drawn three conclusions from the experiment. First, large deep neural networks perform poorly on the robot-written character identification task when labeled images are scarce, since they need more labeled images for model training. Second, some small traditional machine learning models, such as KNN and SVM, perform better than these networks because they need fewer labeled images for model training. Third, even when few labeled images are available, a high-performance deep neural network can still be trained by making efficient use of unlabeled images. More importantly, an experiment on robot-written character identification showed that the model trained under the new learning framework is superior to well-known models, including decision tree, k-nearest neighbors, support vector machine, kernel support vector machine, VGG16, Xception, ResNet, DenseNet, and EfficientNet.

Acknowledgment

This work was supported by the Fundamental Research Funds for the Central Universities (3132022138).
