A convolutional neural network model of multi-scale feature fusion: MFF-Net

Abstract

MFF-Net, a multi-scale feature fusion convolutional neural network, was designed to improve the recognition rate of handwritten digits. The low-level, middle-level and high-level features of the image are first extracted by convolution. The low-level and middle-level features are then processed further by separate convolutional layers, fused directly with the high-level features using learned weights, and passed to the fully connected layers. Adding a batch normalization layer before each activation layer and a dropout layer between the fully connected layers improves the accuracy and generalization capacity of the network. In addition, a dynamic learning rate algorithm was designed; experiments on the MNIST data set show that training with it improves the accuracy of the network. The accuracy reaches 99.66% after only 30 epochs of training. A comparison shows that the accuracy of the model is higher than that of the other models considered.

Keywords

Convolution, MNIST, activation layer, dynamic learning, dropout

1. Introduction

Handwritten digit recognition has a wide range of applications in the era of big data, especially in fiscal, taxation, financial and other fields that rely on handwritten records and can improve work efficiency through automatic recognition. It is also of considerable theoretical significance in the field of pattern recognition. Many scholars at home and abroad have proposed algorithms for handwritten digit recognition and achieved remarkable results. Zhong et al. [1] proposed a CNN-based multi-scale neural network algorithm, which passes multiple features produced by convolution and pooling operations directly to the fully connected layer and the Softmax classifier, and achieves an accuracy of 99.20%. Zhao et al. [2] recognized handwritten digit images by calculating the shortest distance between vectors, and achieved a recognition rate of more than 95%. Chen et al. [3] proposed a fusion convolutional neural network (F-CNN) model, which integrates low-level, middle-level and high-level features and expands the size of the high-level network layer to enhance the expressive ability of the model, achieving a recognition rate of 99.10%. Zeng et al. [4] combined an autoencoder with a convolutional neural network to design a deep convolutional autoencoder neural network, which achieves a recognition accuracy of 99.37%. Zhou et al. [5] designed an adaptive enhancement perceptron network model with a multilayer perceptron as the base classifier and an adaptive enhancement algorithm as the integration strategy, which achieves an accuracy of 94.21%. Although these network models provide promising recognition rates, there is still room for improvement.
In this study, we designed a multi-scale feature fusion convolutional neural network. Trained with a dynamic learning rate algorithm, it achieves an accuracy of 99.66%, which is 5.45 percentage points higher than that reported in Reference [5].

2. MNIST data set

The MNIST data set is widely used in the field of handwritten character recognition and originates from the National Institute of Standards and Technology (NIST) [6]. The training set contains handwritten digits from 250 different people, of whom 50% are high school students and the rest are from the Census Bureau. The test set is drawn from the same sources as the training set. The data set has a total of 70,000 samples, comprising 60,000 training samples and 10,000 test samples [7, 8, 9]. Each sample consists of a digit image and a label, and each image is 28 by 28 pixels. Figure 1 shows 24 different forms of the digit 6 randomly selected from this data set.

Figure 1.

Different forms of number six.

3. Algorithm design

3.1 MFF-Net network model

PyTorch encapsulates the MNIST data set. Each sample consists of an image stored as a three-dimensional tensor of shape (1, 28, 28) and a label, and the data set can be downloaded directly with Downloader.

The MFF-Net (Multi-scale Feature Fusion Net) network adds a batch normalization (BN) layer before each activation layer to speed up convergence and mitigate vanishing gradients. The batch normalization calculation is given by Eqs. (1)-(5). In Eq. (1), $B$ is the current batch and $n$ is the number of samples in it; ${u}_{B}$ in Eq. (2) is the batch mean, and ${\sigma}_{B}^{2}$ in Eq. (3) is the batch variance. In Eq. (4), the normalized value $\hat{x}_{i}$ has zero mean and unit variance over the batch, and $\varepsilon$ is a small positive number that prevents the denominator from being 0. The parameters $\alpha$ and $\beta$ in Eq. (5) are learned, so that any negative effect of the normalization can be offset.

$\displaystyle B=\{x_{1},\cdots,x_{n}\}$ (1)

$\displaystyle u_{B}=\frac{1}{n}\sum\limits_{i=1}^{n}x_{i}$ (2)

$\displaystyle{\sigma}_{B}^{2}=\frac{1}{n}\sum\limits_{i=1}^{n}(x_{i}-u_{B})^{2}$ (3)

$\displaystyle\hat{x}_{i}=\frac{x_{i}-u_{B}}{\sqrt{\sigma_{B}^{2}+\varepsilon}}$ (4)

$\displaystyle{BN}_{\alpha,\beta}(x_{i})=\alpha\hat{x}_{i}+\beta$ (5)

The output of the batch normalization layer is passed directly to the ReLU activation function, which is defined in Eq. (6).

$\displaystyle\text{ReLU}(x)=\begin{cases}x, & x\geqslant 0\\ 0, & x<0\end{cases}$ (6)
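As an illustration of Eqs. (1)-(6), the following minimal sketch normalizes a small batch of scalars and then applies ReLU. It is not the authors' implementation: the function names, the toy batch, the value of `eps`, and the initial values `alpha=1.0` and `beta=0.0` are assumptions made for the example.

```python
import math

def batch_norm(batch, alpha=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars following Eqs. (1)-(5).

    alpha, beta and eps are assumed example values, not taken from the paper.
    """
    n = len(batch)                                   # batch size n, Eq. (1)
    u_b = sum(batch) / n                             # batch mean, Eq. (2)
    var_b = sum((x - u_b) ** 2 for x in batch) / n   # batch variance, Eq. (3)
    x_hat = [(x - u_b) / math.sqrt(var_b + eps) for x in batch]  # Eq. (4)
    return [alpha * xh + beta for xh in x_hat]       # scale and shift, Eq. (5)

def relu(x):
    """Eq. (6): keep non-negative values, replace negatives with 0."""
    return x if x >= 0 else 0.0

batch = [1.0, 2.0, 3.0, 4.0]           # toy batch, hypothetical values
normalized = batch_norm(batch)         # zero mean, roughly unit variance
activated = [relu(v) for v in normalized]
```

In this toy batch, the two values below the batch mean become negative after normalization and are therefore set to 0 by ReLU, while the two values above the mean pass through unchanged.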

Figure 2.

MFF-Net network model structure.

The MFF-Net network model has nine layers in total, and its structure is shown in Fig. 2. The first layer is the convolutional layer Conv1, which uses a 3 by 3 convolution kernel with a stride of 1 to extract the low-level features of the image. The second layer is the convolutional layer Conv2, which uses a 3 by 3 kernel with a stride of 1 to extract the middle-level features. The third layer is the maximum pooling layer Max-Pool1, with a 2 by 2 pooling window and a stride of 2, which down-samples the image and reduces its dimensionality. The fourth layer is the convolutional layer Conv3, which uses a 3 by 3 kernel with a stride of 1 to extract the high-level features of the image. The fifth layer is the convolutional layer Conv4, also with a 3 by 3 kernel and a stride of 1, which further extracts the high-level features. The sixth layer is the pooling layer Max-Pool2, which further down-samples the image. The seventh layer is the fully connected layer Fc1 with 2048 neurons, the eighth layer is the fully connected layer Fc2 with 1024 neurons, and the ninth layer is the output layer Fc3 with 10 neurons.

The convolutional layers Conv1f and Conv2f further extract the low-level and middle-level features of the image and send them directly to the fusion layer with learned weights, which helps prevent the rapid loss of image features and mitigates vanishing gradients. The features are fused according to Eq. (7), where $\textit{Out}_{1f}$ is the output of the Conv1f layer, $\textit{Out}_{2f}$ is the output of the Conv2f layer, $\textit{Out}_{\max-\textit{pool}2}$ is the output of the Max-Pool2 pooling layer, and $\textit{Out}$ is the output after fusion. ${\alpha}$ and ${\beta}$ are weights learned by the model and ${\gamma}$ is the bias term. These parameters adjust the contribution of the low-level and middle-level features in the fusion and minimize any negative impact of these features on the model. When ${\alpha}$, ${\beta}$ and ${\gamma}$ are all 0, MFF-Net degrades to an ordinary convolutional neural network.

$\displaystyle\textit{Out}=\textit{Out}_{\max-\textit{pool}2}+\alpha\cdot\textit{Out}_{1f}+\beta\cdot\textit{Out}_{2f}+\gamma$ (7)
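The fusion in Eq. (7) can be sketched as an element-wise weighted sum. The example below assumes that the three outputs have already been flattened to vectors of equal length; the function name `fuse` and the toy feature values are hypothetical, and in the real network $\alpha$, $\beta$ and $\gamma$ would be learned rather than passed in by hand.

```python
def fuse(out_pool2, out_1f, out_2f, alpha, beta, gamma):
    """Element-wise weighted fusion of three equal-length vectors, Eq. (7)."""
    if not (len(out_pool2) == len(out_1f) == len(out_2f)):
        raise ValueError("feature vectors must have the same length")
    return [p + alpha * a + beta * b + gamma
            for p, a, b in zip(out_pool2, out_1f, out_2f)]

high = [0.2, 0.4, 0.6]   # toy stand-in for the Max-Pool2 output
low = [1.0, 0.0, 2.0]    # toy stand-in for the Conv1f output
mid = [0.0, 2.0, 1.0]    # toy stand-in for the Conv2f output

# Weighted fusion with example weights alpha = beta = 0.5 and gamma = 0.
fused = fuse(high, low, mid, alpha=0.5, beta=0.5, gamma=0.0)

# With all three parameters at 0 only the high-level features remain,
# which corresponds to the ordinary CNN case described in the text.
plain = fuse(high, low, mid, alpha=0.0, beta=0.0, gamma=0.0)
```

Keeping the high-level term unweighted means that the learned parameters only control how much of the low-level and middle-level information is mixed in.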

In addition, a Dropout layer with a drop probability of 0.5 is added between consecutive fully connected layers (between Fc1 and Fc2, and between Fc2 and Fc3) to further improve the generalization ability and recognition accuracy of the network model.
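To make the effect of the Dropout layer concrete, the sketch below implements the common "inverted" variant, in which surviving activations are rescaled by 1/(1 - p) during training and the layer is the identity at test time. The paper does not state which variant is used, so this choice, the function name and the toy input are assumptions.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest.

    Assumed variant for illustration; not confirmed by the paper.
    """
    if not training or p == 0.0:
        return list(values)            # identity at test time
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]

rng = random.Random(0)                 # fixed seed so the run is repeatable
hidden = [1.0] * 8                     # toy activations of a fully connected layer
train_out = dropout(hidden, p=0.5, training=True, rng=rng)
eval_out = dropout(hidden, p=0.5, training=False)
```

With p = 0.5, each training-time output is either 0 or twice the input, so the expected value of every activation is unchanged.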

3.2 Loss function

The loss function is the cross-entropy loss. Cross entropy measures the difference between two probability distributions of the same random variable; in machine learning, it expresses the difference between the true probability distribution and the predicted probability distribution [10, 11]. The smaller the cross entropy, the smaller the difference between the two distributions and the better the model's predictions. When the loss is computed with cross entropy and minimized by gradient descent, the resulting optimization problem has good convergence characteristics. The mathematical expression of cross entropy is given in Eq. (8), where $p$ represents the true probability and $q$ represents the predicted probability.

$\displaystyle{H}(p,q)=-\sum\limits_{i=1}^{m}p(x_{i})\log(q(x_{i}))$ (8)
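A small worked example of Eq. (8) is given below. It assumes a one-hot true distribution over three classes and two hand-picked predicted distributions; these values, the function name and the small constant `eps` (added to avoid taking the logarithm of 0) are illustrative assumptions.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Eq. (8): H(p, q) = -sum_i p(x_i) * log(q(x_i))."""
    # Terms with p(x_i) = 0 contribute nothing and are skipped.
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q) if pi > 0)

p = [0.0, 0.0, 1.0]              # one-hot true label: class 2
q_good = [0.05, 0.05, 0.90]      # prediction close to the true distribution
q_bad = [0.60, 0.30, 0.10]       # prediction far from the true distribution

loss_good = cross_entropy(p, q_good)   # equals -log(0.90), a small loss
loss_bad = cross_entropy(p, q_bad)     # equals -log(0.10), a large loss
```

The prediction that puts more probability on the true class gives the smaller loss, which matches the statement that a smaller cross entropy indicates a better prediction.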

Figure 3.

Flow chart of dynamic learning rate algorithm.

3.3 Dynamic learning rate algorithm

To improve the accuracy of the model, we designed a dynamic learning rate algorithm. Experiments show that training with this algorithm effectively improves the accuracy of the model. The flow chart of the algorithm is shown in Fig. 3, where the accuracy refers to the average accuracy on the test set after an epoch. The algorithm proceeds as follows:

(1) Initialize lr to 0.1.
(2) Check whether the accuracy has decreased twice in a row. If yes, go to step 3; if no, continue training with the current lr.
(3) Set lr $=$ lr/2. If training continues, go to step 2; otherwise, training ends.
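The three steps above can be sketched as a small scheduling function. This is one possible reading of the algorithm, not the authors' code: it assumes that "decreases twice" means two consecutive epoch-to-epoch drops in test accuracy and that the count restarts after each halving. The function name and the accuracy values are hypothetical.

```python
def dynamic_lr_schedule(accuracies, lr=0.1):
    """Return the learning rate used in each epoch.

    accuracies[i] is the test-set accuracy measured after epoch i.
    """
    lrs = []
    drops = 0                     # consecutive decreases seen so far
    prev = None
    for acc in accuracies:
        lrs.append(lr)            # learning rate in effect for this epoch
        if prev is not None and acc < prev:
            drops += 1
        else:
            drops = 0
        if drops == 2:            # step 2: accuracy fell twice in a row
            lr /= 2               # step 3: halve the learning rate
            drops = 0             # assumption: restart the count
        prev = acc
    return lrs

accs = [0.990, 0.992, 0.991, 0.989, 0.993, 0.994]   # toy accuracy trace
lrs = dynamic_lr_schedule(accs)
```

In this trace the accuracy falls after the third and fourth epochs, so the learning rate is halved from 0.1 to 0.05 starting with the fifth epoch.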

Numerous experiments and comparisons show that the learning rate adjustment algorithm in this paper achieves better performance, as shown in Fig. 4.

Figure 4.
Learning rate adjustment method.

4. Experiment and analysis

4.1 Experimental environment

The experiments were run on a computer with an Intel i5-10600KF CPU (4.1 GHz main frequency) and a GALAXY NVIDIA 3060 graphics card. The software environment consists of the 64-bit Ubuntu 20.04 operating system, the PyTorch 1.8 deep learning framework, the CUDA 11.2 parallel computing library and Python 3.8.

4.2 MFF-Net network model parameters

During training, the order of the data is shuffled for each batch. After each epoch of training is completed, a prediction is made on the test set to obtain the test-set accuracy. The dynamic learning rate algorithm adjusts the learning rate of the model according to the changes in this accuracy. The network model parameters are shown in Table 1.

Table 1
MFF-Net network model parameter configuration table

Name of layer	Input image size	Output image size	Kernel size
Conv1	28 $\times$ 28@1	26 $\times$ 26@16	3 $\times$ 3@16
Conv2	26 $\times$ 26@16	24 $\times$ 24@32	3 $\times$ 3@32
Max-Pool1	24 $\times$ 24@32	12 $\times$ 12@32	2 $\times$ 2
Conv3	12 $\times$ 12@32	10 $\times$ 10@64	3 $\times$ 3@64
Conv4	10 $\times$ 10@64	8 $\times$ 8@128	3 $\times$ 3@128
Max-Pool2	8 $\times$ 8@128	4 $\times$ 4@128	2 $\times$ 2
Conv1f	4 $\times$ 4@128	16 $\times$ 16@8	3 $\times$ 3@8
Conv2f	16 $\times$ 16@8	8 $\times$ 8@32	3 $\times$ 3@8
Fc1	8 $\times$ 8@32	2048
Fc2	2048	1024
Fc3	1024	10
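The output sizes of the main path in Table 1 (Conv1 through Max-Pool2) can be checked with the standard output-size formula for convolution and pooling, floor((in + 2 * padding - kernel) / stride) + 1. The sketch below assumes zero padding throughout and a stride of 2 for Max-Pool2; the helper name `out_size` is hypothetical. The Conv1f and Conv2f branches are not included.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, output channels) for the main path of Table 1.
# Pooling layers keep the channel count, marked here with None.
layers = [
    ("Conv1", 3, 1, 16),
    ("Conv2", 3, 1, 32),
    ("Max-Pool1", 2, 2, None),
    ("Conv3", 3, 1, 64),
    ("Conv4", 3, 1, 128),
    ("Max-Pool2", 2, 2, None),    # stride 2 is an assumption
]

size, channels = 28, 1            # MNIST input: 28 x 28, one channel
shapes = {}
for name, kernel, stride, out_channels in layers:
    size = out_size(size, kernel, stride)
    channels = out_channels or channels
    shapes[name] = (size, size, channels)

flattened = size * size * channels   # values produced by Max-Pool2
```

Under these assumptions the computed sizes match the table row by row, and the flattened Max-Pool2 output has 4 * 4 * 128 = 2048 values, the same as the number of neurons in Fc1.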

4.3 Algorithm comparison and analysis

4.3.1 Comparison of the two training methods

The model took 539.24 seconds to complete 30 epochs of training and reached a final accuracy of 0.9966 on the test set. The initial learning rate lr was set to 0.1, the Dropout probability to 0.5, ${\alpha}$ and ${\beta}$ were initialized to 0.5, ${\gamma}$ to 0, and the batch size to 16.

Figure 5.

The accuracy of the two training methods on the training set.

Figure 6.

The loss of the two training methods on the training set.

In the experiment, the network model was trained and tested with both a fixed learning rate and the automatic (dynamic) learning rate. The results show that the performance of the network improves when the automatic learning rate is used for training. Figure 5 shows the prediction accuracy on the training set as a function of the epoch. From the 1st to the 14th epoch, the accuracy of the two training methods is basically the same; from the 15th to the 30th epoch, the training accuracy with the automatic learning rate is higher than that with the fixed learning rate. Figure 6 shows the loss of the two training methods on the training set. Again, the loss of the two methods is basically the same from the 1st to the 14th epoch, and from the 15th to the 30th epoch the training loss with the automatic learning rate is lower than that with the fixed learning rate. Figure 7 shows the accuracy of the two training methods on the test set. With the automatic learning rate, the accuracy fluctuates noticeably at the beginning, but as training proceeds and the learning rate is adjusted, the oscillation settles and the accuracy improves further. With the fixed learning rate, the accuracy oscillates throughout training and is worse than that obtained with the automatic learning rate.

Table 2

The accuracy of the two training methods on the test set

Name	Average accuracy	Max accuracy	Final accuracy
Fixed lr	0.9937	0.9953	0.9941
Auto lr	0.9952	0.9968	0.9966

Table 3

Recognition accuracy of different algorithms

Algorithms	Data set	Accuracy
Reference [1]	MNIST	99.2%
Reference [2]	MNIST	95.0%
Reference [3]	MNIST	99.1%
Reference [4]	MNIST	99.37%
Reference [5]	MNIST	94.21%
Reference [12]	MNIST	99.34%
Reference [13]	MNIST	99.0%
Reference [14]	MNIST	98.5%
In this paper	MNIST	99.66%

Figure 7.

The accuracy of the two training methods on the test set.

Figure 8.

Comparison of accuracy of the two training methods on the test set.

Table 2 lists the average accuracy, the maximum accuracy and the final accuracy over 30 epochs obtained by the two training methods on the test set. The data in the table and Fig. 8 show that training with the automatic learning rate outperforms training with the fixed learning rate on all three measures (for example, a final accuracy of 0.9966 versus 0.9941).

4.3.2 Comparison with existing algorithms

In this study, we compare the accuracy of the MFF-Net network model with that of other algorithms. Table 3 shows that among the other eight algorithms, the algorithm in Reference [4] has the highest accuracy, 99.37%, and the algorithm in Reference [5] has the lowest accuracy, 94.21%. The accuracy of MFF-Net is 99.66%, which is 5.45 percentage points higher than that of Reference [5] and 0.29 percentage points higher than that of Reference [4]. These results indicate that MFF-Net can effectively improve the accuracy of the network by fusing multiple features of the image.
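The percentage-point differences can be recomputed directly from Table 3. The short sketch below copies the accuracies from the table and finds the best and worst of the other eight algorithms; the variable names are chosen only for this example.

```python
# Accuracies in percent, copied from Table 3 (keys are reference numbers).
table3 = {
    "[1]": 99.2, "[2]": 95.0, "[3]": 99.1, "[4]": 99.37,
    "[5]": 94.21, "[12]": 99.34, "[13]": 99.0, "[14]": 98.5,
}
mff_net = 99.66

best = max(table3, key=table3.get)     # reference with the highest accuracy
worst = min(table3, key=table3.get)    # reference with the lowest accuracy

# Differences in percentage points, rounded to two decimals.
gain_over_best = round(mff_net - table3[best], 2)
gain_over_worst = round(mff_net - table3[worst], 2)
```

The best of the other algorithms is Reference [4] and the worst is Reference [5], which gives gains of 0.29 and 5.45 percentage points respectively.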

5. Conclusion

In this paper, we designed a multi-scale feature fusion method that fuses the low-level, middle-level and high-level features of the image using weights learned automatically by the network. A batch normalization layer is added before each activation layer and a dropout layer is added between the fully connected layers, which speeds up the convergence of the network and improves its generalization ability. In addition, an automatic learning rate algorithm is designed for training, which adjusts the learning rate dynamically according to the results on the test data set; training with it improves the recognition accuracy of the network relatively smoothly. The experimental results show that the network model reaches a recognition rate of 99.66%, which is 0.29 to 5.45 percentage points higher than the algorithms compared in Table 3. Because only 30 epochs were trained, there is still room for improvement in accuracy. With this model, automatic recognition could be realized in fiscal, taxation, financial and other fields that rely on handwritten records, saving time and improving work efficiency. Future work will explore combining MFF-Net with Google Inception models, which can extract features better, so greater recognition performance can be anticipated.

Acknowledgments

This work was partly supported by the General project of Anhui Natural Science Foundation (No. 1808085MF171).

References

1. Zhong, Xie, Liu. Application of multi-scale features based on CNN in handwritten digit recognition. J Mianyang Norm Univ. 2019; 38(11): 22-26.

2. Zhao, Liu, Yan. Research on handwritten digit recognition based on KNN algorithm. J Chengdu Univ (Nat Sci Ed). 2017; 36(4): 382-384.

3. Chen, Zhu, Wang. Handwritten digit recognition based on fusion convolutional neural network model. Comput Eng. 2017; 43(11): 187-192.

4. Zeng, Meng, Guo. Research on handwritten digit recognition based on deep convolutional autoencoding neural network. Comput Appl Res. 2020; 37(4): 1239-1243.

5. Zhou, Fan, Hai. Handwritten digit recognition based on deep ensemble learning. J Shaanxi Univ Technol (Nat Sci Ed). 2020; 36(3): 47-53.

6. Real-Time Handwritten Digit Recognition [Internet]. Available from: https://blog.csdn.net/wlx19970505/article/details/80267284.

7. Alvear-Sandoval, Sancho-Gomez, Figueiras-Vidal. On improving CNNs performance: The case of MNIST. Inf Fusion. 2019; 52: 106-109.

8. Palvanov, Cho. Comparisons of deep learning algorithms for MNIST in real-time environment. Int J Fuzzy Logic Intell Syst. 2018; 18(2): 126-134.

9. Baldominos, Saez, Isasi. A survey of handwritten character recognition with MNIST and EMNIST. Appl Sci-Basel. 2019; 9(15): 3169.

10. Teow MYW. A minimal convolutional neural network for handwritten digit recognition. In: Proc of the 7th IEEE International Conference on System Engineering and Technology. Piscataway, NJ: IEEE Press; 2017. pp. 171-176.

11. Shopon, Mohammed, Abedin. Image augmentation by blocky artifact in deep convolutional neural network for handwritten digit recognition. In: Proc of IEEE International Conference on Imaging, Vision & Pattern Recognition. Piscataway, NJ: IEEE Press; 2017. pp. 1-6.

12. Wang, Duan, Xue. Handwritten digit recognition under the background of edge intelligence. Comput Appl. 2019; 39(12): 3548-3555.

13.

Tabik