Abstract
The classification of fresh tobacco leaves during picking plays an important role in the subsequent roasting. In this paper, a lightweight convolutional neural network is used to detect the maturity of tobacco leaves quickly. Fresh tobacco leaves in the dataset are divided into 3 categories by picking position, and each category is divided into 4 maturity levels, giving 12 classes of tobacco leaves with different maturity. To keep the model lightweight, the new network is built on MobileNetV2. Shortcut connections preserve shallow network information and suppress network degradation. On the tobacco leaf dataset we obtained, the improved network achieves superior performance, and compared with other classic networks its model size and number of operations are reduced.
Introduction
The classification of fresh tobacco leaves plays an important role in subsequent roasting. The maturity level of fresh tobacco leaves can be determined from features such as texture, color, shape, and size [1]. For a long time, tobacco leaves have been classified after picking mainly by local employees relying on their own experience. In many cases, these classification experts make errors because of long-term fatigue or other emotional influences [2]. This leads to inaccurate grading of fresh tobacco leaves and causes trouble for follow-up work.

Original picture display.
In recent years, machine learning methods have developed extremely rapidly, and researchers have proposed several of them for identifying the maturity of tobacco leaves. In previous studies, researchers extracted features of tobacco leaves (such as color) in advance, built feature vectors, and fed them into machine learning algorithms. Among these, the SVM algorithm is widely used to distinguish different types of objects. SVM is attractive for classification because it requires only a few training samples to obtain a hyperplane that separates different classes [19–21]. The support vector machine was originally used to solve binary classification problems, but multi-class problems are commonly encountered; a common practice is to convert a multi-class problem into multiple groups of binary classifications [22]. Another key issue is that the features of the training samples must be extracted by the researchers themselves, and the way the features are extracted greatly affects the accuracy of the final classification. After the release of the AlexNet network architecture in 2012 [3], convolutional neural networks became more and more widely used in image classification. Compared with traditional machine learning methods, convolutional neural networks can automatically extract features from training samples through multi-layer convolution. However, unlike SVM, convolutional neural networks require more data to iterate their parameters and fit the dataset. In this paper, we focus on constructing a lightweight convolutional neural network and improving the accuracy of the final image classification as much as possible. The main content of this article is arranged as follows. The second part reviews previous research and explains the direction for improvement. The third part introduces the source of the data and the preprocessing methods applied to the dataset.
The fourth part introduces the improvement method and the theoretical basis for the improved model structure. The fifth part proposes the improved structure and compares it with several other structures. The sixth part shows the model structure, and the seventh part compares the performance of each model. Conclusions are given in the eighth part. A discussion section at the end of the article explains some of the problems found in this research and possible future improvements.
Some researchers have previously studied the identification of tobacco leaf maturity. Because of the limitations of earlier hardware and of the industry's technology level, many of them focused on traditional machine learning methods. In 2014, Wang J proposed a method based on sparse coding and introduced an unsupervised learning method to detect tobacco leaf maturity [4]. Tian, KL established 4 methods to detect the maturity of tobacco leaves [5]. One of these methods uses visible/near-infrared spectroscopy to extract spectral features that distinguish tobacco leaves of different maturity levels. The other methods in that paper also need different equipment to extract features first. In 2016, Bin J et al. used random forest, SVM, and other methods to classify tobacco leaves by extracting the spectral information of different grades of tobacco leaves [6]. In these previous studies, researchers often used machine learning algorithms such as SVM. When detecting maturity with traditional machine learning methods, researchers must decide which features to extract, and this choice greatly influences the accuracy and the model size. With manual feature extraction, some key features may be lost during the extraction stage, or the model may become very large because too many redundant features are extracted. Some researchers have also used CNNs to classify tobacco leaves by maturity. Without shortcut structures, such networks are prone to degradation because they have too many layers. In addition, previous studies often collected images with specific equipment instead of the mobile phones, cameras, and other devices used by on-site pickers, which has a certain negative impact on the generalization ability of the resulting model.
Therefore, to solve the above problems and satisfy actual picking needs, this article makes targeted improvements based on MobileNetV2 to quickly identify the maturity of tobacco leaves.

HSV and LBP features.
The fresh tobacco used in this work comes from the tobacco company of Qujin City, Yunnan Province, China. The samples are YUNYAN87 leaves, with 1200 images in 12 categories. According to growth position, the samples are divided into top, middle, and low. According to maturity level, the leaves at each growth position are divided into ripe, under-ripe, fake ripe, and overripe. The data are therefore divided into 12 categories, each containing about 100 images. All the photos were collected by local staff of the company.
Data distribution and feature extraction
The dataset is divided into three parts: a training set, a validation set, and a test set, in the ratio 7:2:1. After the split, we performed data augmentation on each subset. The augmentation methods are translation, rotation at different angles, and affine transformation. After data augmentation, the training set contains 5040 images, the validation set contains 1440 images, and the test set contains 720 images. For the support vector machine method, we extract the LBP feature and the HSV feature.
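The image counts above can be checked with a few lines of arithmetic. The sketch below assumes that every subset is enlarged by the same factor; the factor of 6 is inferred from 5040 / 840 and is not stated in the text, and the helper name `split_counts` is hypothetical.

```python
# Counts implied by the text: 1200 images, a 7:2:1 split, then augmentation.
TOTAL_IMAGES = 1200
SPLIT_RATIO = {"train": 7, "validation": 2, "test": 1}
# Assumption: one common augmentation factor, inferred from 5040 / 840.
AUGMENTATION_FACTOR = 6


def split_counts(total, ratio):
    """Return the number of images per subset for an integer ratio."""
    parts = sum(ratio.values())
    return {name: total * share // parts for name, share in ratio.items()}


before = split_counts(TOTAL_IMAGES, SPLIT_RATIO)
after = {name: count * AUGMENTATION_FACTOR for name, count in before.items()}
print(before)
print(after)
```

Under this assumption, the augmented counts match the 5040, 1440, and 720 images reported for the three subsets.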

BN Processing.
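Batch normalization (BN) was used in the training process of the deep learning methods [7]. This mechanism makes the optimization landscape smoother [10]. The following shows the mathematical process of BN in layer $m$.
Suppose that the input of layer $m$ is a mini-batch $B = \{x_1, \ldots, x_n\}$.
Then the mean and variance are:
$$\mu_B = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma_B^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_B)^2$$
Calculate the normalized input value:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
Send the input value processed by BN, after scaling and shifting with the learnable parameters $\gamma$ and $\beta$, to the activation function:
$$y_i = \gamma \hat{x}_i + \beta$$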
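As a minimal sketch of the standard batch normalization forward pass from [7], the following code normalizes one feature over a mini-batch and then applies a scale and a shift. The function name `batch_norm`, the default values of `gamma`, `beta`, and `eps`, and the sample batch are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance, as used in the BN forward pass.
    var = sum((x - mean) ** 2 for x in batch) / n
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]


# Hypothetical mini-batch of four activations.
out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((y - out_mean) ** 2 for y in out) / len(out)
print(out_mean, out_var)
```

With `gamma = 1` and `beta = 0`, the output has a mean close to zero and a variance close to one, which is the property that keeps the inputs of the following layer in a stable range.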
Linear bottleneck
The framework of the network built in this work follows the linear bottleneck layer [8]. In the previous bottleneck architecture of MobileNetV1, a certain amount of information is lost when a low-dimensional input passes through the activation, whereas in higher-dimensional spaces the information can be preserved after the ReLU activation function [8]. To solve this problem, the input dimension is increased before the input enters the activation function.
Inverted residuals
The inverted residual structure is based on the linear bottleneck structure. Unlike a standard residual block, an inverted residual block first raises the dimension and then reduces it after the depthwise convolutional layer. The added shortcut branch effectively suppresses network degradation [14].
Depthwise separable convolution
In the field of image classification, the CNN (Convolutional Neural Network) is a commonly used network structure. On GPU computing devices with CUDA acceleration, standard convolution kernels can update their parameters quickly. On mobile devices, however, the weaker computing power cannot complete the derivation and calculation of many parameters within a short period. Therefore, when we designed the CNN structure, depthwise (DW) convolution [23] was selected to replace standard convolution. DW convolution itself cannot increase or decrease the number of feature maps. When the number of feature maps needs to change, pointwise (PW) convolution is added before and after the DW convolution to increase or reduce the dimensionality. The PW convolution is a standard convolution with a kernel size of 1×1 [23]. The operation process of DW convolution is shown below.
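Compared with standard convolution, DW convolution performs far fewer operations [23]. Consider a standard convolution layer that takes a $D_c \times D_c \times M$ feature map as input and produces a $D_c \times D_c \times N$ feature map with a kernel of size $D_k \times D_k$. Its computational cost is:
$$D_k \cdot D_k \cdot M \cdot N \cdot D_c \cdot D_c$$
Assuming that the size of the DW convolution kernel is $D_k \times D_k$, the computational cost of the DW convolution is:
$$D_k \cdot D_k \cdot M \cdot D_c \cdot D_c$$
The PW convolution layer consists of traditional convolution with a size of 1×1. The number of channels of the PW convolution layer is $N$, so the PW convolution has a computational cost of:
$$M \cdot N \cdot D_c \cdot D_c$$
The computational cost of DW+PW is:
$$D_k \cdot D_k \cdot M \cdot D_c \cdot D_c + M \cdot N \cdot D_c \cdot D_c$$
Then
$$\frac{D_k \cdot D_k \cdot M \cdot D_c \cdot D_c + M \cdot N \cdot D_c \cdot D_c}{D_k \cdot D_k \cdot M \cdot N \cdot D_c \cdot D_c} = \frac{1}{N} + \frac{1}{D_k^2}$$
The size of the DW convolution kernel is 3×3, so the computational cost of DW+PW is approximately 1/9 of the computational cost of standard convolution.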
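The cost comparison can be reproduced numerically. The sketch below counts multiply-add operations for a standard convolution and for a DW convolution followed by a PW convolution, following the cost model of [23]. The function names and the layer sizes (a 56×56 feature map, 64 input channels, 128 output channels) are hypothetical values chosen only for illustration.

```python
def standard_conv_cost(d_c, m, n, d_k):
    """Multiply-adds of a standard d_k x d_k convolution, M -> N channels."""
    return d_k * d_k * m * n * d_c * d_c


def depthwise_cost(d_c, m, d_k):
    """Multiply-adds of a d_k x d_k depthwise convolution over M channels."""
    return d_k * d_k * m * d_c * d_c


def pointwise_cost(d_c, m, n):
    """Multiply-adds of a 1 x 1 pointwise convolution, M -> N channels."""
    return m * n * d_c * d_c


# Hypothetical layer: 56 x 56 feature map, 64 -> 128 channels, 3 x 3 kernel.
d_c, m, n, d_k = 56, 64, 128, 3
standard = standard_conv_cost(d_c, m, n, d_k)
separable = depthwise_cost(d_c, m, d_k) + pointwise_cost(d_c, m, n)
ratio = separable / standard
print(standard, separable, ratio)
```

The ratio equals 1/N + 1/D_k², so for a 3×3 kernel it approaches 1/9 as the number of output channels N grows.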
Related work
Hypothesis
When we test MobileNetV2 on our test set, the accuracy eventually settles at around 91%. But when we delete some structures of the network, the accuracy increases. Therefore, we suspect that network degradation has occurred.
Solution
A deeper network is generally considered able to fit a shallower network, but in practice two obstacles have been found. The first is that the gradient may explode or vanish [11]. This problem has been alleviated by normalized initialization [9, 24] and intermediate normalization layers [7], which enable networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation [12]. The second is network degradation [13, 25]. The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize [14]. In a feedforward neural network, each layer of neurons is connected only to the neurons of the previous layer and not to the inputs of other layers. Here we first define an L-layer feedforward neural network $f(x; W)$. For this network, the input is $x = a_0$; for a middle layer $m$ ($1 < m < L$), the input is $z_m = H(a_{m-1})$, so the output is $a_m = A(z_m)$, where $A$ is the activation function. In the past, people guessed that a deep network can fit a shallow network because the extra layers of the deep network only need to learn the identity mapping $a_m = a_{m-1}$.
The residual structure adds a shortcut, so that each layer learns only a residual on top of its input:
$$a_m = A\big(a_{m-1} + H(a_{m-1})\big)$$
Suppose the activation function is the identity, $A(x) = x$, so we can get:
$$a_m = a_{m-1} + H(a_{m-1})$$
For layers $m$ and $n$ ($m < n$):
$$a_n = a_m + \sum_{i=m}^{n-1} H(a_i)$$
Therefore, this method directly expresses the high-level network by the low-level network through the residual structure, which largely solves the problem of network degradation [15].
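To illustrate how a residual structure expresses a high-level activation through a low-level one, the sketch below applies the recurrence a_{i+1} = a_i + H(a_i) with an identity activation and checks that the final activation equals the starting activation plus the sum of all residual branches in between. The scalar branch H(a) = w · a and the weight values are hypothetical stand-ins for real convolutional layers.

```python
def residual_branch(a, weight):
    """Toy residual branch H(a) = weight * a (stand-in for conv layers)."""
    return weight * a


def forward(a_start, weights):
    """Apply a_{i+1} = a_i + H(a_i) per weight; return every activation."""
    activations = [a_start]
    for w in weights:
        a = activations[-1]
        activations.append(a + residual_branch(a, w))
    return activations


weights = [0.5, -0.25, 0.1]  # hypothetical branch weights
acts = forward(2.0, weights)
a_m, a_n = acts[0], acts[-1]

# Unrolled form: the deep activation is the shallow one plus all residuals.
unrolled = a_m + sum(
    residual_branch(a, w) for a, w in zip(acts[:-1], weights)
)
print(a_n, unrolled)
```

Because the shallow activation appears as a direct additive term, its information reaches the deeper layer even when the residual branches contribute little.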
After changing the original structure of MobileNetV2, we found that the bottleneck structure with the output layer of 5 had a greater impact on the accuracy on the test set, so this layer was changed. Since the input and output dimensions and shapes must be the same when performing a shortcut connection, two improvement ideas are proposed. The first structure is shown in Fig. 5 after modification: the first residual block of the group uses stride = 1, so that the input and output satisfy the condition for addition. The second is to process the input branch, using a 1×1 convolution kernel to change the dimension and stride = 2 to change the shape.
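The shape condition behind the two ideas can be sketched with the usual convolution output-size formula. In the code below, the input shape, the channel counts, and the helper names `conv_output_size` and `can_add_shortcut` are hypothetical; they only illustrate why stride 1 permits a direct shortcut and why a 1×1 convolution with stride 2 on the shortcut branch restores matching shapes.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor form)."""
    return (size + 2 * padding - kernel) // stride + 1


def can_add_shortcut(shortcut_shape, output_shape):
    """Element-wise addition needs identical (channels, height, width)."""
    return shortcut_shape == output_shape


# Hypothetical bottleneck input: 64 channels, 28 x 28.
in_shape = (64, 28, 28)

# Idea 1: 3 x 3 DW convolution with stride 1 keeps the spatial size.
size_s1 = conv_output_size(28, kernel=3, stride=1, padding=1)
out_s1 = (64, size_s1, size_s1)

# Original block: stride 2 halves the size and the channels change.
size_s2 = conv_output_size(28, kernel=3, stride=2, padding=1)
out_s2 = (96, size_s2, size_s2)

# Idea 2: a 1 x 1 convolution with stride 2 on the shortcut branch
# changes both the channel count and the spatial size of the input.
size_proj = conv_output_size(28, kernel=1, stride=2, padding=0)
projected = (96, size_proj, size_proj)

print(can_add_shortcut(in_shape, out_s1))   # direct shortcut, stride 1
print(can_add_shortcut(in_shape, out_s2))   # direct shortcut, stride 2
print(can_add_shortcut(projected, out_s2))  # projected shortcut, stride 2
```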

The difference between MobileNetV1 and MobileNetV2.

Inverted Residuals.

How DW convolution works.

How standard convolution works.

Comparison of the improved structures.

HOG feature.
The table shows the structure of MobileNetV2, which has 21 layers. When S = 2, the first bottleneck of each group has no shortcut connection.
The table details the MTN1 structure. The stride of the DW convolution kernel in the fourth bottleneck is changed to 1, and the input and output of the layer are then connected by a shortcut. With this structure, the network has 21 layers.
Comparative results
Training settings
MobileNetV2, ResNet-34, VGG16Net, and the two improved networks (MTN1, MTN2) are used for comparison. Each model is trained for 100 epochs on the training set, and the validation set is used to check whether the model has overfitted. After that, the test set is used to obtain the classification accuracy of the trained model.
The table details the MTN2 structure. This structure is built by connecting the two major groups of bottlenecks with a shortcut. The connection allows the information of the fourth layer of the network to be transmitted directly to the input of the eighth layer, which reduces network degradation.
The dataset is divided into a training set, a validation set, and a test set, and all images are resized to a fixed resolution of 224×224. For the deep learning methods, to facilitate training and enhance the generalization performance of the model, we performed several pre-transformation operations on the images. For the training set, each image is randomly cropped and resized to 224×224, then randomly flipped horizontally, and the per-channel mean is subtracted to normalize it. For the validation and test sets, we only subtract the per-channel mean. For the support vector machine method, we extract the LBP feature and the HSV feature and use 3 kinds of input. The first uses the LBP feature alone, the second uses the HSV feature alone, and the third fuses the two features.
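The three kinds of SVM input can be sketched in plain Python. The code below computes a basic 8-neighbour LBP histogram and a per-channel HSV histogram, then concatenates them into a fused feature vector. The LBP variant, the number of histogram bins, the tiny sample image, and all function names are assumptions made for illustration; the paper does not specify how its LBP and HSV features were computed.

```python
import colorsys

# Clockwise 8-neighbour offsets starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(gray, row, col):
    """Basic LBP code of an interior pixel in a 2D grayscale list."""
    center = gray[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if gray[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(gray):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(gray) - 1):
        for c in range(1, len(gray[0]) - 1):
            hist[lbp_code(gray, r, c)] += 1
    return hist


def hsv_histogram(rgb_pixels, bins=8):
    """Concatenated H, S, and V histograms with `bins` bins per channel."""
    hist = [0] * (3 * bins)
    for r, g, b in rgb_pixels:
        hsv = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        for channel, value in enumerate(hsv):
            index = min(int(value * bins), bins - 1)
            hist[channel * bins + index] += 1
    return hist


# Hypothetical 3 x 3 grayscale patch and two RGB pixels.
gray = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]
pixels = [(255, 0, 0), (0, 255, 0)]

lbp_only = lbp_histogram(gray)        # input 1: LBP alone
hsv_only = hsv_histogram(pixels)      # input 2: HSV alone
fused = lbp_only + hsv_only           # input 3: both features fused
print(len(lbp_only), len(hsv_only), len(fused))
```

Each of the three vectors could then be passed to an SVM classifier; fusing by concatenation is one simple choice and may differ from the fusion used in the experiments.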
Comparison of MTN1, MTN2, and other classical network frameworks on the test set
Table5
Comparing the data in the table, we found that the classification accuracy of deep learning is much higher than that of traditional machine learning. Compared with MobileNetV2, MTN1, and MTN2, ResNet34 and VGG16 have far more parameters and require far more computation. The number of parameters in both MTN1 and MTN2 is 2.2 million. The number of operations measured in multiply-adds is 1.93 G for MTN1 and 626.36 M for MTN2. In the comparison table, MTN2 reaches an accuracy of 95.14%.
Conclusion
This paper proposes two improved models based on MobileNetV2 to quickly identify the maturity of tobacco leaves. We aim to reduce the size of the model and the number of calculations as much as possible while improving the accuracy. The results on the test set show that the improved network structures MTN1 and MTN2 achieve a balance between accuracy, model size, and computational complexity. Therefore, we believe that these models can be applied on mobile devices for rapid detection of tobacco leaf maturity.
Discussion
Our results show that, with LBP and HSV features, the accuracy of the support vector machine reached 81.04%. With these features, the SVM method has a lower classification accuracy than the deep learning methods, but it takes less time. On the test set, the linear SVM using the above features takes less than one second to predict all images. In contrast, MobileNetV2, which has the lowest computational load, takes about 18 seconds to classify the same images. We then extracted the HOG [28] feature of the images and fused it with the other two features as input for training. With the trained model, the accuracy on the test set reached 95%. However, the time spent on the test set exceeded 100 seconds, and the storage space occupied by the features of the test set alone reached 21 GBytes. In contrast, the storage space occupied by the HSV and LBP features is only a few MBytes. In addition, we also attempted to train some more complex convolutional neural networks for image classification, such as ResNet50 and ResNext50.
A high accuracy rate was achieved in the final experimental results, but we think that there may be some problems in actual use. In the data preparation stage, to enhance the generalization performance of the network model, we used data augmentation to increase the size of the dataset. However, data augmentation cannot completely simulate the actual picking environment. For example, during picking, the leaves cannot be laid out separately like the leaves in the dataset. Therefore, more research may be needed on actual use.
