Abstract
Owing to epidemic prevention and control, the department that supervises the floating population classifies and manages it by industry, and there are many personnel control points. When a computer-aided management system is used outdoors, the environment is complex and the data are noisy, so outdoor scene recognition must be more accurate. This paper proposes a convolutional neural network with adaptive weights. The method builds the feature fusion strategy into the network and obtains the optimal feature weights by training the network. In addition, this paper uses multiple binary classifiers instead of a single multi-class classifier to achieve accurate target classification. Experiments show that the proposed method performs very well in detecting similar objects. The strategy of replacing the multi-class network with multiple binary classification networks improves the accuracy and recall of target detection in a known environment.
Introduction
With the development of target detection methods, target recognition accuracy and recall have improved, and the rapid development of the graphics processing unit (GPU) has brought detection closer to real time [1, 2]. However, upgrading the network structure has had limited effect on recall. In complex urban environments, where targets often overlap, the problem of low recall has not been fundamentally solved, and target detection still has considerable room for improvement. In many scenarios, for example, people want to detect all targets, that is, a recall close to 100%, which current detection methods struggle to achieve [3]. In streetscape recognition, traffic safety and automatic driving, target detection has been applied to predict potential danger: the system obtains information around the vehicle through cameras and sensors, detects the surrounding targets, and gives the driver timely lane departure and collision warnings. However, this approach can only play an auxiliary early-warning role because the missed detection rate is still very high, and on complex road sections in particular there is still a gap between current target detection and what automatic driving at the level of a human driver requires [4, 5]. In outdoor design scene recognition, deep learning algorithms make it easy to find and locate outdoor sites and to identify the type of outdoor design scene with high accuracy. However, in a complex outdoor environment even a slight detection deviation has a great impact on the design, which is unfair to the designer. In addition, target detection methods focus mainly on target recognition, and target localization is only an auxiliary step [6].
Used alone, a target detection method yields only the approximate position of the target in the image. For real-world applications it is also very important to obtain the three-dimensional coordinates of the target in the world coordinate system [7, 8]. Building on deep learning, this paper introduces a binocular vision system to calculate the position coordinates of the target, and uses prior knowledge as a constraint to further improve the accuracy and recall of target detection in outdoor design scene recognition.
SENET network
In recent years, convolutional neural networks have made great breakthroughs in many fields. However, learning a network with very strong classification ability remains difficult, for many reasons. As research on neural networks has deepened, more and more methods have been proposed to improve network performance. One is the Inception network, which adds multi-scale information to the network structure and combines features from multiple receptive fields to improve the network's recognition ability. The Inside-Outside Network analyzes spatial context information and adds an attention mechanism in the spatial dimension. The Squeeze-and-Excitation network (hereinafter referred to as SENET) proposed by Hu Jie and others considers the relationship between feature channels [9] and greatly improves classification accuracy.
In the structure of SENET, squeeze and excitation are the two key operations, and the network is named after them. The aim of SENET is to give more important features greater weight and to reduce the weight of secondary features. The network does not introduce a new spatial dimension to fuse feature channels; instead it adopts a new strategy of recalibrating the weights of the feature maps [10, 11]. Through iterative training, the network gradually learns the importance of each feature map, and then uses this importance (i.e., the feature map weight) to promote useful features and suppress features that are less useful for the current task.
Figure 1 is a schematic diagram of the SENET network module. Suppose the input is x with c1 feature channels. After a series of convolution, pooling and other transformations, the number of feature channels becomes c2. Unlike a traditional convolutional neural network, the module then recalibrates the features in three steps. The first is the squeeze (compression) operation, which turns each two-dimensional feature channel into a single real number. This number has, to some extent, a global receptive field, and the output dimension matches the number of input feature channels. It represents the global distribution of the response on the feature channel and gives layers near the input access to a global receptive field, which is very useful in many tasks. Next is the excitation operation, which is similar to the gating mechanism in recurrent neural networks: a weight is generated for each feature channel through the parameter ω. The last step is reweighting. The weights output by the excitation are regarded as the importance of each feature channel after feature selection [12], and the feature maps are recalibrated by multiplying each previous feature channel by its weight.

SENET network module.
The implementation uses global average pooling as the squeeze operation, as shown in Fig. 2. The left panel shows a conventional convolutional neural network: all feature maps are first concatenated into a one-dimensional vector, which then passes through a fully connected layer and softmax classification. The right panel shows global average pooling: each feature map of the last layer is pooled into the mean of the whole map, giving one feature point per map. These feature points form the final feature vector, which greatly reduces the number of parameters.

Convolution neural network and global mean pooling.
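The parameter saving from global average pooling can be checked with simple arithmetic. The sketch below is illustrative only: the feature map size (512 channels of 7 × 7) and the 10 output classes are assumed values, and `fc_params` is a hypothetical helper, not something taken from the paper.

```python
def fc_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Assumed final feature maps: 512 channels of 7x7, classified into 10 classes.
channels, height, width, classes = 512, 7, 7, 10

# Conventional head: flatten every feature map, then one fully connected layer.
flatten_head = fc_params(channels * height * width, classes)

# Global average pooling head: one mean per channel, then the same layer.
gap_head = fc_params(channels, classes)

print(flatten_head, gap_head)
```

Under these assumptions the pooled head needs 49 times fewer weights in the final layer, because each 7 × 7 map contributes one input instead of 49.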
After the global average pooling layer, two fully connected layers model the correlation between channels and output one weight per input feature channel. The first layer reduces the feature dimension to 1/16 of the input; after ReLU activation, a second fully connected layer restores it to the original dimension. Compared with a single fully connected layer, this design has two advantages: the network has more non-linearity and can better fit the complex correlation between channels, and the dimension reduction greatly reduces the amount of computation. The sigmoid function then maps each weight to the range 0 to 1. Finally, each feature map is multiplied by its weight coefficient, so that the weights are applied to the features of each channel.
In this paper, SENET is embedded into ResNet. The structure of the network module is shown in Fig. 3:
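The squeeze, excitation and reweighting steps described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the random weights and the toy tensor size (16 channels of 4 × 4, so a 1/16 reduction leaves a single bottleneck unit) are all assumptions.

```python
import math
import random

def squeeze(feature_maps):
    """Global average pooling: one real number per channel."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0])) for fm in feature_maps]

def dense(vector, weights, biases):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]

def excite(z, w1, b1, w2, b2):
    """Bottleneck FC -> ReLU -> FC -> sigmoid: one weight in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]

def scale(feature_maps, channel_weights):
    """Multiply every value of a channel by that channel's weight."""
    return [[[v * w for v in row] for row in fm] for fm, w in zip(feature_maps, channel_weights)]

def se_block(feature_maps, w1, b1, w2, b2):
    return scale(feature_maps, excite(squeeze(feature_maps), w1, b1, w2, b2))

# Toy input: C channels of H x W, reduction ratio R (assumed values).
random.seed(0)
C, H, W, R = 16, 4, 4, 16
x = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // R)]
b1 = [0.0] * (C // R)
w2 = [[random.uniform(-1, 1) for _ in range(C // R)] for _ in range(C)]
b2 = [0.0] * C

weights = excite(squeeze(x), w1, b1, w2, b2)
y = se_block(x, w1, b1, w2, b2)
```

In a trained network the two weight matrices would be learned together with the convolution layers; here they are random, so only the shapes and value ranges are meaningful.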

SE-ResNet network module.
The SE module is connected after the ResNet convolution module. Global mean pooling, fully connected layers and the sigmoid function produce the weight vector of the convolution feature maps, and the identity mapping bypasses both the ResNet convolution module and the SENET module.
Convolution features are highly robust. In target classification, the feature map generated by the last convolution layer of the network is usually used. Compared with other convolution layers, the feature map of this layer is more abstract and higher level, and gives better classification results. However, the more abstract the features are, the more image detail is lost. When a convolutional neural network is used to distinguish objects of similar categories, the classification results are sometimes unsatisfactory. Fig. 4 shows two kinds of bottles to be distinguished. Both are water bottles in terms of overall category, but their specific categories differ.

Two different types of water bottles.
Features are extracted with the ResNet-50 convolutional neural network; the feature maps of one of its convolution layers for this picture are shown in Fig. 5.

Feature map extracted by convolution neural network.
In the deep feature maps, there are many maps in which the difference between the two water bottles is very small, so classification using these features has a relatively low success rate. The HOG feature, in contrast, is a global gradient feature and is very sensitive to gradient changes of pixels in the image. Fig. 6 shows a HOG feature map.

HOG feature map.
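To illustrate why HOG reacts strongly to pixel gradients, the sketch below computes the orientation histogram of a single cell. It is a simplified version under stated assumptions: central-difference gradients, nine unsigned orientation bins, no bin interpolation and no block normalization. The function name `hog_cell_histogram` is hypothetical.

```python
import math

def hog_cell_histogram(cell, bins=9):
    """Unsigned-orientation gradient histogram for one grayscale cell (list of rows)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin with its gradient magnitude.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

# A vertical edge: dark left half, bright right half.
cell = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
hist = hog_cell_histogram(cell)
```

For this edge all of the gradient energy falls into the first bin (horizontal gradient direction), which is the kind of sharp, shape-dependent response that an abstract deep feature map may smooth away.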
If we can make use of the advantages of different features separately and fuse them, it will be helpful to improve the success rate of classification.
The flow of the classifier decision-level fusion algorithm is shown in Fig. 7. For the two types of features extracted from the input image, CNN and HOG, two SVM classifiers are built to classify the CNN and HOG features respectively, and the output of each classifier is then fitted to the posterior probability of the target category with a sigmoid function. The posterior probability matrix is given in Equation (1).

Flow chart of classifier decision-level fusion strategy algorithm.
Here p(cm|xn) is the posterior probability that the n-th sample x is identified as class cm, m is the number of target classes, n is the number of test samples, and prob_estimates(k) is the posterior probability matrix of the k-th classifier (k = 1, 2). Fusing information with decision weights gives better classification performance than a single classifier. The decision information after decision-level fusion is given in Equation (2).
Here Ek denotes the k-th classifier, p(cj|xi, Ek) is the posterior probability that xi is recognized as class cj by the k-th classifier, wk is the weight assigned to the k-th classifier, and p(cj|xi, E1, ..., Ek) is the final weighted probability that xi is recognized as class cj over all classifiers. Finally, the weights are determined by normalization: the correct recognition rate of each classifier is normalized to give that classifier's weight, that is, its voting power. The weight is given in Equation (3).
Here wk is the weight given to the k-th classifier, Ak is the correct recognition rate of the k-th classifier, and n is the total number of classifiers.
In this method, convolution features and HOG features are extracted and classified separately, and the posterior probabilities output by the two classifiers are combined using the classifier weights to obtain the final classifier. The method only combines different classifiers and merges the information of the two classifications. It is an ensemble learning strategy and can improve the success rate of target recognition in some scenarios. However, because the feature extraction strategy is unchanged, the method cannot fundamentally improve accuracy.
Figure 8 shows a feature map fusion strategy. The input image is fed into a convolutional neural network for feature extraction; feature maps of some layers are selected, HOG features are extracted from them and added to the feature maps of the same channel, and the resulting feature vectors are finally sent to a support vector machine for classification.
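Written out from the symbol definitions above, Equations (1) to (3) plausibly take the following form. This is a reconstruction from the surrounding text, not a copy of the original typesetting. The row and column layout of the matrix (rows as samples, columns as classes) and the symbol K for the number of classifiers in (2) are assumptions; (3) keeps n for the number of classifiers, as in its definition.

```latex
% (1) posterior probability matrix of classifier k, one row per test sample
\mathrm{prob\_estimates}(k) =
\begin{bmatrix}
p(c_1 \mid x_1) & \cdots & p(c_m \mid x_1) \\
\vdots          & \ddots & \vdots          \\
p(c_1 \mid x_n) & \cdots & p(c_m \mid x_n)
\end{bmatrix},
\qquad k = 1, 2

% (2) decision-level fusion over all classifiers
p(c_j \mid x_i, E_1, \ldots, E_K) = \sum_{k=1}^{K} w_k \, p(c_j \mid x_i, E_k)

% (3) weights from normalized correct recognition rates
w_k = \frac{A_k}{\sum_{l=1}^{n} A_l}
```

With this normalization the weights sum to 1, so the fused values in (2) remain a valid probability distribution over the classes.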
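A minimal sketch of the decision-level fusion step follows, assuming the weighted-sum form described for Equation (2) and the normalized recognition rates described for Equation (3). The recognition rates and posterior values are made-up numbers for illustration, and the function names are hypothetical.

```python
def classifier_weights(accuracies):
    """Normalize correct recognition rates so that the weights sum to 1."""
    total = sum(accuracies)
    return [a / total for a in accuracies]

def fuse(posteriors, weights):
    """Weighted sum of the per-classifier posteriors for one sample."""
    num_classes = len(posteriors[0])
    return [sum(w * p[j] for w, p in zip(weights, posteriors)) for j in range(num_classes)]

# Assumed recognition rates: CNN-feature SVM 0.9, HOG-feature SVM 0.6.
weights = classifier_weights([0.9, 0.6])

# Assumed posteriors over two look-alike classes for one sample.
p_cnn = [0.55, 0.45]   # the CNN features barely separate the two classes
p_hog = [0.20, 0.80]   # the HOG features separate them clearly
fused = fuse([p_cnn, p_hog], weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this example the HOG classifier overturns the weak CNN decision even though its weight is lower, which is the situation the fusion strategy is meant to exploit.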

Feature map fusion strategy.
This method sums the convolution feature map and the HOG feature map, which is genuine feature fusion. However, for different objects and scenes, different features contribute differently to classification accuracy, so a fixed fusion rule is insufficient.
Different features are sensitive to different forms of objects, and a single fixed model cannot determine the feature weights that optimize classification accuracy. To address this problem, this section proposes ResNet-S, a feature fusion network with adaptive weights, and uses multiple accurate binary classifiers instead of a single multi-class classifier. Each binary classifier is only responsible for judging whether an object is its target object.
Another advantage of binary classifiers over a multi-class classifier is that the multi-class classifier does not distinguish sufficiently between similar categories or objects of the same category. For example, suppose the features of objects A and B in the scene are very close. With the multi-class method, the probability scores of the two categories will be almost the same. A binary classifier, however, given some prior knowledge (which objects are prone to misclassification), can be trained with objects similar to the target as negative samples to improve classification accuracy. The loss function of the multi-class network is the softmax loss, which is the multi-class form of the cross-entropy loss. In this paper, the binary classification network uses the cross-entropy loss directly, which is derived as follows:
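One way to read the "one binary classifier per target" idea is sketched below. It is an assumption-laden illustration, not the paper's code: each class has its own raw score, each score is turned into an independent probability, and a crop that no classifier accepts is treated as background. The raw scores and the 0.5 threshold are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def classify_with_binary_heads(scores, threshold=0.5):
    """scores maps class name -> raw output of that class's binary classifier.

    Each classifier answers only "is this my target object?". The most
    confident one above the threshold wins; otherwise the crop is background.
    """
    probs = {name: sigmoid(s) for name, s in scores.items()}
    best = max(probs, key=probs.get)
    label = best if probs[best] >= threshold else "background"
    return label, probs

# Assumed raw scores for two image crops.
label, probs = classify_with_binary_heads(
    {"water bottle": 2.0, "backpack": -1.5, "chair": -3.0})
rejected, _ = classify_with_binary_heads(
    {"water bottle": -2.0, "backpack": -1.0, "chair": -4.0})
```

Unlike a softmax output, these probabilities do not have to sum to 1, so a crop can be rejected by every classifier at once.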
In the binary classification problem, the true sample label is 1 or 0, representing positive and negative samples respectively. The model output usually passes through the sigmoid function, which turns it into a probability in [0, 1]. This probability reflects how likely the target is to be a positive sample: the larger the value, the more likely. The sigmoid function is:

$$g(s) = \frac{1}{1 + e^{-s}}$$

In the equation, s is the output of the preceding layer of the model. The sigmoid function has the following properties: when s = 0, g(s) = 0.5; when s is much greater than 0, g(s) ≈ 1; when s is much less than 0, g(s) ≈ 0. g(s) is the model's predicted probability used in the cross-entropy equation. It represents the probability that the current sample is a positive sample:

$$\hat{y} = P(y = 1 \mid x) = g(s)$$
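The remaining steps follow the standard derivation of the binary cross-entropy loss. The sketch below uses ŷ = g(s) for the predicted probability and y for the true label; these symbol names are assumed here, since the original equations are not available.

```latex
% probabilities of the two labels
P(y = 1 \mid x) = \hat{y}, \qquad P(y = 0 \mid x) = 1 - \hat{y},
\qquad \hat{y} = g(s) = \frac{1}{1 + e^{-s}}

% both cases in a single expression
P(y \mid x) = \hat{y}^{\,y} \, (1 - \hat{y})^{\,1 - y}, \qquad y \in \{0, 1\}

% negative log-likelihood gives the cross-entropy loss
L = -\log P(y \mid x) = -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr]
```

For a positive sample (y = 1) the loss reduces to −log ŷ, and for a negative sample (y = 0) to −log(1 − ŷ), so in both cases the loss is small when the prediction agrees with the label.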
ResNet-S and ResNet-50 comparative experiment
A network performance experiment is carried out with the streetscape as the scene. Ten kinds of objects were selected: traffic light, stop sign, car, person, motorbike, laptop, water bottle, chair, desk and backpack, as shown in Fig. 9.

Objects detected in this paper.
The data set used in this paper consists of real-world pictures taken in multiple urban environments. The objects to be tested are common objects in the urban environment. There are 800 pictures in the training set and 200 pictures in the test set.
The deep learning framework used in this project is PyTorch. PyTorch is the Python version of Torch, a deep learning framework developed by the Facebook team. Unlike the static computation graph of TensorFlow, PyTorch's computation graph is dynamic: it can be changed at run time according to the computation, and intermediate results can be printed at any time, which is convenient for network debugging.
First, the Faster R-CNN model trained on ImageNet is applied directly to the real scene. ImageNet is a computer vision recognition project and the largest image recognition database in the world, established by Stanford computer scientists to simulate the human recognition system. The dataset contains more than 14 million images, at least one million of which provide bounding boxes for target detection. In this paper, the Faster R-CNN network model pre-trained on ImageNet is used as the starting point for parameter optimization. The model covers 1000 object categories. With the detection threshold lowered, the 20 detection boxes with the highest scores are selected; the results are shown in Fig. 10. Most of the objects in the detection boxes are unrelated to the objects to be detected in this paper, so they can be cropped and used as background samples for the classification networks.

Test results of training network.
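The selection step ("lower the threshold, keep the 20 highest-scoring boxes") can be sketched as follows. The threshold value of 0.05, the box format and the function name `top_detections` are assumptions made for illustration; the paper does not state the threshold it used.

```python
def top_detections(detections, score_threshold=0.05, k=20):
    """Keep detections scoring at least score_threshold, then the k best of those.

    detections: list of (box, score) pairs, with box = (x1, y1, x2, y2).
    """
    kept = [d for d in detections if d[1] >= score_threshold]
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:k]

# Assumed detector output: 30 boxes with scores 0.00, 0.03, ..., 0.87.
detections = [((i, i, i + 10, i + 10), round(0.03 * i, 2)) for i in range(30)]
selected = top_detections(detections)
```

A low threshold trades precision for recall at this stage; the later binary classifiers are then responsible for rejecting the extra boxes.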
The network pre-trained with Faster R-CNN on ImageNet is then fine-tuned on our own data set. The training data are shown in Fig. 11, where the blue boxes are manually labeled category boxes. Each type of object appears more than 100 times in the data set.

Manually labeling training data set.
The training parameters of the network mainly include the learning rate ɛ, the momentum factor μ, the rotation angle α, and the number of pictures in each iteration. The settings are shown in Table 1:
Training parameter setting
The targets inside the boxes output by the fine-tuned network are cropped and used as positive samples for the binary classification networks. Fig. 12 shows positive samples of the green water bottle.

Water bottle positive sample.
For each type of object, 200 positive samples and 200 negative samples are used. The accuracy of ResNet-50 and ResNet-S on the ten types of samples is shown in Table 2.
ResNet-50 and ResNet-S comparative experiment
The data in Table 2 show that the average recognition accuracy of ResNet-S is basically the same as that of ResNet-50. The average performance of ResNet-50 is slightly higher and more stable, which suggests that ResNet-50 is more general. However, when classifying targets with more similar features (for example, distinguishing different types of bottles or water cups), the recognition accuracy of ResNet-S is slightly higher, mainly because the HOG gradient features improve network performance. ResNet-S is therefore the better choice for the precise classification network in this paper.
In the loss curve (Fig. 13), the abscissa is the epoch (one epoch is one pass over the whole training set) and the ordinate is the network loss. ResNet-S converges more slowly than ResNet-50. The first reason is that ResNet-S contains global mean pooling, and this module does not converge easily. The second is that the HOG feature parameters do not take part in training, and their feature distribution differs greatly from that of the convolution features, which also slows training. When the two networks finally converge, their losses are basically the same.
First, Faster R-CNN is used directly for an accurate classification experiment; then low-threshold detection combined with the ResNet-S classifier is tested on the COCO database. The test results are shown in Fig. 14, where Fig. 14(a) is the original method and Fig. 14(b) is the improved method. In the improved method, the probability value is the probability output by the feature extraction network, and only detection boxes with probability values above 0.25 are retained.

Network loss reduction chart.

Accurate classification network experiment map.
In Fig. 14, the green oval boxes mark the false detections and missed objects in the scene. The missed detection rate in Fig. 14(b) is clearly lower than in Fig. 14(a), while the average probability scores are very close. The improved algorithm has some advantages over the traditional algorithm in recognizing targets with complex texture that blend into the background, such as a knapsack or a traffic sign. However, when two targets overlap over a large area, both algorithms produce duplicate marks to the same degree. As for detection time, because the method in this paper uses multiple binary classification networks instead of a multi-class network, detection takes longer than with Faster R-CNN: when the number of detection categories is 80, detection time increases by about 20%.
Accuracy and recall are calculated with a probability threshold of 0.5. They are compared on 50 pictures of the urban environment between using Faster R-CNN directly and using the accurate classification method combined with prior knowledge. A detection counts as correct only if it identifies the exact category of the target. The results are shown in Table 3.
Comparison between traditional Faster-RCNN and the method in this paper
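For reference, the two metrics can be computed from detection counts as sketched below. This assumes that the "accuracy rate" used here corresponds to what is usually called precision (correct detections over all detections); the counts are made-up values and are not taken from Table 3.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Hypothetical counts: 80 correct detections, 20 wrong ones, 10 missed targets.
precision, recall = precision_recall(true_positives=80,
                                     false_positives=20,
                                     false_negatives=10)
```

Under the rule stated in the text, a box around the right object with the wrong specific category counts as a false positive, and the object itself counts as missed.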
The data in the table show that, when part of the prior knowledge is known (such as constraints on the target's spatial position), the proposed method can greatly improve the accuracy and recall of target detection in a known environment. The method is therefore highly practical.
This paper first introduces the basic principles of the SENET network and HOG features, compares several feature fusion strategies, and proposes ResNet-S, a target classification network with adaptive weights. Experiments show that ResNet-S performs very well in detecting targets of similar categories. The paper also proposes replacing the multi-class network with multiple accurate binary classification networks, which greatly improves the accuracy and recall of target detection in a known environment.
Acknowledgment
This paper is supported by “the Fundamental Research Funds for the Central Universities”: Research on the visual evaluation of winter landscape in cold cities in Northeastern China—with examples of Shenyang and Harbin (PID: 02040022120002).
