Abstract
Tens of thousands of work-related injuries and deaths are reported in the construction industry each year, and a high percentage of them occur because construction workers are not wearing safety equipment. To address this safety issue in prefabricated building construction, personnel must be identified automatically and their safety characteristics detected at the same time. This paper therefore proposes a deep feature detection algorithm based on an Extended-YOLOv3 model. A security feature recognition network and a feature transmission network are added to the YOLOv3 network so that security features are detected while personnel are identified. First, a security feature recognition network is added in parallel with the YOLOv3 detection network to analyze what construction workers are wearing. Second, an S-SPP module is added to the object detection and feature recognition networks to broaden the features of the deep network and help it extract more useful features from high-resolution input images. Finally, a dedicated feature transmission network is designed to transfer features between the construction worker detection network and the security feature recognition network, so that each network can obtain feature information from the other. Experimental results show that the Extended-YOLOv3 algorithm improves on the YOLOv3 algorithm by 1.3% in AP.
Introduction
The harsh environment and changeable conditions of a construction site can easily lead to safety accidents [1], many of which involve workers being struck while not wearing safety equipment. It is therefore necessary to locate workers on the construction site remotely and to check that they wear safety equipment in accordance with existing safety regulations and standards [2]. Vision technology is currently used to locate workers for this purpose. It is becoming increasingly common to place cameras around a site to capture site activities and record construction progress [3–6], and with a vision-based approach workers do not need additional physical markers.
Object detection algorithms based on deep learning can cope with the complex and varied appearance of the same or similar objects in computer vision. At present, such algorithms [7–12] fall mainly into two families. The first is the two-stage family represented by the R-CNN series [7–9]. These algorithms divide object detection into two stages: a Region Proposal Network (RPN) first extracts candidate objects, and a detection network then predicts the locations and categories of the candidates. The second is the one-stage family represented by the YOLO series [10–12]. These algorithms do not use an RPN but produce the location and category of each object directly from the network, giving an end-to-end object detection algorithm. One-stage detection algorithms are therefore usually faster. Many researchers have modified YOLO-based networks to achieve a desired effect. For example, in their work on YOLOv3 airport aircraft detection, Guo Jinxiang and others [13] added dilated convolutions to the backbone network to enlarge its receptive field and improve detection accuracy. In their research on lightweight object detection networks for embedded platforms, Cui Jiahua et al. [14] replaced ordinary convolutions with depthwise separable convolutions to reduce the number of model parameters and improve detection speed. Yi, Z and others [15] proposed an improved tiny-yolov3 pedestrian detection algorithm that uses K-means clustering on the training set to find the best priors and achieves higher detection accuracy. Chang Haitao and others [16] improved Faster R-CNN by adjusting the size and number of anchors and enhancing the input image data to improve the final recognition result.
Wei Yongming and others [17] improved the YOLO v2 network by changing the filtering rules for candidate boxes, among other methods, and achieved good results in locating objects in aerial images taken by drones. Based on the idea of single-structure direct regression, Yi and others [18] proposed a deep neural network for object detection named ASSD. Using global relation information, ASSD learns to highlight useful regions of the feature maps while suppressing irrelevant information, thereby providing reliable guidance for object detection. Tengyue Li and others [19] proposed a data fusion framework that combines data collected from two sensors with the aim of improving HAR accuracy. To enhance the expressive ability of shallow features, Fu Kun and others [20] proposed a feature-fusion architecture that generates a multi-scale feature hierarchy, augmenting the features of shallow layers with semantic representations via a top-down pathway and combining the feature maps of top layers with low-level information via a bottom-up pathway. These improved algorithms are of great significance in the field of object detection, but none of them performs a secondary feature extraction on the detected objects.
This article proposes the Extended-YOLOv3 algorithm, an extension of the YOLOv3 model that extracts features for security feature recognition while detecting personnel. First, a security feature recognition branch for the detected objects is added in parallel to the object detection framework of YOLOv3, and the S-SPP algorithm is used to broaden the deep features of the backbone network. Then, a feature transmission network is added between the two networks so that each can obtain more feature information from the other. The result is an end-to-end convolutional neural network for construction worker detection and safety feature recognition.
The Extended-YOLOv3 algorithm can be used on most construction sites and does not require additional remote sensors on equipment or workers. Occlusion of the construction cameras can be avoided or alleviated by mounting the cameras higher and choosing their positions on the site carefully. The work in this article does not replace field inspectors. Instead, it helps them monitor construction workers by reporting the workers' movements and characteristics in real time, so that worker safety can be judged and the safety of the construction site improved.
The main contributions of this article are as follows:
An Extended-YOLOv3 multi-scale deep feature detection algorithm is proposed. This algorithm performs secondary feature extraction on detected human objects, unifies the human object detection network and object feature recognition network into a neural network, and implements an end-to-end detection method. For people in the acquired image or video data, the algorithm can use the human object detection network to detect multiple objects at the same time, detect the position and size of these objects, and use the object feature recognition network to detect the features of these objects.
Three sub-networks of the Extended-YOLOv3 algorithm are designed: a construction worker detection network, a security feature recognition network, and a feature transmission network.
The S-SPP algorithm is used to broaden the deep network features and help the network extract useful multi-scale deep features from high-resolution input images. A dedicated feature transmission network is designed to transfer features between the construction worker detection network and the security feature recognition network, so that each network can benefit from the other.
Traditional YOLOv3 algorithm
The YOLOv3 algorithm has a wide range of applications in the field of object detection. It can accurately find the location of an object in a given picture, and label the type of the object. The YOLO series model is one of the fastest and most accurate object detection algorithms. It has become one of the most popular object detection algorithms in practical applications.
Compared with the R-CNN series of object detection algorithms, YOLO takes a more direct approach: it converts the object detection problem into a regression problem. The input image is fed into a convolutional neural network, which directly predicts object positions and category probabilities at the output layer.
As shown in Fig. 1, the YOLOv3 algorithm first divides the input image into a grid of cells. If the center point of an object falls in a cell, that cell is responsible for predicting the object. For example, the red cell in Fig. 1 is responsible for predicting the position and size of the construction worker, and the orange-yellow box indicates the worker's actual size. Several sets of predictions are then made for each cell. Each set contains $t_x$, $t_y$, $t_w$, $t_h$, $t_o$, and C class probabilities. Here $t_x$ and $t_y$ represent the center coordinates of the predicted object, measured from the upper-left corner of the cell and ranging from 0 to 1; $t_w$ and $t_h$ represent the scaling relative to the current prior box; and $t_o$ represents the confidence of the prediction, that is, how reliable this set of predictions is.
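The cell assignment and box parameterization described above can be illustrated with a minimal sketch. It assumes the decoding commonly used with YOLOv3 (a sigmoid on the raw center offsets and confidence, an exponential on the size terms); the function names and the 116×90 prior in the usage note are illustrative choices and are not taken from this paper.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def responsible_cell(center_x, center_y, grid_size, img_size):
    """Return (column, row) of the grid cell that contains the object center."""
    stride = img_size / grid_size
    return int(center_x // stride), int(center_y // stride)


def decode_box(raw, cell_col, cell_row, prior_w, prior_h, grid_size, img_size):
    """Map one raw prediction (t_x, t_y, t_w, t_h, t_o) to pixel units.

    Assumption: raw outputs are unbounded, so the offsets and the confidence
    are squashed into (0, 1) with a sigmoid before use.
    """
    t_x, t_y, t_w, t_h, t_o = raw
    stride = img_size / grid_size
    # Offsets are measured from the upper-left corner of the cell.
    center_x = (cell_col + sigmoid(t_x)) * stride
    center_y = (cell_row + sigmoid(t_y)) * stride
    # Width and height rescale the prior box.
    width = prior_w * math.exp(t_w)
    height = prior_h * math.exp(t_h)
    return center_x, center_y, width, height, sigmoid(t_o)
```

For example, `decode_box((0, 0, 0, 0, 0), 6, 6, 116, 90, 13, 416)` places the box center in the middle of cell (6, 6) of a 13×13 grid and leaves the prior size unchanged.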

Fig. 1. YOLOv3 algorithm.
Further, YOLOv3 improves the object detection accuracy of the YOLO series, and its network structure is shown in Fig. 2. First, YOLOv3 uses a new backbone network, the classification network Darknet-53, as its feature extractor. The YOLOv3 structure contains no pooling layers and no fully connected layers. During forward propagation, the tensor size is changed by adjusting the stride of the convolution kernels. The feature map is down-sampled 5 times in this way, so it is reduced to 1/32 of the original input size, and YOLOv3 therefore usually requires the input image size to be a multiple of 32. Darknet-53 uses many consecutive 3×3 and 1×1 convolution layers and organizes them into residual blocks similar to those of ResNet. To improve network performance, each convolution layer is followed by a batch normalization layer and a Leaky ReLU activation. Batch normalization speeds up the convergence of training, and the Leaky ReLU activation helps avoid vanishing gradients in deep networks. As a result, Darknet-53 is much more powerful than Darknet-19 and more efficient than ResNet-101.

Fig. 2. YOLOv3 network structure.
Secondly, following the feature pyramid network idea for object detection [21], YOLOv3 uses feature maps at three different scales for location and category prediction. The finer the grid, the smaller the objects that can be detected, which effectively improves detection accuracy. YOLOv3 predicts 3 bounding boxes at each scale, so a total of 9 bounding boxes are matched to the 9 prior box sizes obtained by clustering, and objects of different sizes can be detected.
In YOLOv3, three anchor boxes are predicted at each scale, so a total of nine anchor boxes are used to detect objects of different sizes. Each anchor box corresponds to 5 + C values. The 5 values describe the predicted box: the coordinates of its center point (x, y), its width and height (w, h), and the confidence p that an object is present. C is the total number of categories in the dataset. For a 416×416 input, a total of 10647 (13×13×3 + 26×26×3 + 52×52×3) boxes are output across the three scales. Finally, the non-maximum suppression algorithm is used to keep the box with the highest confidence score as the detection box.
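The box count quoted above follows directly from the three down-sampling factors. The short sketch below reproduces the arithmetic; the function names are illustrative only.

```python
def count_predicted_boxes(input_size=416, strides=(32, 16, 8), boxes_per_cell=3):
    """Return the grid size at each scale and the total number of predicted boxes."""
    if any(input_size % stride for stride in strides):
        raise ValueError("input size must be divisible by every stride")
    grids = [input_size // stride for stride in strides]
    total = sum(grid * grid * boxes_per_cell for grid in grids)
    return grids, total


def values_per_box(num_classes):
    """Each box carries x, y, w, h, confidence, and one score per class."""
    return 5 + num_classes
```

With the default 416×416 input this gives grids of 13, 26, and 52 cells per side and 10647 boxes in total.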
YOLOv3 uses cross-entropy loss as its loss function. For confidence and category prediction, because an object may belong to multiple categories, YOLOv3 classifies each box with multiple independent logistic classifiers instead of Softmax. This preserves accuracy and avoids the problem that Softmax is not suitable for classification with overlapping category labels.
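The difference between Softmax and independent logistic classifiers can be seen in a small numerical sketch. The three scores below are hypothetical and stand for a box that carries two overlapping labels.

```python
import math


def softmax(logits):
    """Scores compete with each other: the outputs always sum to 1."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_logistic(logits):
    """Each label gets its own sigmoid, so several labels can be near 1."""
    return [1.0 / (1.0 + math.exp(-value)) for value in logits]


logits = [3.0, 3.0, -4.0]  # hypothetical scores: labels 0 and 1 both apply
soft = softmax(logits)
logi = independent_logistic(logits)
```

Under Softmax neither of the two correct labels can exceed 0.5 because they share the probability mass, whereas the independent logistic outputs are both above 0.9.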
YOLOv3 is one of the most widely used object detection algorithms. For the construction industry, this paper adds object feature recognition to YOLOv3 and proposes a multi-scale deep feature detection algorithm for construction workers based on Extended-YOLOv3, in which safety features are identified while construction workers are detected. The framework is based on the YOLO network, integrates the multi-object detection network and the feature recognition network as multiple tasks, and makes better use of the features.
Specifically, the Extended-YOLOv3 algorithm separates the construction worker detection task from the safety feature identification task. This reduces the influence of safety feature identification on construction worker detection and preserves the stability of the object detection function of YOLOv3. The object detection network detects multiple people at the same time, together with their positions and sizes, while the feature detection network identifies the features of these people. Feature extraction, person detection, and feature detection are unified in one neural network to achieve an end-to-end detection method [22].
The Extended-YOLOv3 algorithm in this paper comprises three kinds of sub-network: a construction worker detection network, a security feature recognition network, and two feature transmission networks. Figure 3 shows the main structure of the Extended-YOLOv3 algorithm. The construction worker detection network follows the Darknet53 network and consists mainly of convolutional layers. It detects people and their locations by convolution from the image features output by Darknet53. The security feature recognition network is similar: it also follows the Darknet53 network, consists mainly of convolutional layers, and identifies security features by convolution from the image features output by Darknet53. The two feature transmission networks make full use of the features obtained by the construction worker detection network and the security feature recognition network by transmitting feature signals between them. In this way, each network can benefit from the other.

Fig. 3. Network structure of Extended-YOLOv3.
For higher efficiency and better feature extraction, Extended-YOLOv3 also uses Darknet-53 as its feature extraction network. On top of this convolutional neural network, two new modules are designed: the S-SPP module and the feature transmission module. Building on the original residual network [23], these modules extract deep features more effectively and enhance the prediction ability of the network.
S-SPP module
To obtain more, and more effective, feature information from the input image data of the multi-scale network and so improve its predictions, this paper proposes a scale-invariant spatial pyramid pooling (S-SPP) module based on the idea of spatial pyramid pooling (SPP). It enriches the feature information extracted by the deep convolutional neural network [24] with minimal modification. The network structure of the proposed S-SPP module is shown in Fig. 4. The module consists mainly of 4 parallel max pooling layers, 1 concatenation layer, and 1 convolution layer. The kernel sizes of the max pooling layers are 1×1, 5×5, 9×9, and 13×13. The four max pooling layers extract multi-scale features with different receptive fields from the input features, and the concatenation layer fuses these multi-scale features along the channel dimension of the feature map. The convolution layer restores the features to the size they had before entering the S-SPP module, so that the additional features introduced by the module do not affect subsequent operations. The multi-scale features thus obtained within a single layer of the deep network can further improve prediction accuracy at a small computational cost.
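The data flow through the S-SPP module can be sketched in plain Python on nested lists. This is an illustration under stated assumptions, not the paper's implementation: the pooling is assumed to use stride 1 with size-preserving padding, the final convolution is assumed to be 1×1, and the weights are arbitrary averaging coefficients chosen for the demonstration.

```python
def max_pool_same(channel, kernel):
    """Stride-1 max pooling that keeps the spatial size (assumed padding).

    Positions outside the map are ignored, which is equivalent to padding
    with minus infinity.
    """
    height, width = len(channel), len(channel[0])
    reach = kernel // 2
    pooled = []
    for i in range(height):
        row = []
        for j in range(width):
            window = [
                channel[y][x]
                for y in range(max(0, i - reach), min(height, i + reach + 1))
                for x in range(max(0, j - reach), min(width, j + reach + 1))
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


def s_spp(feature_map, weights, kernels=(1, 5, 9, 13)):
    """feature_map: list of C channels, each an H x W nested list.

    weights: one row of 4*C coefficients per output channel (assumed 1x1 conv).
    """
    # Four parallel pooling branches, concatenated along the channel axis.
    branches = [max_pool_same(ch, k) for k in kernels for ch in feature_map]
    height, width = len(feature_map[0]), len(feature_map[0][0])
    # The 1x1 convolution mixes the 4*C channels back down to len(weights).
    return [
        [[sum(w * b[i][j] for w, b in zip(row, branches)) for j in range(width)]
         for i in range(height)]
        for row in weights
    ]


# Demonstration on a 2-channel 4x4 map with averaging weights (hypothetical).
ramp = [[float(4 * i + j) for j in range(4)] for i in range(4)]
flat = [[0.0] * 4 for _ in range(4)]
weights = [[0.25 if idx % 2 == out_ch else 0.0 for idx in range(8)]
           for out_ch in range(2)]
out = s_spp([ramp, flat], weights)
```

The output has the same number of channels and the same spatial size as the input, which is the property the text attributes to the final convolution layer.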

Fig. 4. Network structure of S-SPP module.
To make full use of the features of the construction worker detection network and the security feature recognition network, this paper designs a feature transmission network that transfers feature signals between the two. The output of one network is collected, processed, and passed into the other, so that feature signals are transmitted between the sub-networks and the features of the current network are corrected using those of the other. In this way, each network gathers the features of both networks and benefits from the other. When the construction worker detection network is trained and used for prediction, it obtains features from its own network and can also use the features transmitted from the security feature recognition network to correct them. Similarly, when the security feature recognition network is trained and used for prediction, it obtains features from its own network and can also use the features transmitted from the construction worker detection network to correct them. The multi-task framework proposed in this paper can therefore exploit the complementary strengths of different features to produce better inferences.
The feature transmission network of the Extended-YOLOv3 algorithm is shown in Fig. 5 and can be expressed as $H(x_a) = F(x_b, W_b) + x_a$. It takes features from one network and adds them to the other. This process can be viewed as learning a residual function $F(x_b, W_b) = H(x_a) - x_a$, where $x_a$ is the shallow output, $H(x_a)$ is the deep output, and $F(x_b, W_b)$ is the transformation represented by the two layers between them. Because $F(x_b, W_b)$ is a function of $x_b$, it acts as a supplement to $x_a$ that fine-tunes it. The task thus changes from learning a mapping from $x_a$ to a new $x_{a+1}$ to finding the difference between $x_a$ and $x_{a+1}$ based on $x_b$, which is a simpler task and can effectively improve the network. When the features represented by the shallow $x_a$ are mature enough that any change to $x_a$ would increase the loss, $F(x_b, W_b)$ automatically tends toward 0 and $x_a$ continues along the identity mapping path. The feature transmission network therefore fits the difference between the current features and the further optimized features, using the features of the other network, which is easier and more accurate than generating new features by simply pooling the features of both networks.

Fig. 5. Network structure of feature transfer.
The mathematical expression of the feature transmission network can be simply written as follows:
$$x_{a+1} = x_a + F(x_b, W_b)$$
Here $x_a$ and $x_b$ represent the inputs of the construction worker detection network and the security feature recognition network respectively, $F(x_b, W_b)$ represents the feature map supplemented from one network to the other, and $x_{a+1}$ represents the further optimized network features.
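The additive form of the feature transmission can be illustrated with a toy example. In the sketch below, the transformation F is replaced by a per-element scaling purely for illustration; in the paper F is realized by network layers, and the function names here are hypothetical.

```python
def transform(x_b, w_b):
    """Stand-in for F(x_b, W_b): scale each feature of the other network."""
    return [weight * value for weight, value in zip(w_b, x_b)]


def transmit(x_a, x_b, w_b):
    """Add the transformed features of the other network to the current ones."""
    return [own + extra for own, extra in zip(x_a, transform(x_b, w_b))]


x_a = [1.0, 2.0]    # features of the current network
x_b = [10.0, 20.0]  # features of the other network
unchanged = transmit(x_a, x_b, [0.0, 0.0])
corrected = transmit(x_a, x_b, [0.1, -0.05])
```

When the weights are zero the current features pass through unchanged, which corresponds to the identity mapping described in the text; non-zero weights apply a correction derived from the other network.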
For backpropagation, assuming the loss function is ɛ, according to the chain rule of backpropagation, we can get:
It can be found that this derivative can be divided into two parts, which are not passed through the weight layer
In summary, in the forward pass the feature transmission network provides an identity mapping that carries the features of the earlier part of the current network through to its later part. Its most important function is to change how information is propagated forward and backward, which greatly facilitates optimization of the network.
This work aims to analyze construction workers further while detecting them, in order to assist humans in safety inspection. The Extended-YOLOv3 algorithm therefore adds a security feature recognition network directly after the backbone network Darknet53 to identify the security features of construction workers and obtain their security feature attributes. The safety equipment considered in this paper consists of safety helmets, protective clothing, and other visible objects. A multi-task model is thus established for training and prediction, which saves computing resources.
To correspond to the construction worker detection network, the security feature recognition network also draws on the feature pyramid network (FPN) idea, adopts multi-scale security feature recognition, and outputs feature maps at three different scales. As shown in Fig. 6, an input image size of 416×416 is taken as an example. The feature recognition network follows the Darknet53 network, and the detection result at the first scale (13×13) is obtained through 7 convolutional layers. This feature map is down-sampled 32 times relative to the input image and has a size of 13×13. Because of the high down-sampling factor, each location of this feature map represents a relatively large region of the image, so it corresponds to the largest-scale output of the construction worker detection network. To obtain the other two scales, the feature map from the fifth convolutional layer of the largest-scale path is up-sampled to 26×26 and then merged with the 16-times down-sampled feature map from the Darknet53 network to obtain a finer-grained feature map. After 7 convolutional layers, this yields a 26×26 feature map, down-sampled 16 times relative to the input image. It represents smaller objects and corresponds to the middle-scale output of the construction worker detection network. Finally, the feature map obtained after the fifth convolutional layer of this path is up-sampled and concatenated with the 52×52 feature map produced earlier in the convolution process to obtain a 52×52 feature map, down-sampled 8 times relative to the input image. It represents the smallest objects and corresponds to the smallest-scale output of the construction worker detection network.

Fig. 6. Network structure of feature recognition.
To correspond to the construction worker detection network, the security feature recognition network predicts 3 groups of features at each scale. When predicting confidence and category, each group uses multiple independent logistic classifiers [26] to classify each box, which allows it to support multi-label objects.
When the object detection network is trained and used for prediction, it obtains features from its own network, expands them with the S-SPP module, and can use the features transmitted from the object feature recognition network to correct them. Similarly, when the object feature recognition network is trained and used for prediction, it obtains features from its own network, expands them with the S-SPP module, and can use the features transmitted from the object detection network to correct them.
The Extended-YOLOv3 algorithm performs detection on the input image at three scales, corresponding to 32-times, 16-times, and 8-times down-sampling. Up-sampling is used in the network because deeper layers give better feature representations. For example, if detection at 16-times down-sampling used the features from the fourth down-sampling step directly, it would rely on relatively shallow features, and the results are generally not good. The 32-times down-sampled features are deeper, but their spatial size is too small. Extended-YOLOv3 therefore up-samples the 32-times down-sampled feature map by a factor of 2, doubling its size to match that of the 16-times down-sampled map. Similarly, the 8-times scale is obtained by up-sampling the 16-times down-sampled feature map by a factor of 2, so that deep features can be used for detection at every scale.
The deep features are up-sampled so that their dimensions match those of the feature layer with which they are fused. The 13×13×256 feature map is up-sampled to 26×26×256 and concatenated with the 26×26×512 features from the shallower layer to give a 26×26×768 feature map. A series of 3×3 and 1×1 convolutions is then applied. This not only increases non-linearity and generalization, improving network accuracy, but also reduces the number of parameters and improves real-time performance. The last feature map, of size 52×52×255, is obtained similarly.
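The shape arithmetic of the up-sample-and-concatenate step can be checked with a few lines of Python. The function name is illustrative, and shapes are written as (height, width, channels).

```python
def upsample_and_concat(deep_shape, shallow_shape, scale=2):
    """Up-sample the deep map, then concatenate it with the shallow map
    along the channel axis. Shapes are (height, width, channels)."""
    height, width, channels = deep_shape
    up_height, up_width = height * scale, width * scale
    if (up_height, up_width) != tuple(shallow_shape[:2]):
        raise ValueError("spatial sizes must match before concatenation")
    return up_height, up_width, channels + shallow_shape[2]
```

Calling `upsample_and_concat((13, 13, 256), (26, 26, 512))` reproduces the 26×26×768 shape given in the text.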
Loss function
The results of the Extended-YOLOv3 algorithm are position (x, y), scale (w, h), category (C), person (P), and person characteristics (f). Among them, the loss function of the scale uses the total square error [27], and the loss function of position, person, and the person feature uses binary cross entropy [28], so the loss function of the Extended-YOLOv3 algorithm is:
Among them, $loss_{grid}$ represents the total loss of one dimension of the Extended-YOLOv3 framework. If the bounding box is responsible for detecting personnel, the
Among them, $loss_{all}$ represents the total loss of the Extended-YOLOv3 algorithm.
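The two kinds of loss term named in this section, squared error for the scale and binary cross entropy for position, person, and person features, can be sketched for a single box as follows. The unweighted sum, the dictionary layout, and the function names are assumptions made for illustration and are not the loss definition of this paper.

```python
import math


def bce(pred, target, eps=1e-7):
    """Binary cross entropy for one probability, clipped for stability."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def squared_error(pred, target):
    """Sum of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def box_loss(pred, target):
    """Hypothetical per-box loss: unweighted sum of the four terms."""
    position = sum(bce(p, t) for p, t in zip(pred["xy"], target["xy"]))
    scale = squared_error(pred["wh"], target["wh"])
    person = bce(pred["person"], target["person"])
    features = sum(bce(p, t) for p, t in zip(pred["features"], target["features"]))
    return position + scale + person + features


target = {"xy": [0.5, 0.5], "wh": [0.2, 0.1], "person": 1.0, "features": [1.0, 0.0]}
good = {"xy": [0.5, 0.5], "wh": [0.2, 0.1], "person": 0.99, "features": [0.99, 0.01]}
bad = {"xy": [0.9, 0.1], "wh": [1.0, 1.0], "person": 0.1, "features": [0.1, 0.9]}
```

A prediction close to the target (`good`) yields a smaller value than one far from it (`bad`), which is the behavior any concrete weighting of these terms would be expected to preserve.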
Data set preparation and simulation environment
This paper adds a security feature recognition network to the YOLO framework and proposes the Extended-YOLOv3 algorithm, which gives better results, has broader application capabilities, and is better suited to real-time construction worker detection and security feature recognition on the construction site. The validity of the Extended-YOLOv3 algorithm is verified on the collected construction personnel data set. The algorithm is implemented in Python and PyTorch and uses some of the publicly available weight parameters of Darknet [29] and YOLOv3. In the verification experiments, the model is trained and evaluated on a Windows workstation equipped with an Intel Core i7-9700K CPU @ 3.60 GHz, 64 GB RAM, and an Nvidia GeForce RTX 2080 Ti.
The collected prefabricated building personnel data set consists of 3 parts, as shown in Fig. 7: images collected at prefabricated building sites, images collected from the Internet, and the INRIA Person Dataset. Together they comprise 5817 static images. The training and validation sets contain 5017 and 800 images respectively. Each image is annotated with bounding boxes, person classes, and feature classes. All models used in this paper are trained on the training set and evaluated on the validation set.

Fig. 7. Data set composition.
Multi-scale training
YOLOv3 contains only convolutional layers and no fully connected layer [30], so the size of the input image does not need to be fixed. To make the model more robust, multi-scale training [31] is introduced: during training, the size of the model's input image is changed after every set number of iterations. This training method enables the same network to detect images of different resolutions.
Note: This step is used when training object detection, not when training feature extraction networks.
The default network input is 416×416. After 5 down-sampling steps, a 13×13 feature map is output, which corresponds to 32-times down-sampling. Multiples of 32 are therefore used as input sizes, specifically 320, 352, 384, 416, 448, 480, and 512, a total of 7 input sizes.
When the input image size is 320×320, the size of the large-scale feature map is 10×10, and when the input image size is 512×512, it is 16×16.
Each time the input image size changes, the last detection layer must be processed accordingly before training continues.
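The multi-scale schedule described above can be sketched as follows. The interval of 10 iterations is an assumed value, since the text only states that the size changes after a certain number of iterations, and the function names are illustrative.

```python
import random

# Seven input sizes, all multiples of 32, from 320 to 512.
INPUT_SIZES = list(range(320, 512 + 1, 32))


def pick_input_size(iteration, current_size, interval=10, rng=random):
    """Draw a new input size every `interval` iterations (assumed value),
    otherwise keep the current one."""
    if iteration % interval == 0:
        return rng.choice(INPUT_SIZES)
    return current_size


def coarsest_grid(input_size, stride=32):
    """Side length of the 32-times down-sampled feature map."""
    return input_size // stride
```

`coarsest_grid(320)` and `coarsest_grid(512)` return 10 and 16, matching the feature map sizes quoted in the text.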
Training process
In the verification experiment, the Extended-YOLOv3 algorithm is trained with SGD with a momentum of 0.9. The initial learning rate is 0.0001 and is reduced by a factor of 10 at the 500th and 800th epochs. The number of epochs is set to 1000 and the training batch size to 16.
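The step-wise learning rate schedule can be written as a small function. Whether the reduction takes effect at the start or the end of the 500th and 800th epochs is not stated, so the sketch assumes it applies from the milestone epoch onward.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(500, 800), factor=0.1):
    """Learning rate at a given epoch for a step schedule.

    Assumption: the rate drops as soon as `epoch` reaches a milestone.
    """
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * factor ** drops
```

In PyTorch, a comparable schedule could be obtained with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[500, 800]` and `gamma=0.1`, although the paper does not say which scheduler was used.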
The Extended-YOLOv3 algorithm uses convolution and predefined bounding boxes to predict bounding boxes, and outputs feature maps at 3 scales. Each cell of each feature map predicts 3 bounding boxes, and each bounding box predicts three groups of parameters: (1) the position of the box, that is, its center coordinates $t_x$ and $t_y$, height $t_h$, and width $t_w$; (2) the confidence $t_o$; and (3) the C category scores. There are 2 categories in the experiments in this paper.
The receptive field differs at each of the three detection scales. The 32-times down-sampled scale has the largest receptive field and is suitable for detecting large objects, the 16-times scale is suitable for objects of average size, and the 8-times scale has the smallest receptive field and is suitable for detecting small objects. When the input is 416×416, there are (52×52 + 26×26 + 13×13)×3 = 10647 detection boxes. The dimensions of the nine prior boxes are shown in blue in Fig. 8. The yellow box represents the ground-truth bounding box, and the red box is the grid cell in which the center point of the object is located. In Extended-YOLOv3, k-means clustering is likewise applied to the object boxes in the images, and the 9 cluster centers are selected as the prior boxes.
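Selecting prior boxes by k-means can be sketched as follows. The similarity measure is assumed to be the IoU between boxes that share a center, as is common for YOLO-style anchor clustering; the text does not state the distance used. The deterministic initialization and the toy data are illustrative choices.

```python
def iou_wh(box, centroid):
    """IoU of two boxes given as (width, height) and aligned at one corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100):
    """Cluster (width, height) pairs into k prior boxes, sorted by area."""
    # Deterministic start: k boxes spread evenly over the area-sorted list.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the centroid with which it overlaps most.
            best = max(range(k), key=lambda c: iou_wh(box, centroids[c]))
            clusters[best].append(box)
        updated = [
            (sum(b[0] for b in members) / len(members),
             sum(b[1] for b in members) / len(members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Toy data: three groups of box sizes (hypothetical values).
boxes = [(10, 20), (12, 22), (11, 21),
         (50, 60), (52, 58), (51, 59),
         (100, 200), (110, 190), (105, 195)]
anchors = kmeans_anchors(boxes, k=3)
```

On real annotations the same routine would be called with `k=9`, and the resulting centers would then be split across the three detection scales.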

Fig. 8. Anchor boxes of YOLOv3.
This paper experimentally evaluates three Extended-YOLOv3 network models with different input sizes: 416×416, 608×608, and 800×800. Multi-scale training is achieved by randomly adjusting the size of the input image. The object confidence and non-maximum suppression thresholds of all models are set to 0.1 and 0.5 respectively. In addition, some of the weight parameters of Darknet53 and YOLOv3 pre-trained on ImageNet are used to initialize the backbone network of the Extended-YOLOv3 algorithm. The model is then trained and fine-tuned in PyTorch for subsequent use.
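The role of the two thresholds can be illustrated with a minimal greedy non-maximum suppression sketch. The box format (x1, y1, x2, y2) and the function names are assumptions for illustration; the defaults mirror the 0.1 and 0.5 values given above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, conf_threshold=0.1, nms_threshold=0.5):
    """detections: list of (box, score). Returns kept detections, best first."""
    # Step 1: discard boxes below the confidence threshold.
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    # Step 2: keep a box only if it does not overlap a better box too much.
    kept = []
    for box, score in candidates:
        if all(iou(box, other) <= nms_threshold for other, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),    # overlaps the first box heavily
    ((20, 20, 30, 30), 0.7),
    ((40, 40, 50, 50), 0.05), # below the confidence threshold
]
kept = filter_detections(detections)
```

In this example the second box is suppressed by the first and the fourth is removed by the confidence threshold, leaving two detections.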
Several performance indicators are needed to evaluate the results of the Extended-YOLOv3 network model. This article evaluates all models on the following 6 indicators: (1) the precision of the object features, (2) the recall of the object features, (3) the F1-score of the object features, (4) the mean average precision (mAP) measured at an intersection over union (IOU) of 0.5 [32], (5) the frames per second (FPS) of the model, and (6) the model inference time.
For binary classification problems, samples can be divided into four types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN), Precision(P) and recall(R) are defined as follows (5) (6):
Precision measures the probability that a sample the classifier labels positive is indeed positive. Recall measures the ability of the classifier to find all positive samples.
The precision-recall curve, called the P-R curve for short, is obtained by plotting precision on the vertical axis against recall on the horizontal axis. The F1-score is also used to evaluate the performance of the model. It is defined as follows:

$$F1 = \frac{2 \times P \times R}{P + R} \tag{7}$$
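As a worked illustration of the precision, recall, and F1 definitions, the helper below computes all three from raw counts. The function name and the example counts are made up for illustration and are not taken from the paper's experiments; returning 0 for an empty denominator is a convention chosen here.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct detections, 2 false alarms, 8 missed objects.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it stays low unless both are high, which is why it is reported alongside them.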
AP stands for average precision, and mAP stands for mean average precision, which is obtained by averaging the AP values over multiple validation sets.
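The text does not state how AP is computed from the P-R curve. A common choice, assumed here, is the area under the curve with all-point interpolation (the precision envelope). The sketch below implements that choice, with mAP as a plain mean of AP values. Function names are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the P-R curve using all-point interpolation.

    recalls must be sorted in ascending order, with precisions aligned to it.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_values):
    """Arithmetic mean of a non-empty list of AP values."""
    return sum(ap_values) / len(ap_values)
```

For example, a curve through (R, P) = (0.5, 1.0) and (1.0, 0.5) gives an AP of 0.5 × 1.0 + 0.5 × 0.5 = 0.75.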
Fig. 9 (a) shows the convergence of the loss function during the training of the Extended-YOLOv3 network. The value of the loss function at the beginning of training is about 250. As training continues, the loss curve gradually flattens and reaches a minimum of 1.3492, indicating that the network has converged to a good training result. Fig. 9 (b) shows the accuracy of the object feature recognition network on the training set. The maximum accuracy of the object feature recognition network cannot exceed 95% because its output depends on the object detection network: when the object detection network fails to identify an object, the object feature recognition network fails as well.

Frame training situation.
Tables 1 and 2 summarize the detection performance of all models on the validation set of the collected data set.
Detection performance of YOLOv3 and YOLOv3-SPP3
Detection performance of Extended-YOLOv3 model
This study validates the effectiveness of the proposed multi-task network framework on the validation set of the collected data set. As shown in Table 1 and Fig. 10, when the input size is 800×800, the AP of YOLOv3-SPP3 is about 0.9% higher than that of YOLOv3. This means that the S-SPP module can help the network extract more useful multi-scale deep features from high-resolution input images. At the same time, because of the convolutional layer in the S-SPP module, the number of trainable parameters required by YOLOv3-SPP3 does not increase much.

AP performance comparison of three size models.
As shown in Table 1, Table 2, and Fig. 10, the feature transmission network improves performance at all three input sizes, making the AP of the Extended-YOLOv3 network model about 0.4% higher than that of YOLOv3-SPP3. This means that a feature transmission network can supply the features of one network to the other and enhance network performance. When the input size is 800×800, the AP of Extended-YOLOv3 is about 1.3% higher than that of YOLOv3. At the same time, the proposed framework can detect whether workers are wearing safety equipment. The proposed method can therefore better estimate the position of workers while also identifying their safety characteristics, which helps improve the safety of workers at construction sites.
Compared with the YOLOv3 network model, the Extended-YOLOv3 network model, which includes the S-SPP module and the feature transmission network, achieves a higher average precision and a better detection effect. The results show that models with deeper features (YOLOv3-SPP3) and wider features (Extended-YOLOv3) are more powerful and effective.
In addition, when the input size is 416×416, the detection performance of YOLOv3-SPP3 is poor. Smaller pictures contain fewer useful features, so the final convolutional layers in the S-SPP module have less useful information to work with, which makes it harder to fit well-optimized features and can even reduce the expressiveness of the fitted features. Therefore, the S-SPP module should be used when the input size of the network is large. If the fitting effect of the feature transmission network is not ideal during training, the weight of the feature transmission network can be reduced to 0 to eliminate its adverse effects.
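The effect of setting the feature transmission weight to 0 can be illustrated with a toy fusion rule. The sketch below assumes, purely for illustration, that the transmitted features are scaled by a scalar weight and added element-wise to the receiving network's own features. The actual fusion operation of the feature transmission network may differ, and the function name is hypothetical.

```python
def fuse_features(own, transferred, weight=1.0):
    """Assumed fusion rule: own + weight * transferred, element-wise.

    own and transferred are equal-length lists of floats standing in for
    feature maps. With weight == 0 the output equals the own features.
    """
    if len(own) != len(transferred):
        raise ValueError("feature vectors must have the same length")
    return [o + weight * t for o, t in zip(own, transferred)]


own = [1.0, 2.0]
transferred = [0.5, -1.0]
print(fuse_features(own, transferred, weight=1.0))  # transmission enabled
print(fuse_features(own, transferred, weight=0.0))  # transmission disabled
```

Under this assumption, a weight of 0 reduces each branch to its behavior without feature transmission, so a poorly fitted transmission network cannot degrade the detection branch.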
To verify the performance of the proposed model, it is compared with YOLO-V2 [33], YOLO-V3 [34], and SlimYOLOv3 [35], which are relatively advanced object detection models at present. These algorithms only detect the target and do not identify the features of the detected object. The algorithm in this paper not only detects the target but also identifies the security features of the detected object. Three pictures were selected from the detection images as typical results, because they cover tiny and blurry objects, sparse objects, and dense objects. As shown in Fig. 11, the Extended-YOLOv3 algorithm is not only superior to the other algorithms in detection performance but can also detect whether people are wearing safety equipment, which gives it higher practical application value.

Detection results of the four different models.
The F1-scores, mAP, FPS, and inference time of the models are shown in Table 3. In terms of detection performance, the proposed Extended-YOLOv3 model is slightly superior to the YOLO-V2 and SlimYOLOv3 models. The F1-score of Extended-YOLOv3 is 70.1, which is higher than that of the other three models: 1.8 higher than YOLO-V2, 0.5 higher than YOLO-V3, and 3.8 higher than SlimYOLOv3. Its mAP is 7.1 higher than that of SlimYOLOv3 and slightly lower than those of YOLO-V2 and YOLO-V3. However, the proposed model can additionally perform deep feature recognition on the detected target, which makes it more capable. Because the network is more complex and does more, its running time is longer than that of the other algorithms.
The performance comparison of the different models
The Extended-YOLOv3 algorithm integrates an S-SPP module, a feature transmission network, and an added feature recognition network. The original YOLOv3 model is left almost unmodified, which minimizes the impact of the newly added networks on the original network. However, the data set used in this article is an irregular data set collected from different channels, and its labeling standards are highly inconsistent. The experiments in this paper did not deliberately address this inconsistency. To improve the detection accuracy of the model, further research is needed either to solve the problem of non-uniform labeling standards or to use a large data set with uniform labeling standards.
This paper proposes Extended-YOLOv3, a multi-scale deep feature detection algorithm for construction workers that detects safety features while detecting the workers themselves, so that workers' safety measures can be supervised and their safety protected. The Extended-YOLOv3 algorithm adds a security feature recognition network to the YOLOv3 algorithm and uses the S-SPP module to widen the features. A feature transmission network is designed between the construction worker detection network and the safety feature recognition network to transmit features and improve the accuracy of the algorithm. Compared with the YOLOv3 algorithm, the Extended-YOLOv3 algorithm improves the AP by about 1.3%. Its detection effect is better and its range of application is broader, making it more suitable for real-time object detection at construction sites.
Acknowledgments
This study was supported by the National Science Foundation of China (51975130,51705340); Natural Science Foundation of Liaoning Province (20180550002); Key Research and Development Project of Liaoning Province (2017225016).
