Abstract
To improve the effectiveness of prevention and control tracking, this paper combines three-dimensional (3D) face recognition technology with a posture recognition method to build a prevention and control tracking system, and designs a single-stage human target detection and posture estimation network based on thermal infrared images. Infrared recognition allows the face and body shape to be recognized directly, which overcomes the susceptibility of traditional visual recognition methods to occlusion. The network completes human target detection and pose estimation simultaneously in a multi-task manner. A knowledge distillation strategy is used to train a lightweight model, which further reduces the number of model parameters and improves inference speed. The single-stage network is then used to evaluate the effect of prevention and control tracking. The tracking effect test shows that the proposed system based on 3D face recognition can effectively improve the prevention and control tracking effect.
Introduction
Safe City has been a key business system in e-government construction in recent years. The urban alarm and monitoring system is the core of safe city construction and an important part of the social security prevention and control system, and the construction of a safe city has become an important project for providing security and related services to citizens [1]. A safe city mainly uses modern information and communication technology to achieve unified command, rapid response, and coordinated operations and to improve the efficiency of public security work. This meets the needs of dynamic security management and timely, effective crime fighting under modern economic and social conditions, strengthens China’s urban security capabilities, and speeds up the construction of urban security systems [2]. The traditional mode of combating, preventing, and managing crime can hardly adapt to these needs. To adapt to the new situation, policing and crime prevention and control must be strengthened through science and technology, and the construction of urban video surveillance systems is an important means of doing so [3].
The biggest difference between GIS and other systems is its ability to store and analyze spatial data. The video data of video surveillance systems are mostly geospatial, and the surveillance systems that manage and maintain these data have complex spatial topological relationships. Shortest path search is a very important function in GIS spatial analysis [4], and shortest path search algorithms are being used more and more widely. In addition to traditional services such as GPS navigation and path finding on urban electronic maps, urban pipe network planning, communication systems, and safe city supervision systems have also introduced shortest path search technology [5]. The shortest path refers not only to the shortest distance in the general geographical sense but can also be extended to other metrics, such as time, cost, and line capacity. Correspondingly, the shortest path problem becomes the fastest path problem, the lowest cost problem, and so on. While a vehicle is driving, the route ahead of it must also be calculated in real time. Therefore, shortest path analysis should not only characterize the actual network but also respond fast enough [6]. In a GIS-based public security video surveillance system, when a crime occurs, the system can accurately determine the alarm point and the location of the crime scene, automatically notify the relevant police units around the scene, and quickly generate the shortest driving route according to the criminal’s escape direction. This improves the police’s rapid response capability and their overall command and combat capability in handling cases. Therefore, introducing spatial analysis, the core function of GIS, into video surveillance systems has become a necessary and inevitable trend [7].
The end result of the underlying data curation is a road network dataset with an advanced connectivity model capable of representing complex details such as multi-modal transportation networks, together with a rich model of network properties. This model helps to model network resistance, network constraints, and network layers. Such network datasets are constructed from simple features such as lines, points, and turns, and they provide a powerful management structure for creating, editing, and maintaining network data [8]. In this research, a network dataset is established from the roads at all levels and the camera locations in Kaifeng to maintain the connectivity of network elements and to build an appropriate network model, which lays a solid foundation for the subsequent optimal path analysis and the realization of the real-time tracking and monitoring module [9].
An anti-terrorism and anti-riot robot is a piece of high-tech equipment that integrates mechanical, electronic, sensor, computer, intelligent control, communication, signal processing, and robotics technologies [10]. As an intelligent platform, it can operate reliably for long periods in complex environments such as radiation sites, high-temperature environments, and rugged terrain, and it has clear advantages in tasks such as reconnaissance, surveillance, and danger elimination. When extended with an automatic weapon targeting system, it can attack terrorists directly and complete anti-terrorist tasks more effectively. It can also stay hidden for a long time and automatically snipe at terrorists from long range when they appear. In addition, its extended robotic arm grasping system can remove dangerous items when it enters the working environment, which spares personnel the danger of removing them directly [11]. The research and design of new anti-terrorism and anti-riot robots can not only help relevant personnel eliminate hidden safety hazards, deter terrorism, and create a safe social environment, but also improve national competitiveness and comprehensive strength in science and technology. Science and technology are the primary productive forces. In the future, competition between countries will mainly be based on scientific and technological strength, and future wars will be unmanned. Robots will be widely used in the military field [12], and national defense technology will become the decisive factor in the outcome of wars. Therefore, it is also necessary to develop intelligent anti-terrorism and anti-riot robots to lay a solid scientific and technological foundation for future military competition [13].
Video surveillance systems are developing faster and faster. In the beginning, most were single-view systems in which dedicated operators managed and viewed the surveillance video for each monitored area. With the rapid development of society in recent years, the requirements for monitoring systems have risen steadily, and large numbers of cameras are now used in all aspects of social life. Security, bank monitoring, campus and community security, transportation, and smart city construction all require a powerful monitoring system. The many cameras produce a large amount of data, and how to put these data to use for society is an urgent problem [14]. As computer processing speed has improved and the nature of social work has changed, people also expect more intelligence from computers. The current trend in monitoring systems is to complete complicated monitoring tasks intelligently over these massive data. Traditional video surveillance systems are mostly centralized processing platforms [15]. When multiple cameras work at the same time, cameras at different locations and with different viewing angles monitor only their own areas, and the video processing is usually done by a central system, which increases the network bandwidth requirements. Moreover, because there is no data exchange or fusion between cameras, much usable information is effectively lost, which results in low utilization and waste of information resources [16].
In security monitoring systems, dynamic facial recognition is characterized by the fact that the faces in the picture are always in a non-cooperative state, so the monitoring camera must respond in a timely manner. The most common device at present is the high-definition facial recognition monitoring system. It compares the faces within the field of view of the monitoring device with the facial data in the blacklist and whitelist databases of deployed personnel and determines whether the people in view are deployed or key personnel. During operation, the device actively collects video images and analyzes them automatically. Faces are compared dynamically in real time, and the alarm system is activated automatically when a specific target is detected. Personnel in the monitoring room or command center can then follow the person who triggered the alarm through the relevant control measures. In current applications, the blacklist or controlled-personnel database can hold hundreds of thousands of entries. Common facial recognition monitoring systems are mainly deployed in public places such as subways, airports, and train stations. Deploying dynamic facial recognition monitoring systems at the main entrances and exits of these key places effectively meets the security and prevention needs of public security departments, civil aviation, railways, and other organizations. Such systems can also be widely used in household safety and other areas to protect people’s lives and property.
Video surveillance is a common form of security system. Security systems have become digital and intelligent, their equipment is constantly being updated, and their functions are increasingly complete, which has been welcomed by the monitoring industry. However, in terms of application effect, security systems cannot yet confirm the physical objects in the monitored scene. People are the main objects in a monitored scene and the core of judging it. Alarms are triggered by monitoring people, and externally linked sensors are used to judge the physical objects in the scene. However, such systems cannot effectively identify changes in the activities of the people in the video.
To improve the real-time monitoring effect of public safety, a single-stage human object detection and pose estimation network based on thermal infrared images has been designed, which can simultaneously complete both human object detection and pose estimation tasks through multitasking. Due to the limited onboard computing resources of the tracking system, a knowledge distillation strategy is adopted to train a lightweight model, further reducing the number of model parameters to improve inference speed. Therefore, it is possible to achieve human target detection and tracking in both time and space, and effectively improve the system’s computational speed.
A real-time human target detection and pose estimation network based on thermal infrared images was designed for the prevention and control tracking system. The network can simultaneously obtain the two-dimensional coordinates of human targets and key points in the image through infrared images. On this basis, a knowledge distillation strategy is adopted to train a lightweight model, further improving the running speed on the model platform. To obtain the three-dimensional coordinates of human key points, a method for estimating the three-dimensional coordinates of human key points based on circular region sampling was designed. The three-dimensional coordinates of human key points were estimated using two-dimensional coordinates and depth images, and simulation experiments were conducted under laboratory conditions to verify the feasibility of the research content.
This paper combines three-dimensional face recognition technology with a posture recognition method to build a prevention and control tracking system and thereby improve the prevention and control tracking effect.
Posture recognition for prevention and control tracking personnel
Representation and coding method of human body in thermal infrared image
The method of this article can be summarized as shown in Fig. 1.
Given the characteristics of thermal infrared images, a method that uses key points to represent human targets and two-dimensional human pose in a unified way was designed. The target detection task and the two-dimensional human pose estimation task are unified into a single key point detection task. For this representation, a key point encoding method based on heat maps and associative embedding maps was designed.
Transforming two different types of tasks into the same or similar representations for encoding helps to improve the reuse rate of the network model, thereby reducing the number of model parameters and improving running speed. Using a unified representation also lets the two tasks reinforce each other and improves the accuracy of both. When encoding bounding boxes, the object detection problem is transformed into a key point estimation problem, so that human targets and key points share a unified representation. Specifically, for datasets with K-class human keypoints, the problem will be transformed into a keypoint estimation problem with K
The image to be detected may contain multiple human bodies, and each human body instance can be composed of a bounding box and a set of human pose information.
Human targets are usually represented by a rectangular frame, as shown in Fig. 1. The four sides of the rectangular border completely enclose all pixels in the image that belong to the same body. For a human target, it can be expressed as
The head of the human body is represented by 2 key points. The parts represented by these key points and their connections are shown in Fig. 2a.
Representation method of human target and human pose.
For a dataset with K classes of human key points, the problem is transformed into a key point estimation problem with K
When constructing the training set, the whole training set contains the annotation information of images and human targets and postures. For the training set
Among them,
The function of the key point heat map is to roughly locate each key point, which is usually generated by a two-dimensional Gaussian kernel function. For the key point
Among them, (
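As a rough illustration of the heat-map encoding described above, the following sketch renders one key point onto a low-resolution grid with an unnormalized two-dimensional Gaussian kernel. The function name `gaussian_heatmap`, the grid size, and the standard deviation `sigma` are illustrative assumptions and are not values taken from this paper.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=1.5):
    """Return a height x width grid with a Gaussian bump centred on (cx, cy).

    The kernel is left unnormalized so that the peak value is exactly 1.0.
    sigma is a hypothetical setting chosen only for this illustration.
    """
    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            squared_distance = (x - cx) ** 2 + (y - cy) ** 2
            row.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap


# One key point at column 3, row 2 of an 8 x 6 grid.
hm = gaussian_heatmap(8, 6, cx=3, cy=2)
peak_row = max(range(6), key=lambda r: max(hm[r]))
peak_col = max(range(8), key=lambda c: hm[peak_row][c])
print(peak_col, peak_row)
```

The response decays smoothly with distance from the key point, which is why such a map only locates the key point roughly and a separate compensation step is needed for precise positioning.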
The function of the accuracy compensation map is to compensate the coarse positioning result to achieve the effect of precise positioning. Because the model downsamples the image during inference, the size of the model’s predictions often does not match the size of the original image. After the model downsamples the image, the transformation process of the key point g can be expressed as:
Among them,
The inference result
Since there may be a non-divisible relationship between the original coordinates (
The grid is the low-resolution pixel coordinate system after downsampling. The red circle is the key point is the coordinate
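The role of the accuracy compensation map can be illustrated with a small numeric sketch. It assumes a downsampling factor `stride` of 4 and floor rounding to the low-resolution grid; both are assumptions made for this example, and the function names are hypothetical.

```python
import math


def encode_keypoint(x, y, stride=4):
    """Split an original-image coordinate into a grid cell and a fractional offset."""
    cell_x = math.floor(x / stride)
    cell_y = math.floor(y / stride)
    # The offset is the part of the coordinate lost by rounding down to the grid.
    offset_x = x / stride - cell_x
    offset_y = y / stride - cell_y
    return (cell_x, cell_y), (offset_x, offset_y)


def decode_keypoint(cell, offset, stride=4):
    """Recover the original-image coordinate from a grid cell and its offset."""
    return ((cell[0] + offset[0]) * stride, (cell[1] + offset[1]) * stride)


cell, offset = encode_keypoint(101, 58)
print(cell, offset)                      # grid cell and the compensation it needs
print(decode_keypoint(cell, offset))     # back in original-image coordinates
print(decode_keypoint(cell, (0.0, 0.0)))  # without compensation: quantization error
```

Without the offset, the decoded point can be wrong by up to `stride - 1` pixels per axis, which is the error that the compensation map is meant to remove.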
The working method of the network in this paper is shown in Fig. 3. The network is mainly divided into two important parts: the backbone network and the head network. The backbone network is constructed with an improved hourglass network, and the head network is designed based on multi-task learning. According to the characteristics of this network, this section designs a loss function for multi-task learning. In the model training process, the knowledge distillation strategy is used to train the lightweight model to improve the running speed of the model, which is used for deployment in the UAV platform with limited computing resources.
Schematic diagram of the working method of the network proposed in this paper.
Each hourglass module is divided into multiple stages to process feature maps of different scales, and the N-order hourglass module is denoted by
Hourglass module structure of different orders.
The designed head network module is a single-input multi-output structure, which is mainly divided into two branches: the human body posture branch and the target detection branch. The human body posture branch is responsible for the human key point estimation task, and the target detection branch is responsible for detecting the human target bounding box.
Loss function residual network structure.
The residual module is an important component of the hourglass module, and its effectiveness directly affects the performance of the entire network. Figure 5a shows the SandGlass module, which not only uses depthwise separable convolution to reduce computational complexity, but also adjusts the order of convolutions in the inverted residual module to improve module performance. After inputting the feature map into the SandGlass module, features are first extracted through a 3
The structure of the human target detection branch is shown in Fig. 5.
Human object detection of head network module.
As shown in Fig. 6, the branch has three sub-branches that handle the three bounding-box key points: the upper-left corner, the lower-right corner, and the center point. The structure of the output module is the same as that of the human body posture estimation branch. The sub-branches responsible for the upper-left and lower-right corners of the box have a top-left corner pooling module and a bottom-right corner pooling module, respectively. The left part of Fig. 6 shows the structure of the top-left corner pooling module. For the feature map F of the output module, first, the feature maps
Among them,
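For orientation, the sketch below shows top-left corner pooling as it is commonly defined in the corner-based detection literature: each position takes the maximum of everything below it in one feature map and the maximum of everything to its right in another, and the two results are added. Whether the module in this paper matches this definition in every detail is an assumption, and the names `f_vert` and `f_horiz` are hypothetical.

```python
def top_left_corner_pool(f_vert, f_horiz):
    """Top-left corner pooling on two equally sized 2D feature maps (lists of rows)."""
    height, width = len(f_vert), len(f_vert[0])

    # Vertical pass: running maximum from the bottom row upward, so each
    # cell holds the maximum of itself and all cells below it.
    vert = [row[:] for row in f_vert]
    for i in range(height - 2, -1, -1):
        for j in range(width):
            vert[i][j] = max(vert[i][j], vert[i + 1][j])

    # Horizontal pass: running maximum from the right column leftward, so each
    # cell holds the maximum of itself and all cells to its right.
    horiz = [row[:] for row in f_horiz]
    for i in range(height):
        for j in range(width - 2, -1, -1):
            horiz[i][j] = max(horiz[i][j], horiz[i][j + 1])

    return [[vert[i][j] + horiz[i][j] for j in range(width)] for i in range(height)]


features = [
    [0, 0, 0],
    [0, 0, 2],
    [0, 3, 0],
]
print(top_left_corner_pool(features, features))
```

The strongest response appears where a strong value lies both below and to the right, which is the geometric cue for the upper-left corner of a box. Bottom-right corner pooling would scan in the opposite directions.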
The designed model can detect the human target in the image and the two-dimensional posture of the human body at the same time, which belongs to the multi-task learning method. Therefore, a multi-task loss function is required to guide the model optimization during model training. The multi-task loss function designed in this section is:
The loss function is mainly composed of 4 parts, where
(1) Loss function of heat map of human key points
(2) Loss function of frame key point heatmap
Among them,
(3) Loss function of coordinate compensation
Equation (12) is the calculation method of the coordinate compensation loss
(4) Loss function of grouping
The backbone network of the teacher model in this paper consists of 6 hourglass modules. Since every hourglass module has the same structure, each one can be connected to a complete head network module for output. The outputs of the intermediate head network modules differ from the final output of the model and only play a supervisory role during training. During deployment and inference, the intermediate head network modules can be pruned, and only the last head network module is retained as the final output. As data propagate toward the back end of the model, the locations of the key points become more and more accurate. Therefore, when training the teacher model, a head network module is attached to each hourglass module, and the network is optimized with a coarse-to-fine multi-branch supervision method, which helps to improve the effect of each stage of the network. Based on this supervision method, the loss function
Among them,
The task of the knowledge distillation stage is to train the student model. The schematic diagram of this stage is shown in Fig. 7. The backbone network of the student model is constructed with two hourglass modules. The loss function used in the knowledge distillation stage is different from the loss function used in the separate training model. It is calculated by the following formula:
The loss function is divided into two parts, one part is the loss value
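A two-part distillation loss of this kind is often written as a weighted sum of a term that compares the student output with the ground truth and a term that compares it with the teacher output. The sketch below follows that generic form. The mean squared error used for each term, the weight `alpha`, and the function names are assumptions for illustration and are not the loss terms defined in this paper.

```python
def mean_squared_error(prediction, target):
    """Mean squared error between two equally long sequences of numbers."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(prediction)


def distillation_loss(student_out, teacher_out, ground_truth, alpha=0.5):
    """Weighted sum of a ground-truth term and a teacher-imitation term.

    alpha is a hypothetical balancing weight: 1.0 ignores the teacher,
    0.0 ignores the annotations.
    """
    hard_term = mean_squared_error(student_out, ground_truth)
    soft_term = mean_squared_error(student_out, teacher_out)
    return alpha * hard_term + (1.0 - alpha) * soft_term


student = [0.0, 1.0]
teacher = [0.0, 0.0]
labels = [0.0, 1.0]
print(distillation_loss(student, teacher, labels))
```

In this toy case the student matches the labels exactly, so the whole loss comes from its disagreement with the teacher.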
Schematic diagram of the knowledge distillation method.
Schematic diagram of direct fusion of thermal infrared image and depth image.
As shown in Fig. 8, the resolution of thermal infrared image is 640
Therefore, to obtain corresponding pixels in the infrared image and the depth image, the two images need to be registered. The purpose of registration is to construct a transformation matrix
Among them,
Similarly, the internal parameter matrix of the depth camera is
The transformation matrix from the depth camera coordinate system to the infrared camera coordinate system is M, then the transformation relationship between the two can be expressed as:
Combining Eqs (20)–(23), we can get the relation:
Since the infrared camera and the depth camera are fixed together and the distance between them is very close,
Through the coordinate system conversion relationship, the conversion relationship from the depth camera coordinate system to the infrared camera coordinate system can be obtained as follows:
Therefore, the coordinate mapping relationship from the depth image to the thermal infrared image can be expressed as:
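The registration chain described above (back-projection with the depth camera intrinsics, a rigid transform between the camera coordinate systems, and projection with the infrared camera intrinsics) can be sketched with the standard pinhole camera model. All numeric values, the tuple layout `(fx, fy, cx, cy)`, and the function name are hypothetical; real values would come from calibration.

```python
def depth_pixel_to_infrared(u, v, z, depth_intr, ir_intr, rotation, translation):
    """Map a depth-image pixel (u, v) with depth z to infrared-image coordinates."""
    fx_d, fy_d, cx_d, cy_d = depth_intr
    fx_i, fy_i, cx_i, cy_i = ir_intr

    # 1. Back-project the pixel into the depth camera coordinate system.
    point = ((u - cx_d) * z / fx_d, (v - cy_d) * z / fy_d, z)

    # 2. Apply the rigid transform (rotation, then translation) into the
    #    infrared camera coordinate system.
    moved = [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]

    # 3. Project onto the infrared image plane.
    return (fx_i * moved[0] / moved[2] + cx_i, fy_i * moved[1] / moved[2] + cy_i)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # assumed fx, fy, cx, cy

# A small horizontal baseline shifts the pixel by fx * baseline / depth.
print(depth_pixel_to_infrared(400, 300, 2.0, intrinsics, intrinsics,
                              identity, (0.1, 0.0, 0.0)))
```

The sketch makes the depth dependence of the shift visible: for a fixed baseline, nearby points move more between the two images than distant ones.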
A set of two-dimensional human key points in the image is
For the point set
The closest point to point
The first
The points that do not meet this condition are replaced by
By performing the above operations on all the two-dimensional key points
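The idea of sampling depth values in a circular region around a two-dimensional key point can be sketched as follows. This is a simplified stand-in and not the procedure of this paper: it assumes that a depth value of 0 marks an invalid pixel, uses the median of the valid samples as the robust depth estimate, and back-projects with an assumed pinhole model. The radius, the intrinsics, and the function names are hypothetical.

```python
import statistics


def sample_depth_in_circle(depth, u, v, radius=3):
    """Median of the valid depth values within `radius` pixels of (u, v).

    Returns None when the circle contains no valid (non-zero) depth value.
    """
    height, width = len(depth), len(depth[0])
    samples = []
    for y in range(max(0, v - radius), min(height, v + radius + 1)):
        for x in range(max(0, u - radius), min(width, u + radius + 1)):
            inside = (x - u) ** 2 + (y - v) ** 2 <= radius ** 2
            if inside and depth[y][x] > 0:
                samples.append(depth[y][x])
    return statistics.median(samples) if samples else None


def keypoint_to_3d(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z into camera coordinates."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


# 5 x 5 depth patch: the centre pixel is invalid and one neighbour is an outlier.
depth = [[2.0] * 5 for _ in range(5)]
depth[2][2] = 0.0
depth[2][1] = 9.0
z = sample_depth_in_circle(depth, 2, 2, radius=1)
print(z, keypoint_to_3d(320, 240, z, 500.0, 500.0, 320.0, 240.0))
```

Sampling a neighbourhood instead of reading a single pixel keeps the estimate usable when the depth value exactly at the key point is missing or noisy.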
Infrared facial images mainly depict the temperature distribution of the face: the skin temperature near blood vessels is slightly higher than the skin temperature far from them. Different faces have different temperature distributions. Infrared thermal imagers have high temperature sensitivity and can clearly capture this distribution. Infrared recognition is not affected by facial camouflage: as long as the distribution of blood vessels in the face remains unchanged, the infrared facial image does not change.
When visible-light facial recognition relies on a single visible-light camera, a photo of the target person may be enough to achieve successful recognition. Infrared imaging is therefore introduced, because photos can be easily distinguished from real faces owing to the difference between their thermal radiation patterns.
The single-stage human object detection and pose estimation network based on thermal infrared images completes both the human object detection task and the pose estimation task simultaneously through multi-task learning. A knowledge distillation strategy is used to train a lightweight model, which further reduces the number of model parameters and improves inference speed.
To improve the prevention and control tracking effect of 3D face recognition, this paper constructs the face capture process for prevention and control shown in Fig. 9a. Face recognition technology generally uses a face feature selection algorithm based on statistical learning: it learns from massive portrait data and, while performing feature selection, uses a test set to measure the impact of the selected features on portrait comparison. The best portrait features are then selected automatically from a large number of candidates according to the test results. Automatic optimal feature selection reduces the dimension of the feature vector used to describe portrait information, greatly improves the speed of portrait comparison, and makes the algorithm suitable for massive portrait comparison scenarios. The specific implementation process is shown in Fig. 9b.
Prevention and control tracking system based on 3D face recognition.
As shown in Fig. 9b, facial images are captured by monitoring hardware arranged at various locations. A human model is constructed using the infrared imaging method and encoding method proposed in this paper. The collected facial images are compared with the standard data and blacklist databases, and the resulting value is compared with the standard threshold. If the standard threshold is exceeded, this may indicate that a target affecting public safety has been discovered, and an alarm device is triggered to notify security personnel to take appropriate measures.
The system is still limited in autonomously crossing narrow passages in arbitrary directions. Because of the limited viewing angle of the visual sensor, a full-view image around the rotor cannot be obtained, so the rotor cannot pass through the narrow passages of irregularly shaped buildings. To address this issue, the rotor system can be equipped with visual sensors and monitoring devices covering more angles so that it obtains a sufficiently large sensory field of view. Detection and control strategies for narrow passages in different orientations can then be studied further to enhance the detection and crossing capabilities of the rotor system in more complex narrow passages.
The experimental environment for this article is as follows:
This system uses a PIH-301 gimbal with the following motion parameters: horizontal rotatable angle of 350°, vertical rotatable angle of
For video surveillance modules, choosing an appropriate development tool greatly reduces the burden on developers. Developers commonly use three video capture development tools: VFW, the SDK library functions provided by the acquisition card, and DirectShow.
The depth sensor used is the Intel RealSense L515, which captures depth information through laser scanning and is suitable for low-light scenarios. In addition, a gimbal device was designed and built that controls the pitch angle of the two cameras and extends their visual perception range.
In order to test the effectiveness of the human target detection and pose estimation network designed in this paper, a dataset of thermal infrared human body was independently constructed. The algorithm proposed in this paper was compared with existing two-stage methods in experiments, and the results were analyzed.
Model construction and training are carried out in the PyTorch deep learning framework. Before training, the data are partitioned: 80% of the dataset is randomly selected as the training set and the remaining 20% as the testing set. Adam is used as the optimizer with an initial learning rate of 0.0015, and the learning rate is reduced to 0.00015 and 0.000015 at the 400th and 520th training rounds, respectively.
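The data split and the stepwise learning rate schedule described above can be sketched in plain Python, independent of any deep learning framework. The learning rates and round numbers are those stated in the text. The random seed, the function names, and the convention that the lower rate takes effect exactly at round 400 and round 520 are assumptions.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Randomly partition items into a training list and a testing list."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumed) for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def learning_rate(training_round):
    """Stepwise schedule: 0.0015, then 0.00015 from round 400, then 0.000015 from round 520."""
    if training_round < 400:
        return 0.0015
    if training_round < 520:
        return 0.00015
    return 0.000015


train_set, test_set = split_dataset(range(100))
print(len(train_set), len(test_set))
print(learning_rate(0), learning_rate(400), learning_rate(520))
```

In PyTorch, a schedule of this shape is usually expressed with a step-based learning rate scheduler attached to the optimizer rather than a hand-written function.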
The effect of the prevention and control tracking system based on 3D face recognition
Before the experiment, the infrared camera and the depth camera are calibrated using the binocular camera calibration tool in Matlab, which computes the internal parameters and the relative position relationship of the two cameras simultaneously. Because no public dataset of thermal infrared images and depth images is available, the three-dimensional pose estimation experiments that verify the three-dimensional coordinate estimation of human key points use self-collected thermal infrared and depth image data. The three-dimensional coordinates of some human key points were calibrated based on the two-dimensional key point coordinates of the human body in the depth map.
The effect of the proposed prevention and control tracking system based on three-dimensional face recognition is verified by collecting statistics on its practical effect over multiple groups of face recognition results. Using expert evaluation and a percentage-based scoring system, the evaluation results shown in Table 1 are obtained.
From Table 1, it can be seen that the expert evaluation of the prevention and control tracking effect on 3D facial recognition is above 80 points, and the recognition effect is relatively obvious.
Using 20% randomly partitioned images from a self built dataset as the test set, and comparing it with existing two-stage (YOLOv3
Comparison of method performance
Compared with the two-stage method YOLOv3
The prevention and control tracking effect test reported in Table 1 shows that the proposed prevention and control tracking system based on 3D face recognition can effectively improve the prevention and control tracking effect.
The system in this article adopts an embedded polyp construction method, which can promote the scalability of the system and apply the method in multiple fields, facilitating the further promotion and use of the method in this article.
To crack down effectively on crime, cut off the escape routes of criminal suspects after an incident, and significantly reduce the number of wanted fugitives across the country, the Ministry of Public Security attaches great importance to the arrest of fugitives by local public security organs at all levels. The measures used include information comparison, monitoring of calls on the fugitives’ communication tools, work with the fugitives’ social relations and family members, psychological offensives against the fugitives, and increased publicity. By using infrared recognition methods, the face and human body shape can be recognized directly, which overcomes the problem of traditional visual recognition methods being easily occluded. Controlling and cracking down on crime is the main target of the public security prevention and control system, while dealing with illegal behavior is secondary; the system also assists the public security organs in catching criminal suspects at large. This paper combines three-dimensional face recognition technology with a posture recognition method to build a prevention and control tracking system. The prevention and control tracking effect test shows that the proposed system based on 3D face recognition can effectively improve the effect of prevention and control tracking.
The main method of this article is to use infrared recognition for facial and human pose recognition, overcoming the problem of traditional visual schemes being unable to cope with occlusions. The system in this article is greatly affected by temperature during application. If there is interference similar to human body temperature, it will affect the recognition effect. Therefore, subsequent research will combine infrared recognition and visual recognition technology to further improve the adaptability of the method in this article.
